.’ .‘.O.o.‘. .00.... n. I. I . x p. . I z. .5 It! fiftew ...; 42.... .m. ... 1. .. -fltIvt -. . . . . .05 .. j . . .u 2.0. 1‘. 13.10.. .3 q 1:. .... V . I an. «fir ‘0 “-0.0 flak-”Q”; r ”...!“C‘d... aw i 7.0 ..84v09...8"0....ou‘..‘..oo.2..=..t£. 12:23:01.1: :...1 . V a... :38. . I 5.. C .v. I 1... 20.0.30! 0. ‘00. ,{0 I..e‘.c ...-oh 5.00[cor-3.06........n.....!.-..Iv 00c...'130.o0!?!. {..III1Y..1.J.J 1 g #3012!" 03.... ’J’ .0 . 9.. L... I .K.“Q~‘.—.~8 h’l‘ £06.... ‘2.”‘t:!:..1370‘ .X.. 2.5.3.: 8.1.9.2». .u..:¢.1&....l....E....l\...!vo.2£l.. . lit. :30... no . a. . C I 09“”? .09 3.. (a .7. I .s t 50m. ...... 0...... t.l13..!'l. 40.“...0. n. ...-1......“0... 4.1..H0n...u.2.,..¢. co... ....) . .. ...~.o .. 0...... ...... 4.. . .. .VO. .. . ......o01.|..oo....om ... 0,90. . ......L... .01... . . .14,o.o..c.0¢§00 .. . . . . .0... .. . a... _ cc .....-n90.f0....9 l . .0... . 0‘ 0 ...o .... . . . . .r Q ...;3. . 0.0. .0.- .. 01-..\. . 4.. 2 .. ...‘2. . . ....»o... . -0: . o... ....0 . ..- .. . v. 0:: . . I... . . .s-o..- .l..v...-. o. . 0..v...p . . . In .. ........p.o..o o s ..V v.--. . .90. ‘0... . .... ..... . . .. .....- .. 25.... ~.. .. . V... o. o ...... u. . ... t. . . 9..0u....o........v... .. .O.| 9... l... ... . .......n..o.-...... .s..... o .... .. .....Ia o.vo¢ ......u.? I... .. ... O . . 0 1... ... .. . ._ O. . ..V ... ... 1...... ._ . A. l... ...-O \ .. .. \‘ ... . o u V 30... .. ... ......o~...1. .- 0.. .. .A. . .. a. . a. ...5 ... ......0 " u |.I - .... ...l. 2...... .. . ..... V ...-... u . .... ....... ........... ... .... ... . .... c . .oo. ... .. . . ti 3' .o .. .. .... ...... .0: .a. ... .u .. . 0 .o . ..~. 0 1.... . .0 u. _. .... n. . .?.. ...... do. .0 a .. .... ...... u. c . ... ... I.\ at. ... . . : p . A . .. . ..0 ..a.» . .1. . no. . u . i ....A o ... A...... .... . . .....02 V n . ~.. .........A. . . A. .. ....0...,Q;n" .30.. u.. ......Ocio- 0...... .ut.b.i\"00»tz , o . . . . ~n. . ...... o .....-.0..... ‘4...- ... ...b...0-....A... .... It. . .0....09 ....I....~O.OI0...C cl. .Ju 0.00 . . < . . .-. . ..V0.l.. O .. ... n. .. I . ...-0.. ..~‘ ~ ....loo.o ‘.oco~.q (I000 on. . n ..o A .. . . o u o. o n 0.. o. o .....-..u .0 O... InOVOo . u a. . o‘-. 60 o a . a .0 \I . o . . . - . a .u u v. I . o n t. 0 n . . .10 I n. . I . . a- . . n. u. o- . no. 0 0 0| 0. .00 3 I. .0 9: on. nu .....-o\ ....n. . ..o' ..OOOOI.. .... .0 .. . ... .u.. .0. 9 ‘l. ... -.. V 0 .I . .‘0 ...I“ 0'... . .$$O.K v." .. “.vth., [at flirt. v6n00‘fil1. I. .10 c 1.... | 0 \ 0 0h . I 0V . g t- . .3 I . a V o . . , y x Q. 1.". . ‘Wo ‘.~$ "O‘I‘QJ ‘d‘...o’" n l h“fl0‘.“up.hg'§‘hf . .l’ -A Us}..- 00.10 .‘ .... . 0J0; 0s. . v. . - ‘ o. .. . o '. I I t v o . 0 v . .... . ......n ... 3.321 . e:..2...§...si. .-.}...s... ......i; ....-.2."...13........fiu.....,..v..x.ufi....?......am - L. 3 .3........,......, 5...}-.. .5. . . . .. .. . in! » .. .1 ...... .....Ictal .... voic5itatiiotddw. “1.0...uwsvuo...$pt.?1.35332}: ‘01. ...; .....20: ..s..2.v$m..lltiv...91’.‘.lt.... ..Qc ... 50:310.... ......9...9 .. 21 .. ...... It... ... 39hr. . .. .. . ... . .V V . u .. Y: . .. u... . ... .. .. . .4 . o. n . ..v 1.. .... i... . .... .. ..- .I ._ r0330. :‘l..2ra...:. -.¢.:.|.0300 1403......Ill52982....32I1.1I-‘. ’33....310... 08:..041II....1!: ..d 1;. ......(311... ’; .. ......x; v... . . . . p . . . . . o - ...- . 3... v. o . . ... ...... 2.. ; :0. . ......9... . ..l.......:n!.t.. .0::0...L.03:...Q\‘..’..v.._o2¢tltft 00 €.!.t..t&31..a..?..‘9o I0... ‘326 .‘c’fi‘.’l!0!!...o -..00 ...... . 00....‘0005 . J . . . _ .... .. .... . 
.. ... ... .o .. .92....2 00......2ooo‘... .v!!....!..~1000r. ....1’. r§.!0:....vvl.‘.. .V!..3.E.OV 9.0... 0. .99....t.0..0x.~9 06-60109. . . . . ~ .. ... c...-.ul...oo« .... olbl..- .. .9....,..0 . ... . on .I IO.I0'u0\.00 . c v. . .0 .D 9. a. .l nulu00.v. I 1‘...I5..o..vt. to . O _ .. . || .l '| {I‘ll I' . In" | || ‘1‘ -.1 0-. .0093... O O .. ....‘u. LIBRARY I Michig?m State University This is to certify that the dissertation entitled NATURAL LANGUAGE INFERENCE: FROM TEXTUAL ENTAILMENT TO CONVERSATION ENTAILMENT presented by CHEN ZHANG has been accepted towards fulfillment of the requirements for the Ph.D. degree in Comjuter Science M516? Frofe’sfsofisjsrig’nature Ey}é /&o l (7 Date MSU is an Affirmative Action/Equal Opportunity Employer PLACE IN RETURN BOX to remove this checkout from your record. To AVOID FINES return on or before date due. MAY BE RECALLED with earlier due date if requested. DATE DUE DATE DUE DATE DUE 5’08 IQIProllAchreaICIRClDltoDuaindd NATURAL LANGUAGE INFERENCE: FROM TEXTUAL ENTAILMENT TO CONVERSATION ENTAILMENT By Chen Zhang A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Computer Science 2010 ABSTRACT NATURAL LANGUAGE IN FERENCE: FROM TEXTUAL ENTAILMENT TO CONVERSATION ENTAILMENT By Chen Zhang Automatic inference from natural language is a critical yet challenging problem for many language-related applications. To improve the ability of natural language in- ference for computer systems, recent years have seen an increasing research effort on textual entailment. Given a piece of text and a hypothesis statement, the task of textual entailment is to predict whether the hypothesis can be inferred from the text. .T he studies on textual entailment have mainly focused on automated inference from archived news articles. As more data on human-human conversations become available, it is desirable for computer systems to automatically infer information from conversations, for example, knowledge about their participants. However, unlike news articles, conversations have many unique features, such as turn—taking, grounding, unique linguistic phenomena, and conversation implicature. As a result, the tech- niques developed for textual entailment are potentially insufficient for making infer- ence from conversations. To address this problem, this thesis conducts an initial study to investigate conver- sation entailment: given a segment of conversation script, and a hypothesis statement, the goal is to predict whether the hypothesis can be inferred from the conversation segment. In this investigation, we first developed an approach based on dependency structures. This approach achieved 60.8% accuracy on textual entailment, based on the testing data of PASCAL RTE-3 Challenge. However, when applied to conversa- tion entailment, it achieved an accuracy of 53.1%. To improve its performance on conversation entailment, we extended our models by incorporating additional linguis- tic features from conversation utterances and structural features from conversation discourse. Our enhanced models result in a prediction accuracy of 58.7% on the testing data, significantly above the baseline performance (p < 0.05). This thesis provides detailed descriptions about semantic representations, compu— tational models, and their evaluations on conversation entailment. ACKNOWLEDGMENT My acknowledgments to Dr. Joyce Chai, my advisor. 
You lead me in the field of research for all these many years, you gave me so many advises, and you worked so much with me on this thesis. Acknowledgments to my guidance committee, Dr. John Hale, Dr. Rong Jin, and Dr. Pang-Ning Tan. Thanks for your valuable comments and directions. They greatly helped this thesis. To my fellow workers, Matthew Gerber, Tyler Baldwin, Zahar Prasov, Shaolin Qu, and Changsong Liu. You shared your ideas and knowledge. They are very important to this work. To Marie Lazar, Timothy Aubel, Sarah Deighan, Jeff Winship and many others. Your contributions to the data collection and annotation are very much appreciated. They made this work possible. To mom and dad. You are always with me. Thank you, Jean. iv TABLE OF CONTENTS List of Tables ................................. List of Figures ................................ Introduction 1.1 Research Objectives and Overview ................... 1.2 Outline ................................... Related Work 2.1 Textual Entailment ............................ 2.1.1 Logic-based Approaches ..................... 2.1.2 Graph-based Approaches ..................... 2.1.3 Comparing Logic-based and Graph-based Approaches ..... 2.1.4 Performance Analysis ....................... 2.2 Studies on Conversation Scripts ..................... 2.2.1 Recognition of Conversation Structures ............. Dialogue Acts ........................... Opinion Frames .......................... 2.2.2 High Level Applications ..................... Latent Biographic Attributes .................. Social Networks and Biographical Facts . . . .. ......... Agreements and Disagreements ................. Meeting Summarization ..................... Predicting Success in Task-oriented Dialogues ......... A Dependency Approach to Textual Entailment 3.1 A Framework of the Dependency Approach ............... 3.1.1 Representation .......................... Syntactic Decomposition ..................... 3.1.2 The Alignment Model ...................... 3.1.3 The Inference Model ..... . .................. 3.2 Learning the Entailment Models ..................... 3.2.1 Learning the Alignment Model ................. 3.2.2 Learning the Inference Model .................. 3.3 Feature Design .............................. 3.3.1 Features for the Alignment Model ................ Features for Noun Term Alignment ............... Features for Verb Term Alignment ............... V viii An Example of Feature Estimation for Verb Alignment . . . . 45 3.3.2 Features for the Inference Model ................. 46 Features for Property Inference Model ............. 46 Features for Relational Inference Model ............. 47 An Example of Feature Estimation in Inference Model . . . . 48 3.4 Post Processing .............................. 48 3.4.1 Polarity Check .......................... 49 3.4.2 Monotonicity Check ....................... 50 3.5 Experimental Results ........................... 53 3.5.1 Alignment Results ........................ 53 3.5.2 Entailment Results ........................ 55 An Initial Investigation on Conversation Entailment 56 4.1 Problem Formulation ........................... 56 4.2 Types of Inference from Conversations ................. 57 4.3 Data Preparation ............................. 59 4.3.1 Conversation Corpus ....................... 60 4.3.2 Data Annotation ......................... 60 4.3.3 Data Statistics .......................... 61 4.4 Experimental Results ........................... 65 4.4.1 Experiment Setup ......................... 
65 4.4.2 Results on Verb Alignment ........ ‘ ............ 66 4.4.3 Verb Alignment for Different Types of Hypotheses ....... 67 4.4.4 Results on Entailment Prediction ................ 68 Incorporating Dialogue Features in Conversation Entailment 70 5.1 Linguistic Features in Conversation Utterances ............. 70 5.1. 1 Disfluency ............................. 71 5.1.2 Syntactic Variation ........................ 73 5.1.3 Special Usage of Language .................... 75 5.2 Modeling Linguistic Features in Conversation Utterances ....... 77 5.2.1 Modeling Disfluency ....................... 78 5.2.2 Modeling Polarity ......................... 78 5.2.3 Modeling Non—monotonic Context ................ 80 5.2.4 Evaluation ............................. 81 Evaluation on Verb Alignment .................. 81 Evaluation on Entailment Prediction .............. 83 5.3 Features of Conversation Structure ................... 86 5.4 Modeling Structural Features of Conversations ............. 89 5.4.1 Modeling Conversation Structure in Clause Representation . . 89 5.4.2 Modeling Conversation Structure in Alignment Model ..... 93 5.4.3 Evaluation ............................. 95 Evaluation on Verb Alignment .................. 95 Evaluation on Entailment Prediction .............. 97 vi A 6 Enhanced Models for Conversation Entailment 101 6.1 Modeling Long Distance Relationship .................. 103 6.1.1 Implicit Modeling of Long Distance Relationship ........ 103 6.1.2 Explicit Modeling of Long Distance Relationship ........ 104 6.2 Modeling Long Distance Relationship in the Alignment Model . . . . 105 6.2.1 Implicit Modeling of Long Distance Relationship in the Verb Alignment Model ......................... 106 6.2.2 Explicit Modeling of Long Distance Relationship in the Verb Alignment Model ......................... 107 6.2.3 Evaluation of LDR Modelings in Alignment Models ...... 108 6.3 Modeling Long Distance Relationship in the Inference Model ..... 109 6.3.1 Implicit Modeling of Long Distance Relationship in the Rela- tional Inference Model ...................... 111 6.3.2 Explicit Modeling of Long Distance Relationship in the Rela- tional Inference Model ...................... 112 6.3.3 Evaluation of LDR Modelings in Inference Models ....... 113 6.4 Interaction of Entailment Components ................. 116 6.4.1 The Effect of Conversation Representations .......... 117 6.4.2 The Effect of Alignment Models ................. 119 7 Discussions 123 7.1 Cross-validation .................. _ ............ 123 7.2 Semantics ................................. 125 7.3 Pragmatics ................................ 128 7.3.1 Ellipsis ............................... 128 7.3.2 Pronoun Usage .......................... 129 7.3.3 Conversation Implicature ..................... 130 7.4 Knowledge ................................. 130 7.4.1 Paraphrase ............................ 131 7.4.2 World Knowledge ......................... 131 7.5 Eficiency ................................. 132 8 Conclusion and Future Work 135 8.1 Contributions ............................... 135 8.2 Future Work ................................ 137 Data ............................. 137 Semantics and Pragmatics ................. 137 Applications ......................... 137 Appendices 139 A Syntactic Decomposition Rules 140 B List of Dialogue Acts 150 1.1 3.1 3.2 3.3 4.1 4.2 4.3 .5.1 A.1 B.1 B.2 LIST OF TABLES Examples of text—hypothesis pairs for textual entailment ....... 
Calculating the features of inference model for the example in Figure 3.2 48 The list of negative modifiers used for polarity check ......... The list of non-monotonic contexts ................... Examples of premise-hypothesis pairs for conversation entailment Distribution of hypothesis types ......... ' ............ The split of development and test data for conversation entailment . . The expanded set of negative words used for polarity check ...... Rules for syntactic decomposition .................... The dialogue act labels used by Switchboard annotation system The dialogue acts used in this thesis .................. 50 58 64 66 79 142 150 2.1 2.2 3.1 3.2 3.3 3.4 4.1 4.2 4.3 4.4 5.1 5.2 5.3 5.4 LIST OF FIGURES An example for dependency graph .................... 12 An example for graph matching ..................... 13 An example of syntactic decomposition ................. 32 The decomposition of a premise-hypothesis pair ............ 34 An alignment for the example in Figure 3.2 ............... 36 Evaluation results of verb alignment for textual entailment ...... 54 Agreement histogram of entailment judgements ........ , . . . 62 Evaluation results of verb alignment using the model trained from text data .................................... 67 Evaluation results of verb alignment for different types of hypotheses 68 Evaluation results of entailment prediction using models trained from text data .................................. 69 Evaluation of verb alignment for system modeling linguistic features in conversation utterances ........................ 82 Evaluation of verb alignment by different hypothesis types for system modeling linguistic features in conversation utterances ......... 84 Evaluation of entailment prediction for system modeling linguistic fea- tures in conversation utterances ..................... 85 An example of dependency structure and clause representation of con- versation utterances ............................ 90 5.5 5.6 5.7 5.8 5.9 5.10 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 7.1 7.2 The conversation structure and augmented representation for the ex- ample in Figure 5.4 ............................ An alignment for the example in Figure 5.5 ............... Evaluation of verb alignment for system modeling conversation struc- ture features ................................ Evaluation of verb alignment by different hypothesis types for system modeling conversation structure features ................ Evaluation of entailment prediction for system modeling conversation structure features ............................. An example of measuring the relationship between two terms by their distance .................................. A copy of Figure 5.5: the structural representation of a conversation segment and the corresponding hypothesis ............... Evaluation of verb alignment with different modelings of long distance relationship .................... . ............ Evaluation of inference models with different modelings of long distance relationship ................................ Evaluation of inference models with different LDR modelings for dif- ferent hypothesis types .......................... Effect of different representations of conversation segments on entail- ment performance ............................. Effect of different conversation representations for different hypothesis types .................................... Effect of different alignment models on entailment performance . . . . 
Effect of different alignment models for different hypothesis types Comparing the cross—validation model and the model learned from de- velopment data for verb alignment results ............... 92 94 96 97 98 102 110 114 115 117 118 120 121 The dependency structures for examples of shallow semantic modeling 126 Chapter 1 Introduction While we human, based on our linguistic and world knowledge and reasoning capa- bilities, are able to make inference and derive knowledge and conclusions from what we communicate to each other, automated inference from natural language has been a significant challenge for N LP systems. This is due to many reasons: 1. The variability, flexibility, and ambiguity from the language itself. 2. The representation of knowledge in computer systems and the scope of the world knowledge. 3. The capabilities that support automated reasoning. A tremendous amount of research has been done in pursuing all the above direc— tions. Recent efforts which have touched upon all these directions are the five events of PASCAL RTE (Recognizing Textual Entailment) Challenge [8, 10, 22, 36, 37]. The PASCAL RTE Challenge formulates natural language inference problem as a textual entailment problem. It provides a concrete, yet informal definition of the problem: a textual entailment is a directional relationship between pairs of text expressions, denoted by T - the entailing “Text”, and H - the entailed “Hypothe- sis”. T is said to entail H if we can infer H from the meaning of T. Examples of 1 Table 1.1: Examples of text-hypothesis pairs for textual entailment Text Hypothesis Entailed iTunes software has seen strong Strong sales for iTtmes True sales in Europe. in Europe. Cavern Club sessions paid the Beatles The Beatles perform at True £15 evenings and £5 lunchtime. Cavern club at lunchtime. Sharon warns Arafat could be targeted Prime minister targeted False for assassination. for assassination. Mitsubishi Motors Corp.’s new vehicle Mitsubishi sales rose False sales in the US fell 46 percent in June. 46 percent. text-hypothesis pairs from the PASCAL RTE Challenge, together with the labels of whether H is entailed from T, are shown in Table 1.1. Because complete, accurate, open-domain natural language understanding is far beyond current capabilities, nearly all efforts in this area have sought to extract the maximum mileage from quite limited semantic representations. There are three major classes of approaches to the textual entailment problem: the IR-based approaches, the logic-based approaches, and the graph—based approaches. An overview of these approaches can be found in Section 2.1. Successfully recognizing textual entailment has many potential applications such as text retrieval, question answering, information extraction, document summariza- tion, and machine translation evaluation. While PASCAL provides a concrete platform for studying natural language in- ference, its particular focus is on text. The data are all from well-formed newswire articles in a monologue fashion. Nowadays, more and more conversation scripts has become available, such as call center records, conference transcripts, public speeches and interviews, court records, online chatting, and so on. They contain vast amount of information, such as profiling information of conversation participants and infor- mation about their social relations, beliefs, and opinions. Therefore, the capability to automatically infer knowledge and facts from these data has become increasingly important. 
One question is, can we follow the PASCAL practice and study natural 2 language inference from the dialogue setting? On the one hand, although a conversation is a communication by two or. more people, it is essentially a kind of information expressed by natural language, as is the case for text. Therefore, making inference from conversations requires similar techniques as textual entailment such as language modeling, lexical processing, syn- tactic parsing, and semantic understanding, and also shares the same tools such as reasoning and world knowledge. On the other hand, conversations also have many unique characteristics that dis- tinguish them from text. The key distinctive features include turn-taking, grounding, implicature, and different linguistic phenomena. They can also contain information that is unique to themselves. For example, in a task-oriented conversation, we are in- terested in whether the task is accomplished in the end; in a cooperative conversation, we may be interested in how well the participants cooperated with each other; and in a debate, we may want to know which party performs better or which one actually wins the debate. These tasks involve not only the processing of lexica, syntax, and semantics, but also the recognition of dialogue intention and conversation structure. Therefore the inference from conversation scripts is a more challenging task. Thus, it is the goal of this thesis to take an initial investigation on natural language inference from conversation scripts. Inspired by textual entailment, we formulate this problem as conversation entailment: given a segment of conversation discourse D and a hypothesis H, the goal is to identify whether H can be entailed from D. For example, below is a short segment of conversation script, together with a list of hypotheses. Conversation segment: A: Um, yeah, I would like to talk about how you dress for work, and, and, um, what do you normally, what type of outfit do you normally have to wear? B: Well, I work in, uh, Corporate Control, so we have to dress kind of nice, so I usually wear skirts and sweaters in the winter time, slacks, I guess. Hypotheses: 1. A wants to know B’s dress code at work. 2. B works in Corporate Control. 3. The speakers have to dress nice at work. In this example, the first two hypotheses can be entailed from the conversation segment, while the third one cannot. 1.1 Research Objectives and Overview To study the problem of conversation entailment, this thesis particularly examines the following issues: 1. To what degree the techniques developed for textual entailment can be re—used for conversation entailment? 2. What unique characteristics of conversations should be modeled and incorpo- rated for conversation entailment? 3. How to combine linguistic, discourse, and context features together to develop an automated system for conversation entailment? To address the above questions, in this thesis we have conducted the following work: 1. We created a database of examples on conversation entailment following the PASCAL practice of textual entailment to facilitate our research objectives. We selected 50 conversations from the Switchboard corpus [38] and had 15 vol- unteer annotators read the selected conversations and create hypotheses about 4 a. if participants. As a result, a total of 1096 entailment examples were created. 
Each example consists of a conversation segment, a hypothesis statement, and a truth value indicating whether the hypothesis can be entailed from the conver- sation segment given the whole history of that conversation session. Inspired by previous work [34, 49], we particularly asked annotators to provide hypotheses that address the profiling information of the participants, their opinions and de- sires, as well as the communicative intents (e.g., agreements or disagreements) between participants. The entailment judgement for each example was further independently anno- tated by four annotators (who were not the original contributors of the hypothe- ses). As a result, on average each entailment example (i.e., a pair of conversation segment and hypothesis) received five judgements. We removed the entailment examples that have less than 75% agreement among human annotators, and divided the remaining data into a development set of 291 examples and a test set of 584 examples. . We developed a probabilistic framework that facilitates the solution of both textual entailment and conversation entailment problems. This framework first represents all forms of language in terms of dependency structures, and then conducts a two-stage procedure to predict the entailment relation. In the first stage, the nodes in the dependency structure of the hypothesis side are aligned to the nodes in the dependency structure of the premise side (i.e., text or conversation segment). In the second stage, the relations in the dependency structure of the hypothesis are predicted to be entailed or not entailed. Probabilistic decomposition allows the system to break down the decision of whether the entire hypothesis is entailed into a series of decisions that whether each relation in the dependency structure of the hypothesis is 5 entailed. We developed a baseline approach based on this framework that is driven by textual entailment, and applied it to solve the conversation entailment problem. . We identified unique language behaviors that distinguish conversations from text, which may have potential influence on the entailment decision. We devel- oped a representation of conversation structure that augments the dependency structure representation. This is done by expanding the dependency structure of conversation segment, incorporating turn-taking, speaker, and dialogue act information. We show through experiments that this feature is very important in predicting conversation entailment. Combined with enhanced computational models (introduced below), the modeling of conversation structure improves the performance by an absolute difference of 4.8% on the test data. Particularly, we have found that such modeling is especially important for the inference of participants’ communicative intents. . We developed enhanced computational models that integrates shallow semantic characterization for predicting conversation entailment. String representation is used to describe the long distance relationship between any two language constituents in a dependency structure. Such relational features in syntactic parse structures, which have been used in other language processing tasks such as semantic role labeling [74], are known as an effective way to model “shallow semantics” in language. However, their usage in entailment tasks has not yet been explored. We demonstrated through our experiments that the enhanced feature is an important way to characterize the (shallow) semantic relation between two lan- guage constituents. 
This feature helps to make the prediction of whether a certain kind of relation in the hypothesis statement is entailed from the con- 6 versation segment. It is especially effective with the modeling of conversation structure, in which case it improves the system’s prediction accuracy by an absolute difference of 3.9% on our test data set. 1.2 Outline The remaining thesis is organized as follows: 0 Chapter 2 gives a brief overview of the recent work related to conversation entailment. They are from two areas: 1) textual entailment; and 2) automated processing of conversation scripts. 0 Chapter 3 describes a dependency approach to textual entailment. 0 Chapter 4 gives a preliminary investigation on conversation entailment. 0 Chapter 5 describes our approach to incorporate different conversation features in conversation entailment, including conversation structure. 0 Chapter 6 describes the enhanced models for conversation entailment, by in— corporating string features to capture semantic relation between language con- stituents. 0 Chapter 7 provides discussions based on our experiments on conversation en- tailment, unveiling the challenges in the conversation entailment problem. 0 Chapter 8 concludes our work and discusses future research directions. Chapter 2 Related Work There are two groups of work that are related to conversation entailment: one is in the area of textual entailment, the other concerns various studies based on conversation scripts. 2. 1 Textual Entailment This thesis work is inspired by a large body of recent work on textual entailment initiated by the PASCAL RTE Challenges [8, 10, 22, 36, 37]. Because complete, accurate, open-domain natural language understanding is be- yond current capabilities, researchers have attempted to extract the maximum mileage from limited semantic representations. To address the problem of textual entailment, this section gives a brief overview of these approaches. Perhaps the most common representation of textual content is “bag-of—words” or “bag-of-n-grams” [71]. Based on this representation, simple measures of semantic overlap has been experimented for textual entailment, such as simple overlap count- ing on bag-of-words or bag-of-n-grams, or weighting by TF-IDF scores, and so on [48]. These models are similar to those typically used in the area of information retrieval 8 (IR). Treating the text as a document and the hypothesis as a query, the strength of entailment is then assessed by their IR score. However, such models are too im- poverished to be of much use, because they do not account for syntactic or semantic information which is essential to determining entailment. For example, the following text-hypothesis pair can get a high IR score, but the hypothesis is not entailed from the text: Text: The National Institute for Psychobiology in Israel was established in 1979. Hypothesis: Israel was established in 1979. Apart from the IR—based approaches, more interesting approaches take into ac- count the structure information in natural language. Based on different representa- tions of the language structure, they can be classified into two major classes: logic- based approaches and graph-based approaches. 2.1.1 Logic-based Approaches Since the terms entailment, inference, and equivalence all originated from logic [87], it is perhaps the most natural idea to target this problem by logic proving. 
By converting the natural language sentences into logic representations, one can decide that the text entails the hypothesis if the hypothesis can be proved from the text. Logic representations of natural language ranges from traditional first-order logic [1, 32] and Discourse Representation Theory [12] to neo-Davidsonian—style quasi-logical form [65, 76], but they are in essence similar. Take the one used by Rain-a et a1. [76] for example, T: Bob purchased an old convertible. H: Bob bought an old car. can be represented as T: (3A, B, C)Bob(A) A convertible(B) A old(B) A purchased(C', A, B) H: (3X, Y, Z)Bob(X) A car(Y) A old(Y) A bought(Z, X, Y) With this representation, the hypothesis is inferred from the text if and only if it can be logically proved from the latter. A strict theorem prover finds a proof for the hypothesis given the text using the method of resolution refutation. It adds the negation of the goal logical formula (i.e., the hypothesis) to a knowledge base consisting of the given axioms (i.e., the text), and then derives a null clause through successive resolution steps. This corresponds to justifying (i.e., “proving”) the goal by deriving a contradiction for its negation. For example, the following clauses are obtained for the previous example: (3A, B, C)Bob(A) A c0nvertible(B) A old(B) A purchased(C, A, B) (VX, Y, Z)-1B0b(X) V -vcar(Y) V -10ld(Y) V -b0ught(Z, X, Y) However, approaches relying on strict logic proving has limited use in practice due to two major reasons. First, they require full understanding of the language and accurate representation of all semantic relations in terms of logic. However, accurate logic representation for natural language is not currently available, and the state-of- the-art semantic parsers extract only some of the semantic relations encoded in a given text. Second, world knowledge is often required in the process of reasoning. For example, one must either know or assume that “a convertible is a car” in order to correctly infer the entailment “Bob bought an old car” in the previous example. As a result, previous approaches relying on mapping to first order logic representations with a general prover without using rich knowledge sources [12] have not borne much fruit. 10 Because logic entailment is a quite strict standard, logic-based approaches tend to lead to high precision but low recall [12]. Facing this issue, researchers have been seeking for various compromises to relax the strictness and increase flexibility. The abductive reasoning approach [76] relaxes the unification of logic terms to an approximate one, and encode their knowledge about the semantics into a cost function assessing the plausibility of the approximated unifications. As the function is trained on a labeled set of data statistically, this approach is more robust and scalable and results in higher recall. To incorporate world knowledge into the logic proving model, some systems em- ploys hand-crafted semantic axioms to enrich the logic representation of natural lan- guage before the proving process [65]. This provides an enrichment to the semantic relations, but it is less scalable to be applicable to large data or broader domains. MacCartney and Manning [58] introduced natural logic to model containment and exclusion in the entailment problem. 
They classified all entailment relations into seven mutually exclusive classes: equivalence (couch = sofa); forward entailment (crow I: bird) and its converse (EurOpean :1 French); negation, or exhaustive exclusion (human “ nonhuman); alternation, or non-exhaustive exclusion (cat I dog); cover, or non-exclusive exhaustion (animal v nonhuman); and independence (hungry # hippo). They then form the entailment of a compound expression as a function of the entailments of its parts. Semantic functions f () are categorized into different pro jectivity classes, which describe how the entailment relation between f (z) and f (y) depends on the entailment relation between a: and y. For example, simple negation (not) projects =, #, and ‘ without change (not happy = not glad, isn’t swimming # isn’t hungry, and not human ‘ not nonhuman), and swaps I: and I] (didn’t kiss :I didn’t touch) and | and v (not French v not German, not more than 4 I not less than 6). This allows the system to determine the entailment of a compound expression recursively, by propagating entailments upward through a 11 nn num I Mitsubishi] I 46 . I Figure 2.1: An example for dependency graph (from MacCartney et a1. [59]) semantic composition tree according to the projectivity class of each node on the path to the root. For example, the semantics of Nobody can enter without a shirt might be represented by the tree (nobody (can ((without (a shirt) ) enter) ) ) Since shirt I: clothes, so without shirt :1 without clothes, and Nobody can enter without a shirt I: Nobody can enter without clothes. As we can see, the judgement of entailment here still follows a rather strict standard. Therefore the system’s performance on the PASCAL RTE Challenge resulted in relatively high precision but low recall. 2.1.2 Graph-based Approaches The graph-based approach is to formulate the entailment prediction as a graph match- ing problem. It represents the text and the hypothesis as semantic graphs derived from syntactic dependency parses [25, 40]. Figure 2.1 shows an example of the graph representation for a sentence “Mitsubishi sales rose 4 6 percent”. Given the graph representations for both the text and the hypothesis, semantic alignments are performed between the graph representing the hypothesis and a por- tion of the corresponding graph(s) representing the text. Each possible alignment of the graphs has an associated score, and the score of the best alignment is used as an approximation to the strength of the entailment. Figure 2.2 shows an exam- ple of matching the hypothesis “Bezos established a company” to the text “In 1991, Amazon.com was founded by Jefir Bezos” and the cost of this match. 12 establish (VBD) Synonym Match gCost: 0.4 Exact Hyponym Match 5 Match Cost: 0.05 ' In(Temporal) Jeff Bezos Amazoncom (person) (organization) (date) Vertex Cost: (0.0 + 0.2 + 0.4)/3 = 0.2 Relation Cost: 0 (Graphs Isomorphic) Match Cost: 0.55 X 0.2 + 0.45 X 0 = 0.11 Figure 2.2: An example for graph matching (from Haghighi et al. [40]) MacCartney et a1. [59] used a two-stage approach to first find the alignment be- tween the two graphs and then make the entailment prediCtion. In the first step, the algorithm searches for a good partial alignment from the typed dependency graph representing the hypothesis to the one representing the text, which maximizes the alignment score. In the second step, a classifier was trained to determine the entail- ment relationship given the complete aligned graph. MacCartney et al. [60] has taken the alignment step further. 
Their work aligns phrases in the sentences rather than nodes in the graph (or tokens in the sentences). In their notion, “phrase” refers to any contiguous span of tokens, not necessarily cor- responding to a syntactic parse. The phrase-based alignment is to eliminate the needs for many-to-many alignments, since they can be reduced to one-to-one alignments on phrase level. For example, in “In most Pacific countries there are very few women in parliament.” and “Women are poorly represented in parliament.” they can align very few and poorly represented as units, without being forced to make a difficult choice as to which word goes with which word. 13 Because finding the best alignment between two graphs is N P-complete, exact computation is intractable. Therefore researchers have proposed a variety of approx- imate search techniques, such as local greedy hill-climbing search [40], or incremental beam search [59]. Similar to the semantic axioms [65] in logic-based approaches, de Salvo Braz et al. [25] use “rewriting rules” in the graph-based approach to generate intermediate forms from the original text, with a good supply of additional linguistic and world knowledge axioms. The cost of matching the text to the hypothesis is then determined by the minimal cost among matches from all the intermediate forms to the hypothesis. Such rewriting rules are also referred to as inference rules [27, 55], entailment rules [85], or entailment relations [86]. They are acquired from large corpora based on the Distributional Hypothesis [41]. The Distributional Hypothesis states that phrases in similar context tend to have similar meanings. For example, if X prevents Y and X provides protection against Y are repeatedly seen in a large corpora, it can be induced that prevent implies provide protection against, and thus prevent —> provide protection against is an inference rule. The current largest collection of such rules is DIRT [55]. These rules were widely applied to solve the textual entailment problem [27]. Besides DIRT and other efforts to acquire binary rules (rules templates with two variables) [78, 86], recent work [85] has proposed unsupervised learning of unary rules (e.g., X take a nap —-> X sleep). However, their applications on the textual entailment task have not yet been explored. 2.1.3 Comparing Logic-based and Graph-based Approaches Although they use different forms of representations for natural language, logic-based and graph-based approaches are considered isomorphic by MacCartney et a1. [59]. In a graph representation, the nodes and edges can be seen as the logic terms in a logic representation. For instance, the graph in Figure 2.1 can be represented in 14 neo-Davidsonian quasi-logic form as follows: rose(e1), nsubj(el, $1), sales(a:1), nn(:c1, 11:2), Mitsubishi(:c2), dobj(e1, 11:3), percenttzg), names, x4), 46(2.) In fact, the logic representations are often derived from dependency graphs by a semantic parser. The alignment between the hypothesis graph and the text graph can be seen as resolving logic terms in logic proving. They both consider matching an individual node or term of the hypothesis with some counter part from the text. And weighting different semantic features in the procedure of calculating the graph matching (or entailment) score is similar to the “abductive reasoning” approach [76], where logic terms are resolved by some score calculated over a set of features. 
2.1.4 Performance Analysis The PASCAL RTE Challenge, which has been held for five times, provides a bench- mark for evaluating systems’ performance on judging the entailment. Here we give a brief overview of the results of the last three, the third [36], fourth [37], and fifth [10] PASCAL RTE Challenges. In the RTE-3 task [36], a development set and a test set were provided, each of which contained 800 text-hypothesis pairs. A system’s performance was evaluate by its accuracy on the test set, that is, how many entailment relationships (true or false) were correctly predicted out of the 800 pairs. A natural baseline by random guess would obtain 50% accuracy. There were 45 systems who participated in this evaluation. Among them the best system achieved an accuracy of 80.0%, and the mean and median accuracies were 61.7% and 61.8%, respectively. 15 It should be well noted from our previous discussion, that the main architectures for different systems are more or less the same. So the critical part that makes the performance difference is how much knowledge is incorporated in the systems. The participating systems in the PASCAL workshop made wide use of various sources of public knowledge bases, such as WordNet [30, 64], DIRT [55], FrameNet I7], Verb— Net [51], and PropBank [50]. But the most successful systems [43] (with the highest accuracy) have used additional knowledge sources, including Extended WordNet [47], XWN-KB [88], TARSQI [88], and Cicero/Cicero-Lite [44], most of which were not publicly available. MacCartney et al. [60] indicated that such systems are “idiosyn- cratic and poorly-documented”, “often using proprietary data, making comparisons and further development difficult”. The fou'rth PASCAL RTE Challenge [37] attracted participation of 45 systems. Their prediction accuracies range from 49.7% to 74.6%, with an average of 57.9% and median of 57.0%. The fifth PASCAL RTE Challenge [10] had participation of 54 systems. The prediction accuracies range from 50.0% to 73.5%, with an average of 61.1% and median of 60.4%. The data collections of these two challenges followed the same setting as the third challenge. Comparing these three evaluations on textual entailment, although different data were actually used to evaluate the participating systems, there are no significant variations in their result statistics. Among the participating systems in the last three PASCAL RTE Challenges, although some of them have explored very in depth into specific technical aspects (e.g. entailment of temporal expressions [90]), the overall framework of methodology has not evolved much. In other words, they were continuously solving the textual entailment problem either by logic proving or by graph matching. Nevertheless, a conversation discourse is very different from a written monologue discourse. The conversation discourse is shaped by the goals of its participants and their mutual beliefs. The key distinctive features include turn-taking between par- 16 ticipants, grounding between participants, and different linguistic phenomena of ut- terances (e.g., utterances in a conversation tend to be shorter, with disfluency, and sometimes incomplete or ungrammatical). It is the goal of this thesis to explore how techniques developed for textual entailment can be extended to address these unique behaviors in conversation entailment. 2.2 Studies on Conversation Scripts Recent work has applied different approaches to extract and acquire various kinds of information from human-human conversation scripts. 
Related work ranges from low-level recognition of conversation structure to high-level applications such as iden- tifying biographical facts, attributes, and social relations, detecting agreements and disagreements between participants, meeting summarization, and predicting success in task-oriented dialogues. 2.2.1 Recognition of Conversation Structures Related work on recognizing conversation structures based on conversation scripts includes the recognition of dialogue acts and discourse structures. Dialogue Acts The ability to model and automatically detect discourse structure is an important step toward dialogue understanding. Dialogue acts are the first level of analysis of discourse structure. A dialogue act represents the meaning of an utterance at the level of illocutionary force [5], such as Statement, Question, Backchannel, Agreement, Disagreement, and Apology. Although specific applications only require relevant dia- logue act categories, Allen and Core [3] developed a dialogue act labeling system that is domain-independent. 17 Stolcke et a1. [84] presented a domain independent framework for automated dia- logue act identification, which for the most part treats dialogue act labels as a formal tag set. The model is based on treating the discourse structure of a conversation as a hidden Markov model [75]. The HMM states correspond to dialogue acts and the observations correspond to utterances. The features that are used by Stolcke et a1. [84] to describe the utterances are mostly based on conversation transcripts, including transcribed words and recognized words from the speech recognizer. But they use some of the prosodic features too, such as pitch, duration, energy, etc. The HMM representation allows efficient dynamic programming algorithms to compute relevant aspects of the model, such as o The most probable dialogue act sequence (the Viterbi algorithm). 0 The posterior probability of various dialogue acts for .a given utterance, after considering all the evidence (the forward-backward algorithm). The Viterbi algorithm for HMM [89] finds the globally most probable state se- quence. When applied to a discourse model, it will therefore find precisely the dialogue act sequence with the highest posterior probability. Such Viterbi decoding is funda- mentally the sarne as the standard probabilistic approaches to speech recognition [6] and tagging [19]. While the Viterbi algorithm maximizes the probability of getting the entire dia- logue act sequence correct, it does not necessarily find the dialogue act sequence that has the most dialogue act labels correct [26]. To maximize the total accuracy of utter- ance labeling, it is needed to maximize the probability of getting each dialogue label correct individually, which can be efliciently carried out by the forward-backward algorithm for HMM [9]. 18 Opinion Frames Opinions in conversations are defined [80, 91] in two classes: sentiment includes positive and negative evaluations, emotions, and judgments; and arguing includes arguing for or against something, and arguing that something should or should not be done. Opinions have a polarity that can be positive or negative. The target of an opinion is the entity or proposition that the Opinion is about. For example (a conversation about designing a remote control, from Somasundaran et a1. [82]): C: . . . shapes should be curved, so round shapes. Nothing square-like. C: . . . So we shouldn’t have too square corners and that kind of thing. B: Yeah okay. Not the old box look. 
In the utterance “shapes should be curved” there is a positive argument with the target curved, and in the utterance “Not the old boa: look” there is an negative sentiment, with the target the old boa: look. It is argued that while recognizing opinions of individual expressions and their properties is important, discourse interpretation is needed as well [82]. In the above example, we see from the discourse that curved, round shapes are the preferred types of design, and square-like, square corners, and the old born look are not. The discourse level association of opinions are modeled as opinion frames [82]. An opinion frame consists of two opinions that are related by virtue of having related targets. There are two relations between targets, same and alternative. The same relation holds between targets that refer to the same entity, property, or proposition. Here the term “same” covers not only identity, but also part-whole, synonymy, gen- eralization, specialization, entity-attribute, instantiation, cause-effect, epithets and implicit background topic. The alternative relation holds between targets that are 19 related by virtue of being opposing (mutually exclusive) options in the context of the discourse. In the above example, there is an alternative relation between tar- gets curved and square-like, and there are same relations between targets square-like, square corners, and the old boa: look. An opinion frame is defined as a structure composed of two opinions and their re- spective targets connected via their target relations. For each of the two opinion slots, there are four possible type/ polarity combinations (sentiment / arguing combined with positive / negative). So combined with two possible target relations (same / alternative), there are totally 4 x 4 x 2 = 32 different types of opinion frames. In the above exam- ple, shapes should be curved and Nothing square-like constitutes an opinion frame of APAN alt (positive arguing and negative arguing with alternative targets). Somasundaran et al. [82] argued that recognizing opinion frames will provide more opinion information for NLP applications than recognizing individual opinions alone, because opinions regarding something not lexically or even anaphorically related can become relevant. Take the alternative relation for instance, opinions towards one alternative can imply opinions of opposite polarity toward the competing options. In the above conversation example, if we consider only the explicitly stated Opinions, there is only one (positive) opinion about the curved shape. However, the speaker ' expresses several other (negative) opinions about alternative shapes, which reinforce his positivity toward the curved shape. Thus, by using the frame information, it is possible to gather more opinions regarding curved shapes for TV remotes. Further, if there is uncertainty about any one of the components, they believe opinion frames are an effective representation incorporating discourse information to make an overall coherent interpretation [46]. In particular, suppose that some aspect of an individual opinion, such as polarity, is unclear. If the discourse suggests certain opinion frames, this may in turn resolve the underlying ambiguity. Again in the above example, the polarity of round shapes may be unclear. However, the polarity 20 of curved is clear, and by recognizing there is a same relation between these two targets, it is possible to resolve the ambiguity in the polarity of round shapes, which is also positive. 
Somasundaran et al. [82] proposed a machine learning approach to detect opinion frames. This is formulated as a classification problem: given two opinion sentences, determine if they participate in any frame relation. Their experiments assume oracle opinion and polarity information, and consider frame detection only between sentence pairs belonging to the same speaker. The data used in their work is the AMI meeting corpus [16], with annotations [81] for sentiment and arguing opinions (text anchor and type). A variety of features including content word overlap, focus space overlap, anaphoric indicator, time difference, adjacency pair, and standard bag of words were used in their experiment to determine if two opinions are related. Somasundaran et al. [83] used the opinion frames to improve the polarity classifi- cation of opinions. In their work they first implemented a local classifier to bootstrap the classification process, and then implemented classifiers that use discourse infor- mation (i.e., opinion frames) over the local classifier. They explored two approaches for implementing the discourse-based classifier: 1. Iterative Collective Classification [56, 69]: instances are classified in two phases, the bootstrapping phase and the iterative phase. In the bootstrapping phase, the polarity of each instance is initialized to the most likely value given only the local classifier and its features. In the iterative phase, discourse relations and the neighborhood information brought in by these relations are incorporated as features into a relational classifier. 2. Integer Linear Programming: the prediction of opinion polarity is formulated as an optimization problem, which maximizes the class distributions predicted by the local classifier, subject to constraints imposed by discourse relations. 21 2.2.2 High Level Applications Recent work has studied multiple types of specific inference that can be made from conversation scripts. These include biographic attributes, social networks and bio- graphical facts, agreements and disagreements, summarization, and success in task- oriented dialogues. Latent Biographic Attributes Biographic attributes of conversation speakers include gender, age, and native/non- native speaker. Such information is derivable from acoustic properties of the speaker, including pitch and f0 contours [11]. Recently, however, Garera and Yarowsky [35] worked on modeling and classifying such speaker attributes from only the latent in- formation found in conversation scripts. In particular, they modeled and classified biographic attributes such as gender and age based on lexical and discourse factors including lexical choice, mean utterance length, patterns of participation in the con- versation and filler word usage. Garera and Yarowsky [35] built their work upon the previous state-of-the—art [13], which models gender of speakers using unigram and bigram features in an SVM framework. For each conversation participant, they created a training example using unigram and bigram features with tf-idf weighting, as done in standard text classifi- cation approaches. Then an SVM model was trained to learn the weights associated with the n-gram features. They found some of the gender-correlated words proposed by sociolinguistics are also assigned with more discriminative weights by this empirical model, such as the frequent use of “oh” by females. 
They evaluated the performance of their approach on the Fisher telephone conversation corpus [23] and the standard Switchboard conversational corpus [38]. Garera and Yarowsky [35] further argued that a speaker’s lexical choice and dis— course style may differ substantially depending on the gender, age, and dialect of the 22 other person in the conversation. The hypothesis is that people tend to use stronger gender-specific, age-specific or dialect-specific word, phrase and discourse properties when speaking with someone of a similar gender, age, or dialect, compared to speak- ing with someone of a different gender, age, or dialect. In the latter case, they may adapt a more neutral speaking style. So Garera and Yarowsky [35] proposed to add performance gains in gender classification by using a stacked model conditioning on the predicted partner class. They trained several classifiers identifying the gender of each speaker, the gender combination of the entire conversation, and the conditional gender prediction of each speaker given the most likely gender of the other speaker. They then used the score of each classifier as a feature in a meta SVM classifier. There has also been substantial work in the sociolinguistics literature investigating discourse style differences due to speaker properties such as gender [20, 29]. Those works have shown gender differences for speakers due to features such as speaking rate, pronoun usage and filler word usage, suggesting that non-lexical features can further help improve the performance of gender classification on top of the standard n-gram model. Garera and Yarowsky [35] investigated a set of features such as speaker rate and percentage of pronoun usage, motivated by the sociolinguistic literature on gender differences in discourse [57]. Garera and Yarowsky [35] also extended their approach on gender classification to the prediction of speakers’ age and native/non-native speaker. Again they had findings consistent with the sociolinguistic studies for age [57], such as frequent usage of the word “well” among older speakers. Social Networks and Biographical Facts Jing et al. [49] gave a framework to extract social networks and biographical facts from conversation speech transcripts. Entities, relations, and events are extracted separately from the conversation scripts by different information extraction modules, 23 and a fusion module is then used to merge their outputs and extract social networks and biographical facts. Identified person entities and extracted relations are fused as nodes and ties in a social network. For example, from the input sentence my mother is a cook, a relation detection system identifies the relation motherO f (mother, my). And if an entity recognition module identifies that my refers to the person Josh and mother refers to the person Rosa, then by replacing my and mother with the corresponding named entities, the fusion module produces the following nodes and ties in a social network: motherO f (Rosa, Josh). As can be seen from this example, coreference resolution plays a critical role in the extraction of social networks. As a result, Jing et al. [49] paid a major effort on improving coreference resolution for conversations, by both feature engineering and improving the clustering algorithm. Biographical facts are extracted in a similar way by selecting the events (extracted by the event extraction module) and corresponding relations (extracted by the relation extraction module) that involve a given individual as an argument. 
Agreements and Disagreements

Conversations involve many agreements and disagreements of one speaker with another. Galley et al. [34] focused on the identification of agreements and disagreements at the utterance level, and formulated the problem as a multi-class classification problem: given an utterance from a speaker, the task is to classify whether it is an agreement, a disagreement, or neither of the two. They suggested using a sequence classification model for this task, with a set of local and contextual features characterizing the occurrence of agreements and disagreements.

The local features include lexical features such as agreement markers [21], e.g. yes and right, general cue phrases [45], e.g. but and alright, and adjectives with positive or negative polarity [42]. A set of durational features are also incorporated and described as good predictors of agreements: utterance length distinguishes agreement from disagreement, as the latter tends to be longer since the speaker elaborates more on the reasons and circumstances of her disagreement than for an agreement [21]. A fair amount of silence and filled pauses is sometimes an indicator of disagreement, since disagreement is a dispreferred response in most social contexts and can be associated with hesitation [73].

Galley et al. [34] also noted that context provides important information for the classification of agreements and disagreements. For example, whether an utterance is an agreement or a disagreement is largely influenced by whether the previous utterance from the same speaker is an agreement or a disagreement, i.e. an agreement is more likely to be followed by another agreement, and vice versa. There are also reflexive and transitive contexts that may be indicative. Reflexivity means that if A disagrees with B, then B is also likely to disagree with A. Transitivity means, for example, that if A agrees with B and B disagrees with C, then A may also disagree with C, and so forth. In order to capture both the local and the contextual features to classify the agreements and disagreements, Galley et al. [34] used a Bayesian network to perform the classification. The most probable agreement/disagreement sequence is computed by performing sequential decoding with beam search.

Meeting Summarization

Automatic summarization helps the processing of information contained in conversation scripts. Murray and Carenini [66] took an extractive approach to conversation summarization. They conducted a binary classification on sentences in a conversation, identifying whether each sentence should be extracted for the summary. Sentences were ranked by their classification scores, and a top portion of sentences was kept as the conversation summary until a certain threshold of word count was reached.
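The selection step just described can be illustrated with a small sketch: rank sentences by classifier score and keep the highest-scoring ones until a word-count budget is reached. The scoring model is assumed to be given; this is not the actual system of Murray and Carenini [66].

    def extract_summary(sentences, scores, word_budget):
        """sentences: candidate sentences; scores: parallel classifier scores."""
        ranked = sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)
        summary, used = [], 0
        for sentence, _ in ranked:
            length = len(sentence.split())
            if used + length > word_budget:   # stop once the word-count threshold is hit
                break
            summary.append(sentence)
            used += length
        # restore the original order so the extract reads coherently
        return sorted(summary, key=sentences.index)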
To locate the most salient sentences in a conversation, Murray and Carenini [66] derived various features to train their classifier, which include sentence lengths that were previously found to be effective in speech and text summarization [33, 62, 67], structural features capturing the relation between a sentence and the conversation, features related to conversation participants, a lexical feature capturing varying inter- ests and expertise between the conversation participants, a lexical feature capturing topic shifts in a conversation, cosine features capturing whether the conversation is changed by a sentence in some fashion, centroid features capturing the similarity between a sentence and the conversation, word entropy features measuring how infor- mative a sentence is, and whether a sentence is a turning point in the conversation, and the ClueWordScore used by Carenini et al. [15]. Murray and Carenini [66] used a simple feature subset selection based on the F statistics [18], and applied their extractive summarization system to a portion of the AMI corpus [16]. They found that the best features for summarization are sen- tence length, sum of term scores (described above), and the centroid features that measure whether the candidate sentence is similar to the conversation. Their evalu- ation results show that such a summarization system, which relies solely on features extracted from conversation scripts, achieved a competitive performance compared to the state-of-the-art summarization systems that also employ speech-specific (e.g. prosodic) features. Therefore, the same summarization system is also applicable to other domains similar to spoken conversations, such as email threads. Predicting Success in Task-oriented Dialogues In task-oriented dialogues, an important indicator of the communication effectiveness is whether the task is accomplished successfully. 26 Pickering and Garrod [7 2] suggested in their Interactive Alignment Model that dialogues between humans are greatly aided by aligning representations on several linguistic and conceptual levels. This effect is assumed to be driven by a cascade of linguistic repetition effects, where interlocutors tend to re-use lexical, syntactic and other linguistic structures after their introduction. Reitter and Moore [77] referred to this repetition effect, or a tendency to repeat linguistic decisions, as priming. Mo- tivated the hypothesis of Pickering and Garrod [72], Reitter and Moore [77] deduced that “the connection between linguistic persistence or priming effects and the success of dialogue is crucia ” for the Interactive Alignment Model. Based on this assumption, Reitter and Moore [77] proposed an automatic method of measuring task success. Reitter and Moore [77] tried to predict task success from a dialogue using lexical and syntactic repetition information. They used the HCRC Map Task corpus [4], where subjects were given two slightly different maps and one of them gives directions of a predefined route to another. The task success is then determined by the deviation between the route given by the leader and the route followed by the follower, which is measured by the area covered in between the two paths (PATHDEV). They trained an SVM regression model, using features of lexical, syntactic, and string repetitions and the PATHDEV score as output. Their results show that “linguistic repetition serves as a good predictor of how well interlocutors will complete their joint task” [77]. 
Reitter and Moore [77] further compared the indications of short-term priming and long-term priming (alternatively called adaptation). It was argued that short- and long-term adaptation effects may be due to separate cognitive processes [31], so they wanted to find out whether alignment in dialogues is due to the automatic, classical priming effect, or whether it is based on a long-term effect that is possibly closer to implicit learning [17].

Through similar experiments using PATHDEV as a measurement of task success, Reitter and Moore [77] found that path deviation and short-term priming did not correlate. Although the priming effect is clear in the short term, "the size of this priming effect does not correlate with task success" [77]. In contrast, there is a reliable correlation between task success and long-term adaptation: stronger path deviations relate to weaker adaptation. The more adaptation was observed, the better the subjects performed in synchronizing their routes on the maps. This confirms the assumption derived from the Interactive Alignment Model. In conclusion, the correlation shows that, of the repetition effects included in the task-success prediction model, it is long-term adaptation, as opposed to the more automatic short-term priming effect, that contributes to prediction accuracy. "Long-term adaptation may thus be a strategy that aids dialogue partners in aligning their language and their situation models." [77]

Chapter 3

A Dependency Approach to Textual Entailment

Conversations are not completely different from text. After all, a conversation is made up of similar linguistic components, from words and sentences to discourse. The first question, then, is to what degree methods for textual entailment can be used to infer knowledge from conversations. In this chapter, we describe a dependency-based approach for textual entailment, which provides a reasonable baseline for our investigation on conversation entailment.

3.1 A Framework of the Dependency Approach

As introduced in Chapter 1, a definition of the textual entailment problem is given by the PASCAL RTE Challenge [8, 10, 22, 36, 37]: given a piece of text T and a hypothesis H, the goal is to determine whether the meaning of H can be sufficiently inferred from T.

Formally, we use the sign ⊨ to denote the entailment relationship. We represent that T entails H as

T ⊨ H

Similarly, if T does not entail H, we represent it as

T ⊭ H

Given such context, we will use the phrase premise discourse to refer to the text from which the meaning is to be inferred (in conversation entailment it is a conversation segment), and use the letter D to denote it. The hypothesis is usually a single statement (e.g., in the PASCAL RTE data set and our data set); we call it the hypothesis statement and use the letter S to denote it. Thus a generic form of the textual or conversation entailment problem is stated below: given a premise discourse D and a hypothesis statement S, estimate the probability

P(D ⊨ S | D, S)

This probability represents the likelihood of the entailment relationship between D and S, and we say that D entails S if this likelihood is above a certain threshold (usually 0.5).

3.1.1 Representation

This section discusses how we represent natural language text and statements in our system.
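Before turning to the representation, the decision rule implied by the formulation above can be sketched as follows. The probability estimator itself is assumed to be supplied by the models developed in the remainder of this chapter.

    def entails(premise, hypothesis, estimate_probability, threshold=0.5):
        # estimate_probability(premise, hypothesis) approximates P(D |= S | D, S)
        return estimate_probability(premise, hypothesis) > threshold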
We first introduce several concepts that we use throughout the presentation of our framework:

• A term refers to either an entity or an event:

  - An entity refers to a person, a place, an organization, or another real-world entity. This follows the same idea as the concept of mention in the Automatic Content Extraction (ACE) Workshops [28]: a mention is a reference to a real-world entity; it can be named (e.g. John Lennon), nominal (e.g. mother), or pronominal (e.g. she).

  - An event refers to an action, an activity, or another real-world event. For example, from the sentence John married Eva in 1940 we can identify the event of marriage.

  We use lower-case letters to represent terms (e.g., x = John, y = marry, etc.).

• A clause is either a property or a relation:

  - A property is a property associated with a term (entity or event). For example, an entity company can have a property of Russian, and an event visit can have a property of recently. We use a unary predicate p(x) to represent a property, e.g. Russian(company), recently(visit).

  - A relation is a relation between two terms (either entities or events). For example, from the phrase headquarter in Canada we can recognize that the entities headquarter and Canada have a relation of "is in". From the phrase Prime Minister visited Brazil we can recognize that the event visit and the entity Prime Minister have a relation that Prime Minister "is the subject of" visit. We use a binary predicate r(x, y) to represent a relation, e.g. in(headquarter, Canada), subj(visit, Prime Minister).

Syntactic Decomposition

The clause representation of a natural language sentence is derived from its syntactic parse tree. The process of converting a parse tree to the clause representation can be seen as a decomposition of the tree structure.

[The original figure shows the parse tree of "Bountiful reached San Francisco in August 1945", with the terms x1 = Bountiful, x4 = reached, x2 = San Francisco, and x3 = August 1945, decomposed by the following rules.]

Syntactic substructure    Head                  Derived clause
NP → Bountiful            x1 = Bountiful
NP → San Francisco        x2 = San Francisco
NP → August 1945          x3 = August 1945
VBD → reached             x4 = reached
PP → IN NP                in x3
VP → VBD NP               x4                    object(x4, x2)
VP → VP PP                x4                    preposition(x4, in x3)
S → NP VP                 x4                    subject(x4, x1)

Figure 3.1: An example of syntactic decomposition

Decomposing a syntactic parse tree into a set of clauses is based on dependency parsing [24], where a set of hand-crafted rules, or patterns, are applied to the phrase structures. Appendix A lists the set of rules that we developed to derive the dependency structures.

Figure 3.1 illustrates the decomposition process for the statement Bountiful reached San Francisco in August 1945. For each phrase structure in the parse tree (e.g., S → NP VP), an associated decomposition rule specifies two types of information: (1) the head term of the parent node (e.g., S), which is obtained from one of its children (in this case the head of S is obtained from the head of VP); (2) the clauses that are to be generated, e.g., for S → NP VP we generate subject(h2, h1), where h1 is the head term of the first child (NP), and h2 is the head term of the second child (VP). The head terms of NP and VP are obtained recursively by decomposition rules defined upon the substructures spanning them (e.g., NP → Bountiful and VP → VP PP, respectively).
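The recursive decomposition of Figure 3.1 can be sketched on a simplified parse-tree encoding as follows. The rule table covers only the substructures of this example and is an illustrative assumption, not the full rule set of Appendix A.

    # Each rule maps a phrase structure to (index of the head child, clauses over
    # child heads). The PP rule is simplified to return the NP head.
    RULES = {
        ("S",  ("NP", "VP")):  (1, [("subject", 1, 0)]),
        ("VP", ("VBD", "NP")): (0, [("object", 0, 1)]),
        ("VP", ("VP", "PP")):  (0, [("preposition", 0, 1)]),
        ("PP", ("IN", "NP")):  (1, []),
    }

    def decompose(node, clauses):
        """node = (label, [children]) for phrases, or (label, word) for leaves.
        Returns the head term of the node; derived clauses accumulate in `clauses`."""
        label, rest = node
        if isinstance(rest, str):                     # preterminal: the word is the head term
            return rest
        heads = [decompose(child, clauses) for child in rest]
        head_idx, rule_clauses = RULES.get((label, tuple(c[0] for c in rest)), (0, []))
        for rel, i, j in rule_clauses:
            clauses.append((rel, heads[i], heads[j]))
        return heads[head_idx]

    tree = ("S", [("NP", [("NNP", "Bountiful")]),
                  ("VP", [("VP", [("VBD", "reached"),
                                  ("NP", [("NNP", "San Francisco")])]),
                          ("PP", [("IN", "in"),
                                  ("NP", [("NNP", "August 1945")])])])])
    clauses = []
    decompose(tree, clauses)
    print(clauses)
    # [('object', 'reached', 'San Francisco'),
    #  ('preposition', 'reached', 'August 1945'),
    #  ('subject', 'reached', 'Bountiful')]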
We have also taken care of the following processes in our decomposition rules, similar to those in dependency parsing [24]:

• Collapsing a prepositional relation preposition(x, prep y) into a relational clause between x and y described by prep. For example, in Figure 3.1, the clause preposition(x4, in x3) is collapsed into in(x4, x3).

• Processing conjunct dependencies to produce a representation closer to the semantics. For example, for "bills on ports and immigration" we produce on(bills, ports) and on(bills, immigration) (as opposed to on(bills, ports) and and(ports, immigration)). This is implemented by the multi-head mechanism encoded in our decomposition rules.

• Adding arguments for relative clauses. For example, for I like the man who tells jokes we have subject(tells, man).

After the syntactic decomposition, both the premise discourse and the hypothesis statement are represented as sets of terms (e.g., x1 = Bountiful, x4 = reached, etc.) and clauses (e.g., object(x4, x2), in(x4, x3), etc.). Figure 3.2(a) shows an example of a premise and the corresponding hypothesis. Figure 3.2(c) shows the decomposed terms and clauses for the premise, and Figure 3.2(e) shows the decomposed representation for the hypothesis. We use the term "clause" here because, logically, a statement is the conjunction of a set of clauses. Similarly, a natural language statement can be viewed as a conjunction of the clauses defined above.

Premise: Bountiful arrived after war's end, sailing into San Francisco Bay 21 August 1945.
Hypothesis: Bountiful reached San Francisco in August 1945.
(a) The text premise and hypothesis statement

[(b) The dependency structure for the premise, not reproduced here.]

Terms: y1 = Bountiful, y2 = war, y3 = end, y4 = San Francisco Bay, y5 = 21 August 1945, y6 = sailing, y7 = arrived
Clauses: modifier(y3, y2), into(y6, y4), adverbial(y6, y5), subject(y7, y1), after(y7, y3), adverbial(y7, y6)
(c) Clause representation for the premise

[(d) The dependency structure for the hypothesis, not reproduced here.]

Terms: x1 = Bountiful, x2 = San Francisco, x3 = August 1945, x4 = reached
Clauses: subject(x4, x1), object(x4, x2), in(x4, x3)
(e) Clause representation for the hypothesis

Figure 3.2: The decomposition of a premise-hypothesis pair

This representation is similar to the neo-Davidsonian-style quasi-logical form [65, 76], and we also follow its idea of reifying the verb terms. Alternatively, a representation without reification would put the sentence "Bountiful reached San Francisco" as reach(Bountiful, San Francisco), but in this way the modifier "in August 1945" would have no place unless higher-order logic is introduced. This representation is also similar to a typed dependency structure, if we view terms as nodes, property clauses as node properties, and relation clauses as dependency edges. The only difference between our representation and a dependency structure is that we only take nouns and verbs as terms (or nodes), and put other words like adjectives and adverbs as properties, e.g. instead of mod(visit, recently) we have recently(visit). Figure 3.2(b) and 3.2(d) show the dependency structures of both the premise and the hypothesis (corresponding to the clause representations in Figure 3.2(c) and 3.2(e), respectively).
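For concreteness, the decomposed representation of the hypothesis in Figure 3.2 could be stored as plain data as in the following sketch. The container types are illustrative assumptions, not the thesis implementation.

    # Terms are symbols mapped to surface words; relation clauses r(x, y) are
    # (predicate, arg1, arg2) triples; property clauses p(x) would be stored as
    # (predicate, arg) pairs in the same structure.
    hypothesis_terms = {"x1": "Bountiful", "x2": "San Francisco",
                        "x3": "August 1945", "x4": "reached"}
    hypothesis_clauses = [("subject", "x4", "x1"),
                          ("object", "x4", "x2"),
                          ("in", "x4", "x3")]
    premise_terms = {"y1": "Bountiful", "y2": "war", "y3": "end",
                     "y4": "San Francisco Bay", "y5": "21 August 1945",
                     "y6": "sailing", "y7": "arrived"}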
3.1.2 The Alignment Model

Both the premise and the hypothesis are represented as terms and clauses:

D = {y1, ..., yb, d1(...), ..., dm(...)}
S = {x1, ..., xa, s1(...), ..., sn(...)}

where x1, ..., xa are the terms in the hypothesis, y1, ..., yb are the terms in the premise, s1, ..., sn are the clauses in the hypothesis, and d1, ..., dm are the clauses in the premise. In order to predict whether the hypothesis can be inferred from the premise, we first need to find an association between the terms in the premise and the terms in the hypothesis. For example, in Figure 3.2, we need to know that the term x1 in the hypothesis (Bountiful) refers to the same entity as the term y1 in the premise (Bountiful), and that the term x4 in the hypothesis refers to an event (reached) that may be the same as what y7 refers to in the premise (arrived).

Premise terms                 Hypothesis terms
y1 = Bountiful                x1 = Bountiful
y2 = war                      x2 = San Francisco
y3 = end                      x3 = August 1945
y4 = San Francisco Bay        x4 = reached
y5 = 21 August 1945
y6 = sailing
y7 = arrived

g = {(x1, y1), (x2, y4), (x3, y5), (x4, y6), (x4, y7)}

Figure 3.3: An alignment for the example in Figure 3.2

Formally, we define an alignment g to be a binary relation, i.e., a subset of the Cartesian product, between the hypothesis term set {x1, ..., xa} and the premise term set {y1, ..., yb}. A term pair (x, y) is considered to be aligned, i.e., (x, y) ∈ g, if and only if they refer to the same entity or event. Figure 3.3 shows such an alignment for the example in Figure 3.2.

Alternatively, an alignment g can be considered as a binary function defined over a hypothesis term x and a premise term y:

g : {x1, ..., xa} × {y1, ..., yb} → {0, 1}

g(x, y) = 1 if x and y are aligned, and 0 otherwise.

The function notation of alignment is straightforwardly equivalent to the relation notation:

g(x, y) = 1 ⟺ (x, y) ∈ g

In this thesis we will use these two notations interchangeably. Note that an alignment can be between an entity (noun) and an event (verb), e.g. g(sale, sell) = 1, or vice versa. It is also possible that one hypothesis term is aligned to multiple premise terms, e.g., (x4, y6) and (x4, y7) in Figure 3.3, or vice versa.

An alignment model θA gives such an alignment for any premise-hypothesis pair:

θA : D, S → g

3.1.3 The Inference Model

We have formulated the problem of predicting whether a hypothesis S can be inferred from a premise D as estimating the probability

P(D ⊨ S | D, S)

Suppose we have decomposed the premise D into m clauses d1, d2, ..., dm and the hypothesis S into n clauses s1, s2, ..., sn. The probability to be estimated becomes

P(D ⊨ S | D, S) = P(D ⊨ S | D = d1 d2 ... dm, S = s1 s2 ... sn)
                = P(d1 d2 ... dm ⊨ s1 s2 ... sn | d1, d2, ..., dm, s1, s2, ..., sn)

Since a statement is the conjunction of its decomposed clauses, whether it can be inferred from a premise is equivalent to whether all of its clauses are inferred from the premise:

P(D ⊨ s1 s2 ... sn | D, s1, s2, ..., sn) = P(D ⊨ s1, D ⊨ s2, ..., D ⊨ sn | D, s1, s2, ..., sn)

To simplify the problem, we make the assumption that whether a clause is entailed from the premise is conditionally independent of the other clauses. So

P(D ⊨ s1, D ⊨ s2, ..., D ⊨ sn | D, s1, s2, ..., sn) = ∏_{j=1}^{n} P(D ⊨ sj | D, sj)

The probability to be estimated is thus given by the following formula:

P(D ⊨ S | D, S) = ∏_{j=1}^{n} P(D ⊨ sj | D = d1 d2 ... dm, sj)
                = ∏_{j=1}^{n} P(d1 d2 ... dm ⊨ sj | d1, d2, ..., dm, sj)        (3.1)
An inference model θE gives such probabilities, that is, whether a clause from the hypothesis is entailed by a set of clauses from the premise, given an alignment g between the terms in the hypothesis and the terms in the premise:

θE : d1, d2, ..., dm, sj, g → P(d1 d2 ... dm ⊨ sj)

From Equation (3.1) we know that, given a premise-hypothesis pair and an instance of alignment, the inference model also gives the probability that the hypothesis is inferred from the premise:

θE : D, S, g → P(D ⊨ S)

3.2 Learning the Entailment Models

With the dependency-based framework consisting of two-stage models, the alignment model and the inference model, we next describe how we build these models.

In the PASCAL RTE data sets [8, 10, 22, 36, 37], every entailment example comes with a truth judgement, given by human annotators, of whether the hypothesis can be inferred from the premise. Furthermore, work has been done on manually annotating the word-level alignments for the RTE-2 data set [14]. Therefore, it is natural to adopt the machine learning methodology and learn our entailment models from those annotated data. In particular, we train both the alignment and inference models within a machine-learning framework.

3.2.1 Learning the Alignment Model

Recall from Section 3.1.2 that an alignment model gives the alignment for a premise-hypothesis pair (D, S):

θA : D, S → g

That is, for each term x in the hypothesis and each term y in the premise, it gives

θA : x, y → g(x, y)

This is a binary classification problem: given a term pair (x, y), we want to make the binary decision of the value of g(x, y) (0 or 1). We propose to use a feature vector fA(x, y) to characterize the lexical, structural, and semantic features of the terms x and y, and use a binary classification model to estimate their alignment score, g(x, y). We can use the notation θA to refer to this classification model:

θA : fA(x, y) → g(x, y)

To train such a classification model, we consider a training set with a gold-standard alignment g* for each entailment pair (D, S). Given such a training set T, we can learn an alignment model by maximizing the log-likelihood of the aligned term pairs (positive training instances):

∑_{(D,S,g*)∈T} ∑_{(x,y)∈g*} log P(g(x, y) = 1 | D, S, θA)

and minimizing the log-likelihood of the unaligned term pairs (negative training instances):

∑_{(D,S,g*)∈T} ∑_{(x,y)∉g*} log P(g(x, y) = 1 | D, S, θA)

Thus the learned model θA maximizes the log-likelihood of predicting the gold-standard alignments:

∑_{(D,S,g*)∈T} log P(g = g* | D, S, θA)

3.2.2 Learning the Inference Model

Recall from Section 3.1.3 that an inference model gives the probability that a clause from the hypothesis, sj, is entailed by a set of clauses from the premise, d1, d2, ..., dm, given an alignment g between the terms in the hypothesis and the terms in the premise:

θE : d1, d2, ..., dm, sj, g → P(d1 d2 ... dm ⊨ sj)

As in the alignment case, here we also formulate the inference prediction as a binary classification problem: we first use a feature vector fE(d1, d2, ..., dm, sj, g) to characterize the lexical, structural, and semantic features of the clauses d1, d2, ..., dm, sj given the alignment g, and then build a classification model θE to estimate the probability P(d1 d2 ... dm ⊨ sj) given such a feature vector:

θE : fE(d1, d2, ..., dm, sj, g) → P(d1 d2 ... dm ⊨ sj)
Again, we use the same notation θE for the classification model here because of its equivalence to the original inference model.

We now want to train such a model θE from a data set of positive entailment examples, T+ = {(D, S)+}, where the premises entail the corresponding hypotheses, and a data set of negative entailment examples, T− = {(D, S)−}, where the premises do not entail the corresponding hypotheses. We follow the assumption that for each entailment example (D, S), we have a gold-standard alignment g*. Additionally, we also assume that for each of the hypothesis clauses sj, we have the ground truth of whether it is entailed from the premise D, given the gold-standard alignment g*. We use S+(D, S, g*) to denote the set of clauses in S that are entailed from D given g* (positive training instances), and S−(D, S, g*) to denote the set of clauses in S that are not entailed from D given g* (negative training instances). Then an inference model can be learned to maximize the log-likelihood:

∑_{(D,S,g*)∈T+} ∑_{sj∈S} log P(D ⊨ sj | D, sj, g*, θE)
+ ∑_{(D,S,g*)∈T−} ∑_{sj∈S+(D,S,g*)} log P(D ⊨ sj | D, sj, g*, θE)
+ ∑_{(D,S,g*)∈T−} ∑_{sj∈S−(D,S,g*)} log P(D ⊭ sj | D, sj, g*, θE)

Note that for T+, S+(D, S, g*) = S and S−(D, S, g*) = ∅ (every clause in the hypothesis should be entailed from the premise). As such, a learned model θE also maximizes the log-likelihood of giving the right entailment judgement for each premise-hypothesis pair:

∑_{(D,S,g*)∈T+} log P(D ⊨ S | D, S, g*, θE) + ∑_{(D,S,g*)∈T−} log P(D ⊭ S | D, S, g*, θE)

3.3 Feature Design

Section 3.2 gave the framework for learning the alignment and inference models from an annotated data set. This section discusses the indicative features that are used in learning these models.

3.3.1 Features for the Alignment Model

As introduced in Section 3.2.1, a feature vector for the alignment model, fA(x, y), is defined over a term x from the hypothesis and a term y from the premise. In theory, a verb term and a noun term can potentially be aligned together. However, to simplify the problem, here we restrict the problem to the alignment between two nouns or two verbs. We designed different feature sets according to whether x and y are nouns or verbs.

Features for Noun Term Alignment

If x and y are noun terms, the feature vector fA(x, y) is composed of:

1. String equality: whether the string forms of x and y are equal.

2. Stemmed equality: whether the stems of x and y are equal.

3. Acronym equality: whether one term is the acronym of the other, e.g., Michigan State University and MSU.

4. Named entity equality: whether two names refer to the same entity, e.g., President Obama and Barack Obama are the same person. Our simple approach to estimating the equivalence of two named entities is to compare the right-most terms in the two names (e.g., Obama in the above example).

5. WordNet similarity [54]: a similarity measurement of the two terms based on the WordNet taxonomy:

   simW(x, y) = 2 log P(Cxy) / (log P(Cx) + log P(Cy))

   where Cx is the WordNet class containing x, Cy is the WordNet class containing y, Cxy is the most specific class that subsumes both Cx and Cy, and P(C) is the probability that a randomly selected object belongs to C.

6. Distributional similarity: a similarity measurement of the two terms based on the Dice coefficient of their distributions in a large text corpus:

   simD(x, y) = 2 |Dx ∩ Dy| / (|Dx| + |Dy|)

   where Dx is the set of documents that contain the term x, and Dy is the set of documents that contain the term y.
We use the AQUAINT [39] news corpus as the document collection here.

Features for Verb Term Alignment

To learn the alignment model for verb terms, we use most of the features that are similar to those in the noun alignment model, including string equality, stemmed equality, WordNet similarity, and distributional similarity. However, we also designed a few more features specialized to verb alignment. One of these features is the verb be identification, which identifies whether either of the two verbs, x from the hypothesis and y from the premise, is any form of the verb be:

fvb(x, y) = 1 if both or neither of x and y is the verb be, and 0 otherwise.

Furthermore, an action/event is identified not only by the class or type of the action/event, which is described by the verb, but also by the executer and receiver of the action or the participants in the event, which are described by the verb's arguments. Here we consider two types of arguments: subject and object.

• Two action/events are not the same if their subjects (when present) are different, e.g., A laughed and B laughed;

• Two action/events are not the same if their objects (when present) are different, e.g., A watched TV and A watched a football game.

Note that action/events could be identified by other arguments or adjuncts too, for example, temporal phrases as in A went to New York in 1970 and A went to New York last week. Here, we take a consistent approach that only identifies the action/events by the verbs along with their subjects/objects, leaving the identification of other adjuncts such as temporal phrases to downstream processes. So we designed two additional features to model the argument consistency of the verbs x and y:

1. Subject consistency: whether the subjects of x and y (when present) are consistent;

2. Object consistency: whether the objects of x and y (when present) are consistent.

To characterize the consistency of the arguments (subjects and objects) between a hypothesis verb x and a premise verb y, we developed a simple approach as a baseline. Take subject consistency for example: we let sx be the subject term of the verb x in the hypothesis, and let sy be the aligned term of sx in the premise (if there are multiple terms aligned with sx, let sy be the one that is closest to y in the dependency structure of the premise). The subject consistency of the verbs (x, y) is then measured by the distance between sy and y in the dependency structure of the premise. The idea here is that if x and y are aligned, then for the subject of x, sx, its aligned counterpart in the premise (sy) should also be the subject of y. The distance between sy and y characterizes (primitively) the possibility of sy being y's subject. Similarly, the object consistency of (x, y) is measured by the distance between the verb y and the aligned object of x.

An Example of Feature Estimation for Verb Alignment

Here we demonstrate how we estimate the features for verb alignment, using the example in Figure 3.2. In particular, we show the feature values used to decide the alignment between the hypothesis term x4 = reached and the premise term y7 = arrived. The values of the primary features for this alignment are:

• String equality: 0
• Stemmed equality: 0
• WordNet similarity: 0.84
• Distributional similarity: 0.10
• Verb be identification: 1

Next we check the subject and object consistencies for the pair of verbs. Here we illustrate the object consistency as an example.
We first find the object of x4 in the hypothesis, x2 = San Francisco. Assuming we have the result from the noun term alignment model that x2 in the hypothesis is aligned to y4 in the premise (y4 = San Francisco Bay), we can then get the distance between y4 and y7 in the dependency structure of the premise (see Figure 3.2), which is 2 (y4 ~ y6 ~ y7). As such, the argument consistency features for the verb pair (x4, y7) have the values:

• Subject consistency: 1 (the distance between y7 and the aligned term of x4's subject)

• Object consistency: 2 (the distance between y7 and the aligned term of x4's object, as illustrated above)

3.3.2 Features for the Inference Model

In Section 3.2.2 we introduced the inference model, which predicts the probability that a hypothesis clause sj is entailed from a set of premise clauses d1 ... dm, given a feature vector fE describing these clauses with an alignment g between the terms in them:

θE : fE(d1, d2, ..., dm, sj, g) → P(d1 d2 ... dm ⊨ sj)

We designed different feature sets according to whether sj is a property clause or a relational clause.

Features for Property Inference Model

If sj is a property clause, i.e., it takes one argument and can be denoted as sj(x), then for it to be inferred, we would like x's counterparts (i.e., aligned terms) in the premise to have the same or a similar property. Therefore, we look for all the property clauses in the premise that describe the counterparts of x, i.e. a clause set D' = {di(y) | di(y) ∈ D, g(x, y) = 1}. For example,

Premise: I've just heard some old songs. They're wonderful!
Hypothesis: I heard good music.

Consider the property clause good(x2) in the hypothesis with the term x2 = music. Suppose that x2 is aligned to two terms in the premise, y2 = songs and y4 = they; then D' = {some(y2), old(y2), wonderful(y4)}.

We then design a set of features to characterize the similarity between the clause sj and the clauses in D'. These features are similar to those used in the alignment models in Section 3.3.1:

1. String equality: whether any of the clauses in D' is the same as sj;

2. Stemmed equality: whether any of the clauses in D' has the same stem as sj;

3. WordNet similarity: calculate the WordNet similarity (see Section 3.3.1 for the definition) between each clause in D' and sj, and pick the maximum;

4. Distributional similarity: calculate the distributional similarity (see Section 3.3.1 for the definition) between each clause in D' and sj, and pick the maximum.

In the above example, one property of y4, wonderful(y4), has a high similarity to the property good(x2), so we can predict that good(x2) is entailed from the premise.

Features for Relational Inference Model

If sj is a relational clause, i.e., it takes two arguments and can be denoted as sj(x1, x2), then for it to be inferred, we would like the same or a similar type of relation to exist in the premise between x1's and x2's counterparts.
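Before the relational case is developed in detail, the property-inference features just described can be sketched as follows. The clause and similarity helpers (is_property, argument, predicate, stem, wn_sim, dist_sim) are assumed interfaces, not the thesis implementation.

    def property_inference_features(s_j, premise_clauses, alignment,
                                    stem, wn_sim, dist_sim):
        """s_j: a property clause with fields .predicate and .argument;
        alignment(x, y) is 1 if hypothesis term x aligns with premise term y."""
        # D' = premise property clauses whose argument is aligned with s_j's argument
        d_prime = [d for d in premise_clauses
                   if d.is_property and alignment(s_j.argument, d.argument)]
        props = [d.predicate for d in d_prime]
        return [
            1.0 if any(p == s_j.predicate for p in props) else 0.0,             # string equality
            1.0 if any(stem(p) == stem(s_j.predicate) for p in props) else 0.0, # stemmed equality
            max((wn_sim(p, s_j.predicate) for p in props), default=0.0),        # WordNet similarity
            max((dist_sim(p, s_j.predicate) for p in props), default=0.0),      # distributional sim.
        ]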
To model such a relation, we look for the sets of terms in the premise that are aligned with x1 and x2, respectively:

D'1 = {y | y ∈ D, g(x1, y) = 1}
D'2 = {y | y ∈ D, g(x2, y) = 1}

We then model the relations between the terms in D'1 and the terms in D'2. As a baseline approach, here we develop only one feature to model these relations: the closest distance between these two sets of terms in the dependency structure of the premise:

fr(D, sj, g) = min_{y1∈D'1, y2∈D'2} dist(y1, y2)

The idea here is simple: the closer two terms are in a dependency structure, the more likely it is that they have a direct relationship. Since these relations are mostly syntactic relations (e.g., subject, object, etc.), we made the assumption that the closest relation found between D'1 and D'2 is of the same type as the relation sj between x1 and x2.

An Example of Feature Estimation in Inference Model

We use the example in Figure 3.2 to illustrate how features are calculated for the inference model. Suppose the alignment for this example is the one shown in Figure 3.3; then the inference features for each clause in the hypothesis are shown in Table 3.1.

Table 3.1: Calculating the features of the inference model for the example in Figure 3.2

Hypothesis clause                  subject(x4, x1)   object(x4, x2)   in(x4, x3)
Clause type                        relational        relational       relational
Terms in this clause               x4, x1            x4, x2           x4, x3
Aligned terms in the premise       {y6, y7}, {y1}    {y6, y7}, {y4}   {y6, y7}, {y5}
Closest term pair in the premise   (y7, y1)          (y6, y4)         (y6, y5)
Minimal distance fr                1                 1                1

3.4 Post Processing

According to Equation (3.1), when our inference model predicts that each of the clauses sj in a hypothesis is entailed from the clauses d1 ... dm in a premise, the whole hypothesis S is determined to be entailed from the premise D. However, this is not always true, due to certain linguistic phenomena, in particular polarity and monotonicity. In our entailment system, we developed a post-processing routine to deal with these issues.

3.4.1 Polarity Check

Consider the following example:

Premise: Around this time, White decided that he would not accept the $10,000 Britannia Award and another Miles Franklin Award for his work.
Hypothesis: White got the Britannia Award.

The hypothesis contains the following terms and clauses:

Terms: x1 = White, x2 = Britannia Award, x3 = get
Clauses: subject(x3, x1), object(x3, x2)

When the alignment between the hypothesis and the premise contains the term pairs

(x1, he), (x2, Britannia Award), (x3, accept)

all the clauses in the hypothesis can be inferred from the premise:

subject(x3, x1): x1's aligned term (he) is the subject of x3's aligned term (accept) in the premise.
object(x3, x2): x2's aligned term (Britannia Award) is the object of x3's aligned term (accept) in the premise.

However, in this example the entire hypothesis is clearly not entailed. This is because in the premise there is a negative adverb not applying to the verb accept.
A set of negative modifiers that we recognize are listed in Table 3.2. 3.4.2 Monotonicity Check The monotonicity assumption states that, when a statement is true, adding any context would not affect the truth of that statement. This assumption, which may be true in the most studied formal logic, is however not the case in natural language. For example: Premise: He said that “there is evidence that Cristiani was involved in the murder of the six Jesuit priests” which occurred on 16 November in San Salvador. Hypothesis: Cristiani killed six Jesuits. The hypothesis Cristiani killed six Jesuits can be sufficiently inferred from the state- ment Cristiani was involved in the murder of the six Jesuit priests. However, this example is a false entailment because the entailing statement is in a context of he said that “. .. ”. 50 So after our system makes an positive entailment prediction, we also check against the monotonicity assumption. Our approach is to search the context of the entailing statement, namely, all the upward nodes from the head of that statement in the parse tree. If any of these nodes contain a non-monotonic context, the entailment prediction is changed to false. From training data (see Section 3.5) we identified the types of words that signal non-monotonic text. They usually contain one of the following meanings: 1. Indicating a statement is someone’s claim or declaration, e.g., say; 2. Indicating a statement is someone’s vision or imagination, e.g., think; 3. Indicating a statement is someone’s intended outcome, e.g., suggest; 4. Indicating a statement is questioned, e.g., ask; 5. Indicating a statement is hypothesized, e.g., suppose; 6. Indicating something is desired but may not actually happened, e.g., prefer; 7. Indicating something is permitted but may not actually done, e.g., allow; 8. Indicating something is weakly perceived but not attested or confirmed, e.g., hear (note that words expressing strong perception are considered to indicate true entailments, e.g., witness); 9. Indicating something is planned or happens in the future, e.g., decide; 10. Indicating something happened in the past, e.g., use to; 11. Indicating something is fake, e.g., pretend. We further expanded this set of non-monotonic contexts by adding the synonyms of the recognized words. The expanded set of non-monotonic contexts are listed in Table 3.3. 
Table 3.3: The list of non-monotonic contexts (an extensive set also includes their derivative forms, e.g., thought)

advertise, advise, aim, allege, allow, announce, anticipate, argue, arrange, articulate, ask, assert, assume, attempt, authorize, beg, believe, call for, can, choose, claim, conceive, conjecture, consider, dare, decide, declare, deem, demand, deserve, desire, determine, discuss, divine, dream, elect, enable, encourage, enjoy, enunciate, envisage, expect, express, fancy, feel, forecast, foresee, foretell, formulate, going to, guess, have to, hear, hope, hypothesize, imagine, inquire, insist, intend, let, like, likely, look forward to, love, may, maybe, mean, might, must, need, negotiate, obligate, offer, opt, order, ought to, perhaps, permit, phrase, picture, plan, plead, pose, possible, postulate, potential, predict, prefer, premise, prepare, presume, presuppose, pretend, probable, proffer, project, promise, pronounce, propose, propound, put, question, reckon, recommend, report, request, require, say, seek, select, shall, solicit, speculate, state, suggest, suppose, surmise, suspect, swear, tell, tend, think, try, urge, use to, vision, visualize, vote, vow, want, will, wish, wonder, write

3.5 Experimental Results

We chose the textual entailment data from the PASCAL-3 RTE Challenge [36] for our experiments. There are 800 entailment examples in the development set and 800 entailment examples in the test set. In order to train our entailment models, we first decomposed the premises and hypotheses in the development set into sets of terms and clauses, and then manually annotated the data set with the ground-truth judgements described in Section 3.2, namely:

• For each term x in the hypothesis and each term y in the premise, we annotated the label of g(x, y) (whether x and y are aligned);¹

• For each clause sj in the hypothesis, we annotated whether it is entailed from the clauses d1 ... dm in the premise (the truth value of d1 ... dm ⊨ sj).

¹ The RTE-2 data set has word-level alignment annotation available [14], which is also an option for deriving the ground truth for term-level alignments.

[Figure 3.4 plots precision, recall, and f-measure of verb alignment against the output threshold from 0 to 1.]
Figure 3.4: Evaluation results of verb alignment for textual entailment

We then evaluated the results for both the alignment decision and the entailment prediction.

3.5.1 Alignment Results

We trained two logistic regression models from the annotated development data, one for noun term alignment and one for verb term alignment. Since we have gold-standard alignments only for the development data (not for the test data), we evaluate their performance by cross-validation on the development data. The evaluation is based on pairwise judgements: for a term pair (x, y), where x is from a hypothesis and y is from a premise, whether the model correctly predicts the value of their alignment function g(x, y) (0 or 1). Since the class distribution of alignment judgements is extremely unbalanced (among all possible pairings of two terms, only a small portion are aligned pairs), we evaluate the alignment results by precision and recall of positive alignments.

The alignment for noun terms achieved 96.4% precision and 94.9% recall. This performance is relatively satisfying, and we consider it sufficient for downstream processes.

The alignment for verb terms, however, performs significantly lower. The standard logistic regression model gives 52.4% precision and 16.4% recall.
Since the recall performance is especially low, and recall is actually more important to the downstream process (i.e., the inference model), efforts were made to balance precision and recall. Our mechanism was to adjust the output threshold of the logistic regression model: the lower the threshold, the more positive results (i.e., aligned term pairs) the model predicts, giving lower precision and higher recall; the higher the threshold, the more negative results (i.e., unaligned term pairs) it predicts, giving higher precision but lower recall. We experimented with different thresholds from 0.1 to 0.9, and the results are shown in Figure 3.4.

We can see that the combined performance of precision and recall (i.e., the f-measure) reaches its maximum when the threshold is set to 0.3. Under this setting the verb alignment model has a performance of 48.9% precision, 32.8% recall, and 39.3% f-measure.

3.5.2 Entailment Results

We then trained two logistic regression models for the inference model, namely, the property inference model and the relational inference model. The models were trained on the annotated examples of the development set and applied to the test set. We evaluated the results predicted by these models. Among the 800 test examples, the entailment predictions made by our models achieved an accuracy of 60.6%. Comparing this result to the median performance of the participating systems in the PASCAL-3 RTE Challenge [36] (61.8%), the difference is not statistically significant (z-test, p = 32%).

As discussed in Section 2.1.4, the key issue that distinguishes the performance of different systems is the amount of knowledge they use. In our implementation, we used knowledge sources and language tools no more than those publicly available, such as the Stanford parser [52], OpenNLP tools², WordNet [64], and the AQUAINT Corpus of English News Text [39]. Therefore, the fact that the performance of our implementation is on par with the median performance in RTE-3 provides a reasonable baseline for processing conversation entailment.

² http://opennlp.sourceforge.net/

Chapter 4

An Initial Investigation on Conversation Entailment

As an initial investigation, we followed the PASCAL practice and created a database of examples of conversation entailment, and we tested the dependency-based approach on the collected data. In this chapter, we describe our data collection and annotation procedure, analyze the collected data, and report the results from our initial investigation.

4.1 Problem Formulation

Following the PASCAL practice [8, 10, 22, 36, 37], here we consider the conversation entailment problem as inferring a single natural language statement, or a declarative sentence, from a conversation. Similar to the formulation in Section 3.1, we use S to represent the statement which is the hypothesis in question, and use D to represent the premise from which the hypothesis is to be inferred. In this case the premise D is a conversation segment. We say that D entails S if and only if the meaning of S can be sufficiently inferred from the premise D, and write it as

D ⊨ S

Similarly, if D does not entail S, we write

D ⊭ S

Also similar to the case of textual entailment in PASCAL, the definition here is not strict. Rather, it is based on the agreement of most intelligent human readers, given general background knowledge. That means the standard is not whether the hypothesis is logically entailed from the premise, but whether it can be reasonably inferred by human readers.
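As an illustration of how a conversation entailment example can be represented in this setting, consider the following sketch; the field names are assumptions made for exposition, not the annotation format actually used in this work.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class EntailmentExample:
        segment: List[Tuple[str, str]]   # premise: a sequence of (speaker, utterance) turns
        hypothesis: str                  # a single declarative statement
        entailed: bool                   # whether D entails S by human judgement

    example = EntailmentExample(
        segment=[("B", "My mother also was very very independent. She had her own, "
                       "still had her own little house and still driving her own car,"),
                 ("A", "Yeah."),
                 ("B", "at age eighty-three.")],
        hypothesis="B's mother is eighty-three.",
        entailed=True)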
Table 4.1 gives a few examples of premise-hypothesis pairs and whether each hypothesis is entailed by the corresponding premise. These examples show that conversations are different from written text. Utterances in a conversation tend to be shorter, with disfluency, and are sometimes incomplete or ungrammatical. These examples also show the importance of modeling the conversation context. One utterance can span several turns (e.g., the utterance of B in Example 1). Pronouns are frequently used and may require special treatment (e.g., you in Example 2).

Table 4.1: Examples of premise-hypothesis pairs for conversation entailment

ID  Premise                                             Hypothesis                          Entailed
1   B: My mother also was very very independent.        B is eighty-three.                  False
    She had her own, still had her own little house
    and still driving her own car,                      B's mother is eighty-three.         True
    A: Yeah.
    B: at age eighty-three.
2   A: sometimes unexpected meetings or a client        Sometimes a client wants to see B.  False
    would come in and would want to see you,
    B: Right.                                           Sometimes a client wants to see A.  True

4.2 Types of Inference from Conversations

In the textual entailment exercise, almost all hypotheses are about facts that can be inferred from the text segment. This is partly due to the fact that newswire articles mainly report significant events, and partly due to how the data is collected. From conversations, however, we can infer different types of information. It could be some opinion of the world held by the participants, some facts about the participants (assuming the speakers are telling the truth), and communicative relations between the participants (e.g., A disagrees with B).

In this work, we particularly focus on inference about conversation participants. This is because understanding conversation participants is key to any application involving conversation processing: either acquiring information from human-human conversation or enabling human-machine conversation. In human-human conversation, correct hypotheses about conversation participants can benefit many applications such as information extraction and knowledge discovery from conversation data. In human-machine conversation, better understanding of its conversation partners will enable more intelligent system behavior. Specifically, we are interested in the following four types of inference:

• Fact. Facts about the participants. This includes:

  1. Profiling information about individual participants (e.g., occupation, birth place, etc.);
  2. Activities associated with individual participants (e.g., A bikes to work every day);
  3. Social relations between participants (e.g., A and B are co-workers, A and B went to college together).
Participants’ deliberated intent, in particular communicative intention which captures the intent from one participant on the other participant such as whether A agrees/disagrees with B on some issue, whether A intends to convince B on something, etc. Most of these types are motivated by the Belief-Desire-Intention (BDI) model [2], which represents key mental states and reflects the thoughts of a conversation par— ticipant. Desire is different from intention. The former arises subconsciously and the latter arise from rational deliberation that takes into consideration desires and beliefs [2]. The fact type represents the facts about a participant. Both thoughts and facts are critical to characterize a participant and thus important to serve many other downstream applications. 4.3 Data Preparation Currently there is no data available to support the research on conversation entail- ment. Therefore, as a first step, we have developed a database of entailment examples 59 with different types of hypotheses to facilitate algorithmic development and evalua- tion. 4.3.1 Conversation Corpus The data was collected from the Switchboard corpus [38]. It is a corpus of make-up phone calls, where the participants, who do not know each other, exchange ideas and discuss issues of interest. These conversations are casual and free—form compared to goal—driven conversations (e.g., conversation about how to install a computer pro— gram). Inference from this set of conversations can be more challenging since the goals/subgoals are not explicit and topic evolvement can be unpredictable. All of the conversations in this corpus have been transcribed by human annotators. A portion of it has been annotated with syntactic structures, disfluency markers, and discourse markers as a part of Penn Treebank [61]. As we are mainly interested in semantic analysis and inferring information from the conversations, we work on the conversation transcripts directly. 4.3.2 Data Annotation We selected 50 conversations from the Switchboard corpus. In each of these conver- sations, two participants discuss a topic of interest (e.g., sports activities, corporate culture, etc), and has a full annotation of syntactic structures, disfluency markers, and discourse markers. We chose the conversations with annotation because the available annotations will enable us to conduct systematic evaluations of developed techniques, for example, by comparing performance of inference based on annotated information versus automatically extracted information from conversation. We had 15 volunteer annotators read the selected conversations, and created a total of 1096 entailment examples. Each example consists of a segment from the con- 60 versation, a hypothesis statement, and a truth value indicating whether the hypothesis can be inferred from the conversation segment, given the contextual information from the whole history of that conversation session. The following guidelines are followed during the creation of entailment examples: 0 The number of examples is balanced between positive entailment examples and negative entailment ones. That is, roughly half of the hypotheses are entailed from the premise, and half of them are not. 0 Special attention is given to negative entailment examples, since any arbitrary hypotheses that are completely irrelevant will not be entailed from the conver- sation. 
So in order not to make the prediction of false entailment too trivial, a special guideline is enforced to come up with “reasonable” negative exam- ples: the hypotheses should have a major portion of words overlapping with the premise. A recent study shows that for many NLP annotation tasks, the reliability of a small number of non-expert annotations is on par with that of an expert annotator [79]. It is also found that for tasks such as affection recognition, an average of four non- expert labels per item are capable of emulating expert-level label quality. Based on this finding, in our study the entailment judgement for each example was further independently annotated by four annotators (who were not the original contributors of the hypotheses). As a result, on average each entailment example (i.e., a pair of conversation segment and hypothesis) received five judgements, including the one given by the original annotator (i.e. creator of the hypothesis). 4.3.3 Data Statistics In total we collected 1096 entailment examples from the annotators. In this section we will analyze the collected data and give some important statistics. 61 600 I I I I T l I I 500 - Number of Examples w a O O O O N O O I 100 l 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 Agreement Figure 4.1: Agreement histogram of entailment judgements As the most important annotation is the judgement of truth values, that whether a hypothesis can be inferred from the premise, it is essential to investigate how reliable those judgements are from our annotators, who are average native English speakers. As described in Section 4.3.2, we have five entailment judgements from different annotators for each premise-hypothesis pair. Figure 4.1 gives a histogram of the agreements of collected judgements. From the figure we can see that inference from conversations is a difficult task, for only 53% of all the examples (586 out of 1096) are agreed by all human annotators. Some of the disagreements are due to the ambiguity of the language itself, for example: Premise: A: Margaret Thatcher was prime minister, uh, uh, in India, so many, 62 uh, women are heads of state. Hypothesis: Margaret Thatcher was prime minister of India. In the conversation utterance of speaker A, the prepositional phrase in India is ambiguous because it can either be attached to the preceding sentence, Margaret Thatcher was prime minister, which sufficiently entails the hypothesis, or it can be attached to the succeeding sentence, so many women are heads of state, which leaves it unclear which country Margaret Thatcher was prime minister of. In some other instances of disagreements, the hypotheses are often not directly inferred from the text, but can be inferred after a few more steps of reasoning. Those reasonings often involve assumptions on conversational implicature or coherence. For example: Premise: A: Um, I had a friend who had fixed some, uh, chili, buffalo chili and, about a week before went to see the movie. Hypothesis: A ate some buffalo chili. Premise: B: Um, I’ve visited the Wyoming area. I’m not sure exactly where Dances With Wolves was filmed. Hypothesis: B thinks Dances With Wolves was filmed in Wyoming. In the first example, a listener would assume that A follows the maxim of relevance, so that when she mentions the fixing of buffalo chili at this point in the conversation, 63 Table 4.2: Distribution of hypothesis types Count Percentage Fact 416 48.3% Belief 299 34.7% Desire 54 6.3% Intent 92 10.7% it is relevant. 
A most natural inference that would make the fixing of buffalo chili relevant is that A ate the buffalo chili. In the second example, when the speaker A mentions a visit to the Wyoming area and expresses a lack of knowledge of the filming place of Dances With Wolves, the entire utterance is assumed to be coherent. This means in the speaker’s mind, the Wyoming area must have some relationship with the filming of Dances With Wolves, although she does not know where exactly in the Wyoming area that movie was filmed. Given the fact that the inference from conversations is already so dificult even for human readers, it is expected to be much more challenging for computer sys- tems. Therefore for the first step we will focus our preliminary experiments on 875 entailment examples that have agreements greater than or equal to 75%. For the 875 entailment examples that have good agreements (Z 75%), we observe a slight imbalance between the positive entailment class and the negative entailment class. The ratio is 4742401 (54%:46%), with a bias toward the positive class. This also sets up a natural baseline for our entailment prediction system, as a majority guess approach (i.e. always guess positive for a data set that is biased to the positive class) will achieve 54% prediction accuracy, expectedly. The distributions of four hypotheses types among the 875 data set are shown in Table 4.2. 64 4.4 Experimental Results We applied the same dependency approach (as in Chapter 3) to the conversation en- tailment data. This section presents our preliminary experiments and initial findings. 4.4. 1 Experiment Setup As described in Section 4.3, our data set of conversation entailment consists of 875 premise-hypothesis pairs created from 50 conversations. To facilitate follow-up inves- tigations, we further divided the 875 examples into two sets: a development set and a test set. We select one third of the examples as development data and two third as test data. The division is governed by the following guidelines: 1. No instances from the same conversation are divided into two different sets, since we will potentially train our computational models from the development data and apply them on the test data; 2. The ratio between positive and negative instances should remain roughly the same for both the development and the test data sets; 3. The distribution of four hypothesis types (fact, belief, desire, intent) should remain roughly the same on both the development and the test data sets. As a result, we selected 291 examples from 15 conversations as the development set and 584 examples from 35 conversations as the test set. The positive/ negative ratio and the distribution of hypothesis types in both data sets are presented in Table 4.3. Similar to the discussion in Section 4.3.3, the natural baseline by always guessing the majority class can achieve accuracies of 56.4% and 53.1% on the development and test data sets, respectively. 
Table 4.3: The split of development and test data for conversation entailment

                             Total    Development    Test
  Conversations                 50             15      35
  Premise-hypothesis pairs     875            291     584
  Positive entailments       54.2%          56.4%   53.1%
  Negative entailments       45.8%          43.6%   46.9%
  Fact hypotheses            48.3%          47.1%   48.9%
  Belief hypotheses          34.7%          34.0%   35.1%
  Desire hypotheses           6.3%          10.7%    4.0%
  Intent hypotheses          10.7%           8.2%   11.9%

4.4.2 Results on Verb Alignment

As shown in Section 3.5.1, the alignment of noun terms is a relatively easy task, for which our current model can already be considered sufficient and gives satisfying results. Thus, in the follow-up evaluations of alignment models, we focus on the alignment results for verb terms.

We applied the alignment model learned from the textual entailment data directly to the conversation entailment data. The first two series in Figure 4.2 (Development and Test) show the f-measures of the alignment results on the development set and the test set, respectively (here Development is only the name of a data set; it is not yet used to develop our system models).

Similar to Section 3.5.1, here we also evaluate a series of results with different output thresholds for the logistic regression model. Both the f-measure for the development set and that for the test set reach their maximum when the threshold is set to 0.7 (24.9% and 32.3%, respectively). However, when we plot the f-measures from the textual entailment task as a third series (Text) in Figure 4.2, we can see that the maximum performance of conversation alignment is significantly lower than the maximum performance of text alignment (39.3%). This shows that the alignment model learned from textual entailment is not sufficient to tackle conversation entailment.

[Figure 4.2: Evaluation results of verb alignment using the model trained from text data. X-axis: threshold; Y-axis: f-measure; series: Development, Test, Text.]

4.4.3 Verb Alignment for Different Types of Hypotheses

We broke down the evaluation of verb alignment (with threshold 0.7) by hypothesis type, and the results are shown in Figure 4.3. For both the development and the test data sets, the performance is better for the fact type than for most other types. For fact hypotheses, the alignment f-measures are consistent between the development and test data (30.7% and 37.3%, respectively), and are also close to that on the text data (39.3%). However, the f-measures for the other types of hypotheses are not so consistent, especially for the desire and intent types. This is because there are not many instances in these two subsets of the data. Nevertheless, if we combine the results on the development and test data for these two subsets, we get f-measures of 31.0% for desire and 27.0% for intent.

[Figure 4.3: Evaluation results of verb alignment for different types of hypotheses (Development and Test).]

In summary, when we apply the alignment model learned from textual entailment to the task of conversation entailment, it handles the alignments for fact and desire types of hypotheses with relatively acceptable performance. The limitation of the current model is mostly revealed when dealing with belief and intent types of hypotheses.

4.4.4 Results on Entailment Prediction

Again, we applied the inference models learned from the textual entailment data directly to the conversation entailment data.
Figure 4.4 shows the performances for both the development set and the test set. The overall prediction accuracies for the two data sets are 48.5% and 53.1%, respectively.

Similar to what we found in the alignment evaluation, the models that were reasonable for predicting textual entailment now produce significantly lower performance on the conversation data. In fact, the model predictions did not even beat the baseline of majority guess, which (as given in Section 4.3.3) is 56.4% for the development set and 53.1% for the test set. This is probably because our approach takes a rather strict standard, i.e., it tends to predict negative entailments rather than positive entailments. As a result, the more a data set is biased toward the positive class (e.g., the development set), the less accurately our approach performs.

[Figure 4.4: Evaluation results of entailment prediction using models trained from text data (Development and Test, broken down by Overall, Fact, Belief, Desire, Intent).]

Could the performance difference be attributed to the different sources of training data? We further experimented with training our entailment models (both alignment and inference models, with the same set of features as in the textual entailment case) on the development data of conversation entailment and evaluating them on the test data. This resulted in an accuracy of 52.4%. The newly trained models show no advantage compared to the previous models trained from textual entailment data. Therefore, in the follow-up investigations we will still use the previous result (53.1% accuracy on the test data) as the baseline.

Figure 4.4 also shows the break-down of entailment performance by hypothesis type. Again we see that the current models perform better for the fact type than for the other three types.

The initial results on conversation entailment suggest that only applying approaches from textual entailment will not be enough to handle entailment from conversations, especially the entailment of belief, desire, and intent types of hypotheses. Consideration of the unique behaviors of conversations is important for tackling the conversation entailment problem.

Chapter 5

Incorporating Dialogue Features in Conversation Entailment

Dialogues exhibit very different language phenomena compared to monologue text. As a result, an algorithm framework that is designed to recognize entailment from text will not be sufficient to process conversation entailment. In order to effectively predict entailment from conversations, we need to model unique features of the conversations [92]. In this chapter we discuss the modeling of two types of features: linguistic features in conversation utterances and structural features of conversation discourse.

5.1 Linguistic Features in Conversation Utterances

Compared to newswire texts, which are mostly formal and standardized, spontaneous conversations tend to have much more linguistic variation, which dramatically increases the difficulty of recognizing entailments from them. These variations mainly include disfluency, syntactic variations, and special usage of language.

5.1.1 Disfluency

Oral conversations contain different forms of disfluency that break the normal structure of language. Below is a list of some types of disfluency.

1. Filled pause

When people are thinking, hesitating, or pausing their conversation for other reasons while they speak, they tend to produce words like uh, um, huh, etc.
These insertions have no semantic content, but they break the flow of communication.

2. Explicit editing term

These are words that have some semantic content but do not carry much actual meaning, such as I mean, sorry, excuse me, etc. For example:

A: Oh, yeah, uh, the whole thing was small and, you, I mean, you actually put it on.

They usually occur between the restart and the repair (and are as such "explicit").

3. Discourse marker

Discourse markers do not carry much meaning either, but they have a wider distribution than explicit editing terms. Such words include like, so, actually, etc. For example:

B: I think that was better than like Showbiz Pizza cause there's more of them to do.

Because discourse markers can appear almost anywhere in a sentence, they are much more likely to be confounded with content words that take the exact same form, and thus create ambiguities. Take the word like for example and compare its roles in the sentences They're like bermuda shorts and They're, like, bermuda shorts. In the first case the shorts are not bermudas (they only look like them), while in the second case they are.

4. Coordinating conjunction

These are conjunctions like and, and then, and so, etc. But unlike regular conjunctions, they carry no semantic meaning and serve only a coordinating role. Example:

A: And he usually is good about staying within them, although our next door neighbors have a dog, too, and, uh, she, she is good friends with my dog.
B: Oh, yeah?
A: And so he often gets to smelling her scent and will go over there to sniff around and stuff.

5. Aside

An aside is a longer sequence of words that is irrelevant to the meaning of the main sentence. It interrupts the fluent flow of the sentence, and the sentence later picks up from where it left off. For example:

B: I, uh, talked about how a lot of the problems they have to come, overcome to, uh, it's a very complex, uh, situation, to go into space.

6. Turn interruption

The speech of a speaker can be interrupted in the middle by another person and then continued and completed by the same speaker. For example:

B: I thought it was kind of a strange topic about corruption in the government and --
A: Yeah.
B: -- uh, how many people are self serving.

7. Incomplete sentence

Sometimes a sentence is incomplete. This may be because it is interrupted by another speaker and then discontinued, or because it is simply left unfinished by the speaker. For example:

A: We've had him for, let's see, he just had his fourth birthday.

8. Restart

A restart happens when a part of a sentence is canceled by the speaker and then fixed by a repairing part. A restart can be a simple substitution:

A: Show me flights from Boston on, uh, from Denver on Monday.

or a more complicated case where there is a restart within a restart (a nested restart):

A: I liked, uh, I, I liked it a lot.

5.1.2 Syntactic Variation

Oral conversations have unique syntactic behaviors that rarely occur in written newswire articles. We summarize a few phenomena as follows.

1. Dislocation and movement

A dislocation describes the case in which a sentence constituent (which is dislocated) is associated with a resumptive pronoun. For example, in

A: John, I like him a lot.

John is associated with him, which constitutes a left-dislocation. And in

A: One of the problems they're facing now, a lot of people now, is that the small business can't offer health insurance.

a lot of people is associated with they, which constitutes a right-dislocation.
A similar case is the movement of appositives. While it is very much like dislocation, the only difference is that the moved appositive is associated with a regular noun phrase rather than a pronoun. For example:

B: Her father was murdered, her father and three other guys up here in Sherman.

In both of these cases, it is critical to recognize the dislocated or moved constituent and identify the original element it is associated with.

2. Subjectless sentence

In strictly grammatical text, sentences without subjects may in most cases be considered imperatives. In conversations, however, the use of empty subjects is allowed in non-imperative contexts. For example:

B: You know, I think you are right. I think it is Raleigh.
A: Think so?

In this example, a completed form of speaker A's sentence would be Do you think so?. Here is another example:

B: Later I tried to get the baby to a baby-sitter. Supposed to be good, recommended person from the church, and I knew her personally.

for which an unabbreviated form would be She was supposed to be good. In order to get the actual meaning of a subjectless sentence, a way to recover what has been omitted is desired.

3. Double construction

Double constructions are rarely seen in written, textual English, and thus need special treatment for both syntactic parsing and semantic interpretation. These include the double is construction, such as in

B: That's the only reason I work there, is that my children now have graduated, and graduated from college.

and the double that construction, for example:

A: Or you can hope that if people keep their money that they'll spend more and create jobs and, and whatnot.

5.1.3 Special Usage of Language

Compared to written text, language use in oral conversations can be much more flexible. Such flexibility can have a significant influence on the recognition of conversation entailment. Below are a few situations of special usage.

1. Ellipsis

Ellipsis can happen in written text, but it is used much more widely and frequently in oral English. For example:

A: Did, did you go to college?
B: Well, no. I'm going right now.

In this conversation, speaker B's utterance means I'm going to college right now. It is important to recognize such an ellipsis in order to recognize entailments like B is going to college.

The "subjectless sentence" of Section 5.1.2 is a special case of ellipsis, in which a regular grammatical component of a sentence is omitted, yielding a special syntactic structure. Here we consider sentences that, although containing ellipses, are still in ordinary syntax.

2. Etcetera

There are many ways to express etcetera in English, such as and so on, and so forth, etc. More variations are seen specifically in spoken English, including or whatever, or something like that, and and stuff like that. These vague phrases, which can be either nominal or adverbial, require special recognition so that they can be distinguished from regular nominal or adverbial phrases. For example, a nominal etcetera can be used in conjunction with an enumeration of verb phrases:

A: They just watch them and let them play and things like that.

3. Negation

In Section 3.4.1 we discussed the importance of modeling negation in the textual entailment task, and we listed a set of negative adverbs in Table 3.2. However, negation in conversations can be represented by a larger variety of forms.
For example:

B: They've got to quit worrying about, uh, the, uh, religious, uh, overtones in our textbooks and get on with teaching the product.

In this utterance, the word quit also expresses negation: the utterance means the same as they've got to not worry about . . ..

4. Question form

Written text also takes question forms from time to time, but these are mostly rhetorical or hypothetical questions. In conversations, however, as two or more people communicate and exchange ideas and information, it is much more common to see one speaker ask a question which is answered by another speaker. For example:

B: Hi, um, okay what, now, uh, what particularly, particularly what kind of music do you like?
A: Well, I mostly listen to popular music.

5.2 Modeling Linguistic Features in Conversation Utterances

As a starting step, we chose to incorporate disfluency and some of the special usages of language into our conversation entailment system. This section describes how we model these features.

5.2.1 Modeling Disfluency

The detection of disfluency has been studied in previous work [70]. Our focus here is how disfluencies affect the recognition of conversation entailment, and how to model them in the entailment prediction process. We therefore employ a corpus of disfluency annotations on the Switchboard conversations, given by the Penn Treebank [61].

After the disfluencies are detected (or marked out by annotation), we treat the different types differently. Filled pauses, explicit editing terms, discourse markers, coordinating conjunctions, and asides are directly removed. Interrupted utterances are pieced together to recover the meaning of the original utterances. Incomplete sentences are ungrammatical, usually impossible to analyze or comprehend, and often make no sense; they are therefore discarded from the conversations.

A more complex case is restarts. They need to be repaired for their original meaning to be understood. For example,

A: Show me flights from Boston on, uh, from Denver on Monday.

In this case, we remove the canceled part (e.g., from Boston on) as well as the accompanying filled pauses and editing terms (e.g., uh), and replace them with the repairing constituent (e.g., from Denver on). We are thus able to recover the correct form of this utterance: Show me flights from Denver on Monday.
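As a rough illustration of this preprocessing, the sketch below removes filler material and applies restart repairs. The data format (a token list plus annotated spans for fillers and for the canceled part of a restart) is an assumption made for illustration and does not reflect the exact Penn Treebank annotation format.

def clean_utterance(tokens, filler_spans, restart_spans):
    """Remove disfluencies from one utterance.

    tokens        : list of words in the utterance
    filler_spans  : (start, end) index pairs covering filled pauses, editing
                    terms, discourse markers, conjunctions, and asides
    restart_spans : (start, end) pairs covering the canceled part of a
                    restart (the repairing part that follows is kept as-is)
    Returns the repaired token list.
    """
    drop = set()
    for start, end in filler_spans + restart_spans:
        drop.update(range(start, end))
    return [tok for i, tok in enumerate(tokens) if i not in drop]


# Example: "Show me flights from Boston on , uh , from Denver on Monday"
tokens = ["Show", "me", "flights", "from", "Boston", "on", ",", "uh", ",",
          "from", "Denver", "on", "Monday"]
print(" ".join(clean_utterance(tokens,
                               filler_spans=[(6, 9)],      # ", uh ,"
                               restart_spans=[(3, 6)])))   # "from Boston on"
# -> Show me flights from Denver on Monday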
5.2.2 Modeling Polarity

In Section 5.1.3 we found a group of words that can express negative polarity in conversations but were not used in textual entailment. We expanded this group of words and added them to the set of negative modifiers used in textual entailment (Table 3.2). The expanded set of negative words is listed in Table 5.1.

Table 5.1: The expanded set of negative words used for polarity check

The set used in textual entailment:
  barely     never     not       scarcely
  hardly     no        n't       seldom
  little     none      nowhere
  neither    nor       rarely

The expanded set:
  abolish    deny          give up       proscribe
  abort      disallow      halt          put off
  abrogate   disapprove    hesitate      quit
  annul      disclaim      interdict     refuse
  avert      discontinue   invalidate    reject
  avoid      drop          negate        repeal
  ban        eliminate     neglect       repudiate
  bar        escape        nullify       rescind
  cancel     except        obviate       resist
  cease      exclude       omit          revoke
  debar      fail          oppose        stop
  decline    forbid        postpone      terminate
  defer      forestall     preclude      void
  defy       forget        prevent
  delay      gainsay       prohibit

Using the negative identifiers in Table 5.1 and the post-processing mechanism of polarity check described in Section 3.4.1, we are now able to detect that conversation entailment examples such as

Premise:
B: They've got to quit worrying about, uh, the, uh, religious, uh, overtones in our textbooks and get on with teaching the product.
Hypothesis: B believes people should focus more on the religious overtones of textbooks.

are false entailments.

5.2.3 Modeling Non-monotonic Context

In Section 3.4.2 we proposed that after each entailment prediction, if the prediction is true entailment, we need to check the context of the entailing statement against the monotonicity assumption. In conversation entailment we follow the same idea. However, the category of non-monotonic context should not be limited to what was introduced in Section 3.4.2. For example:

Premise:
B: What kind of music do you like?
Hypothesis: A likes music.

The clause representation of the hypothesis is

  x1 = A, x2 = music, x3 = likes, subject(x3, x1), object(x3, x2)

In the conversation segment, we can align the term y1 = music to the hypothesis term x2 (music), the term y2 = you to the hypothesis term x1 (A), and the term y3 = like to the hypothesis term x3 (likes). Since y2 is y3's subject in the premise, which entails the hypothesis clause subject(x3, x1), and y1 is y3's object in the premise, which entails the hypothesis clause object(x3, x2), all clauses in the hypothesis are entailed. According to the entailment framework in Section 3.1, the hypothesis would be predicted to be entailed from the conversation segment.

However, the hypothesis in this example is clearly not entailed from the premise, because the conversation segment provides no descriptive information about speaker A. In fact, the premise relations subject(y3, y2) and object(y3, y1), from which the hypothesis clauses are entailed, all occur in a question asked by speaker B. Therefore, in conversation entailment we identify questions (including wh-questions and yes-no-questions) as non-monotonic context too. Admittedly, questions could also be identified as non-monotonic context in textual entailment. But, as we discussed in Section 5.1.3, question forms are not extensively used in written text, while they are much more common in conversations.
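The following is a minimal sketch of how such post-processing checks might be applied after a positive prediction. The word list, the utterance representation (a dict carrying a dialogue-act label), and the function names are simplified assumptions for illustration, not the system's actual data structures, and here only a subset of Table 5.1 is shown.

NEGATIVE_WORDS = {"not", "n't", "never", "no", "quit", "stop", "refuse",
                  "fail", "forbid", "avoid", "deny"}  # subset of Table 5.1

QUESTION_ACTS = {"wh-question", "yes-no-question"}


def polarity_flip(premise_words, hypothesis_words):
    """True if exactly one side is negated, i.e., the polarities disagree."""
    neg_p = any(w.lower() in NEGATIVE_WORDS for w in premise_words)
    neg_h = any(w.lower() in NEGATIVE_WORDS for w in hypothesis_words)
    return neg_p != neg_h


def in_question_context(entailing_utterances):
    """True if every premise clause supporting the hypothesis comes from a
    question, which Section 5.2.3 treats as non-monotonic context."""
    return bool(entailing_utterances) and all(
        u["dialogue_act"] in QUESTION_ACTS for u in entailing_utterances)


def post_check(prediction, premise_words, hypothesis_words, entailing_utterances):
    """Downgrade a positive prediction that fails the polarity or
    monotonicity check."""
    if prediction and (polarity_flip(premise_words, hypothesis_words)
                       or in_question_context(entailing_utterances)):
        return False
    return prediction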
5.2.4 Evaluation

In order to evaluate our improved system modeling the linguistic features in conversation utterances, and to compare it to the baseline system using models from textual entailment, we investigated both how they perform on the verb alignment task and how they classify entailment prediction.

Evaluation on Verb Alignment

Figures 5.1(a) and 5.1(b) show the verb alignment results on the development and the test data sets, respectively. The Baseline results were produced by the system using models from textual entailment, and the +F-utterance results were produced by the system incorporating linguistic features in conversation utterances. Results are presented with different thresholds of model output.

[Figure 5.1: Evaluation of verb alignment for the system modeling linguistic features in conversation utterances. (a) On development data; (b) on test data. X-axis: threshold; Y-axis: f-measure; series: Baseline, +F-utterance.]

As we see from the comparison, the two systems do not produce very different results on verb alignment. This is not surprising, since the modeling of linguistic features (as described in Section 5.2) mostly happens at the post-processing stage (polarity check or monotonicity check). Figure 5.2, which breaks the alignment results down by hypothesis type, again demonstrates a similar comparison between the two systems.

[Figure 5.2: Evaluation of verb alignment by different hypothesis types for the system modeling linguistic features in conversation utterances.]

Evaluation on Entailment Prediction

Figure 5.3 compares the entailment prediction performances of the two systems on both the development and the test data sets. Overall, the system incorporating linguistic features in conversation utterances (+F-utterance) shows some improvement over the Baseline system on the development data, but not much improvement on the test data. This is because the baseline on the development data is relatively low (as discussed in Section 4.4.4). Neither the performance on the development data nor that on the test data beats the natural baseline of majority guess (56.4% and 53.1%, respectively, as in Section 4.3.3) after modeling linguistic features in conversation utterances.

[Figure 5.3: Evaluation of entailment prediction for the system modeling linguistic features in conversation utterances.]

However, if we break down the evaluation results by hypothesis type, as also shown in Figure 5.3, we can see that modeling linguistic features brings improvement on both data sets for the inference of fact hypotheses. Statistical tests show that these improvements are significant on both data sets (McNemar's test, p < 0.05). This demonstrates that the modeling of linguistic features in conversation utterances helps identify the entailment of fact hypotheses, but is not so effective for the other types of hypotheses (belief, desire, and intent). The entailment of belief, desire, and intent hypotheses requires further modeling beyond the conversation utterances, such as features of conversation structure.

5.3 Features of Conversation Structure

A more important characteristic of conversations is that they are communications between two or more people. Thus they contain interactions between the conversation participants, such as turn-taking, grounding, etc. These unique conversation features make the task of conversation entailment even more distinctive from that of textual entailment. Consequently, modeling the structure of conversation interaction can be critical to recognizing conversation entailment. Below are several examples of conversation structures that need to be modeled in order to predict correct entailments.

1. Question and answer

In the example,

Premise:
A: And where abouts were you born?
B: Up on Person County.
Hypothesis: B was born in Person County.

speaker A asks a question and B gives an answer. In order to correctly infer the statement in the hypothesis, we need to consider B's answer in the context of A's question, i.e., up on Person County in B's answer is an adjunct to the verb born in A's question.

Generally, for a wh-question and its answer, we need to identify the semantic relation between the proper constituents from the question and from the answer, respectively, so that a similar relation in the hypothesis can be entailed. For a yes-no-question, however, we can usually find all desired relations in the question itself, and use the yes or no answer to validate or invalidate such relations. For example,

Premise:
A: Do they, were, were there, um, are you allowed to, um, be casual, like if it was summer, were, were you allowed to wear sandals, and those, tha-, not real,
B: Not really, I think, I mean it's kind of unwritten, but I think we're supposed to wear hose and, and shoes,
Hypothesis: B is allowed to wear sandals.

In this case the hypothesis is not entailed, because in the conversation segment speaker B gives a no-answer, which invalidates the relation between you and allowed in A's question.

2. Viewpoint and agreement

In the following example,

Premise:
A: We did aerobics together for about a month and a half and that went over real well,
B: Uh-huh.
A: but, uh, that's about it there.
B: Oh, it's good and it's healthy, too.
A: Oh, yeah, yeah.
Hypothesis: A agrees with B that aerobics is healthy.

speaker B raises a viewpoint and speaker A agrees with it. There are three parts of the hypothesis related to the verb agree: the person who agrees (A), the person who is agreed with (B), and the content that is agreed on (that aerobics is healthy). The content part can be entailed from speaker B's utterance (it's good and it's healthy) in the conversation segment, while the other two parts have to be inferred from the relation between the utterance Oh, yeah, yeah and its speaker, and the relation between these two utterances.

A similar case is a viewpoint and a disagreement. For example,

Premise:
A: Of course, there's not a whole lot of market for seventy-eight RPM records.
B: Is there not? You, you'd, well you'd think there would be.
Hypothesis: B disagrees with A about the market for seventy-eight RPM records.

3. Proposal and acceptance

In the example below,

Premise:
B: Have you seen Sleeping with the Enemy?
A: No. I've heard that's really great, though.
B: You have to go see that one.
A: Sure.
Hypothesis: A is going to see Sleeping with the Enemy.

speaker B makes a proposal and speaker A accepts it. Again, here we need to consider speaker B's utterance you have to go see that one and speaker A's utterance sure together in order to predict the entailment of the hypothesis statement. Terms and relations in the hypothesis can be entailed by the terms and relations in B's utterance, but the whole statement has to be validated by A's acceptance of the proposal. Similarly, there can be proposals and denials, which also require the modeling of conversation structure to be correctly recognized.

5.4 Modeling Structural Features of Conversations

In order to model the structural features of conversations in our entailment system, we first incorporate the conversation structures into our representation of conversation segments, i.e., the clause representations.
5.4.1 Modeling Conversation Structure in Clause Representation

Previously we have used the same technique to represent the utterances in conversations as we used in representing text. For example, Figure 5.4(a) shows an example of a conversation segment (with a corresponding hypothesis), and Figure 5.4(c) shows the clause representation for the conversation segment in this example. As we described in Section 3.1.1, a clause representation is equivalent to a dependency structure, so Figure 5.4(b) also shows the dependency structure of the conversation utterances.

[Figure 5.4: An example of dependency structure and clause representation of conversation utterances. (a) A conversation segment (premise) and a corresponding hypothesis:
  Premise:
  B: Have you seen Sleeping with the Enemy?
  A: No. I've heard that's really great, though.
  B: You have to go see that one.
  Hypothesis: B suggests A to watch Sleeping with the Enemy.
(b) Dependency structure of the conversation utterances. (c) Clause representation of the conversation utterances, with terms y1 = A, y2 = Sleeping with the Enemy, y3 = seen, y4 = have, y5 = A, y6 = that, y7 = is really great, y8 = have heard, y9 = A, y10 = one, y11 = see, y12 = go, y13 = have, and clauses obj(y3, y2), subj(y3, y1), aux(y3, y4), subj(y7, y6), obj(y8, y7), subj(y8, y5), obj(y11, y10), obj(y12, y11), obj(y13, y12), subj(y13, y9).]

This representation captures only the information in the conversation utterances, without the information of the conversation structure. In order to incorporate structural features such as speaker identity, turn-taking, and dialogue acts, we propose to augment the representation of conversation segments by introducing additional terms and predicates (a small data-structure sketch follows below):

- Utterance terms: we use a group of pseudo terms u1, u2, . . . to represent individual utterances in the conversation segment. We associate the dialogue act of each utterance with the corresponding term, e.g., u1 = yes_no_question. Here we use the set of 42 dialogue acts from the Switchboard annotation [38]; Appendix B lists the dialogue act set.

- Speaker terms: we use two additional terms sA, sB to represent the individual speakers in the conversation. These terms can potentially increase for multi-party conversations.

- Speaker predicates: we use a relational clause speaker(., .) to represent the speaker of each utterance, e.g., speaker(u1, sB).

- Content predicates: we use a relational clause content(., .) to represent the content of each utterance, where the two arguments are the utterance term and the head term of the utterance, respectively, e.g., content(u3, y8) (where y8 = heard).

- Utterance flow predicates: we use a relational clause follow(., .) to connect each pair of adjacent utterances, e.g., follow(u2, u1). We currently do not consider overlap in utterances, but our representation can be modified to handle this situation by introducing additional predicates.

Figure 5.5(b) shows the augmented representation for the same example as in Figure 5.4, and Figure 5.5(a) shows the corresponding dependency structure together with the conversation structure. The highlighted parts in these figures illustrate the newly introduced terms and predicates (relations). Because such representations of conversation segments take conversation structures into consideration, we call them structural representations.
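To make the augmented representation concrete, here is a minimal sketch of how a structural representation could be stored. The class layout and the helper for adding an utterance are illustrative assumptions, not the exact implementation used in this work.

from dataclasses import dataclass, field

@dataclass
class StructuralRepresentation:
    """Terms plus relational clauses for one conversation segment."""
    terms: dict = field(default_factory=dict)    # term id -> surface form / dialogue act
    clauses: list = field(default_factory=list)  # (predicate, arg1, arg2) triples
    last_utt: str = None

    def add_utterance(self, utt_id, speaker_id, dialogue_act, head_term):
        # Utterance term labeled with its dialogue act.
        self.terms[utt_id] = dialogue_act
        # speaker(u, s), content(u, head), follow(u, previous utterance)
        self.clauses.append(("speaker", utt_id, speaker_id))
        self.clauses.append(("content", utt_id, head_term))
        if self.last_utt is not None:
            self.clauses.append(("follow", utt_id, self.last_utt))
        self.last_utt = utt_id


# The first utterance of Figure 5.5: "B: Have you seen Sleeping with the Enemy?"
rep = StructuralRepresentation()
rep.terms.update({"sA": "A", "sB": "B", "y3": "seen"})
rep.add_utterance("u1", "sB", "yes_no_question", "y3")
# rep.clauses now holds ("speaker", "u1", "sB") and ("content", "u1", "y3")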
In contrast, we call the previous representations without conversation structures (as in Figure 5.4) basic representations.

Our follow-up discussions are based on the same example as in Figure 5.4(a) and the representation in Figure 5.5. Therefore we also show the dependency structure and the corresponding clause representation of the hypothesis in Figure 5.5(c) and 5.5(d), respectively.

[Figure 5.5: The conversation structure and augmented representation for the example in Figure 5.4. (a) Dependency and conversation structures of the conversation segment. (b) Augmented representation of the conversation segment, adding speaker terms sA, sB, utterance terms u1 = yes_no_question, u2 = no_answer, u3 = statement, u4 = opinion, and clauses speaker(u1, sB), content(u1, y3), speaker(u2, sA), follow(u2, u1), speaker(u3, sA), content(u3, y8), follow(u3, u2), speaker(u4, sB), content(u4, y13), follow(u4, u3). (c) Dependency structure of the hypothesis. (d) Clause representation of the hypothesis: x1 = B, x2 = A, x3 = Sleeping with the Enemy, x4 = watch, x5 = suggests; subj(x4, x2), obj(x4, x3), subj(x5, x1), obj(x5, x4).]

5.4.2 Modeling Conversation Structure in the Alignment Model

Previously, our system was incapable of predicting entailments such as the one in Figure 5.4(a), because the hypothesis term suggests is not expressed explicitly in the premise, and thus the system cannot find an alignment in the premise for such a term. Instead, it is the conversation utterance of speaker B, You have to go see that one, that constitutes the act of making a suggestion. We therefore propose to take conversation structure into consideration in order to solve this problem.

Specifically, with the structural representations of conversation segments, we incorporate the (pseudo) terms representing conversation utterances into our alignment model. We call the alignments involving such terms pseudo alignments. For example, Figure 5.6 gives a complete alignment between the premise terms and the hypothesis terms in Figure 5.5, where g(x5, u4) = 1 is a pseudo alignment.

[Figure 5.6: An alignment for the example in Figure 5.5, linking the hypothesis terms (x1 = B, x2 = A, x3 = Sleeping with the Enemy, x4 = watch, x5 = suggests) to conversation-segment terms, including the pseudo alignment between x5 and u4.]

A pseudo alignment is identified between a hypothesis term x and a premise term u if they satisfy the following conditions:

1. x is a verb matching the dialogue act of u, e.g., x5 = suggests is a match of u4 = opinion;

2. The subject of x matches the speaker of utterance u, e.g., the subject of x5, x1 = B, is a match of the speaker of u4, which is sB.

The match of subjects is straightforward because the speaker of an utterance can only be either sA or sB. The match of verbs against dialogue acts is currently processed by a set of rules learned from the development data of conversation entailment. Each such rule V ~ U consists of two sets, a verb set V and a dialogue act set U, and means that any verb in V can be a match to any dialogue act in U. Below are a few examples of such rules:

1. {think, believe, consider, find} ~ {statement, opinion}

2. {want, like} ~ {opinion, wh-question}

3. {agree} ~ {agree, acknowledge, appreciation}

4. {disagree} ~ {yes-no-question}
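A minimal sketch of how such rule-based matching of verbs against dialogue acts could be applied is shown below. The rule table mirrors the examples above, while the function name and the speaker check are illustrative assumptions rather than the system's actual code.

# Verb-set ~ dialogue-act-set rules, as in the examples above.
PSEUDO_ALIGNMENT_RULES = [
    ({"think", "believe", "consider", "find"}, {"statement", "opinion"}),
    ({"want", "like"},                         {"opinion", "wh-question"}),
    ({"agree"},                                {"agree", "acknowledge", "appreciation"}),
    ({"disagree"},                             {"yes-no-question"}),
]


def is_pseudo_alignment(hyp_verb, hyp_subject, utt_dialogue_act, utt_speaker):
    """Check the two conditions for a pseudo alignment between a hypothesis
    verb term and an utterance term.

    hyp_verb         : lemma of the hypothesis verb, e.g. "agree"
    hyp_subject      : subject of the hypothesis verb, e.g. "B"
    utt_dialogue_act : dialogue act of the utterance, e.g. "acknowledge"
    utt_speaker      : speaker of the utterance, e.g. "B"
    """
    # Condition 1: the verb matches the dialogue act under some rule.
    verb_matches = any(hyp_verb in verbs and utt_dialogue_act in acts
                       for verbs, acts in PSEUDO_ALIGNMENT_RULES)
    # Condition 2: the subject of the verb matches the speaker of the utterance.
    return verb_matches and hyp_subject == utt_speaker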
5.4.3 Evaluation

To investigate how the modeling of conversation structure helps our entailment system, in this section we evaluate the entailment system incorporating the structural features. The evaluation is again conducted on two tasks: the verb alignment task and the entailment prediction task.

Evaluation on Verb Alignment

Since we have introduced pseudo alignments in Section 5.4.2, the ground truth for verb alignment is now different from before, when conversation structure was not incorporated in the representations. The current true alignments of verbs also include pseudo alignments, i.e., alignments between verb terms in the hypotheses and (pseudo) utterance terms in the conversations. For this reason, the verb alignment of the system incorporating structural features of conversations cannot be compared to that of a system without structural feature modeling. Therefore we evaluate the verb alignment of the system with structural feature modeling on its own.

Figure 5.7 shows the system's performance on verb alignment for both the development and the test data sets after modeling features of conversation structures. An overall trend is that, as the threshold goes up, the system's performance does not decrease as much as that of the previous system without modeling conversation structures (see Figure 5.1), especially for the development data set. This is because our system takes a rule-based classification mechanism for pseudo alignments, so the recall of pseudo alignments is not affected by high thresholds.

[Figure 5.7: Evaluation of verb alignment for the system modeling conversation structure features. X-axis: threshold; Y-axis: f-measure; series: Development, Test.]

Figure 5.8 shows the alignment results broken down by hypothesis type for both the development and the test data sets at threshold 0.7. A dramatic result in this figure is that the verb alignment performance for intent hypotheses now exceeds the performance for all other hypothesis types. This is what we expected: pseudo alignments help align the verb terms in intent hypotheses the most, since such hypotheses have many verbs (e.g., x5 = suggests in Figure 5.5(d)) that have to be aligned to pseudo terms of utterances with dialogue acts (e.g., u4 = opinion in Figure 5.5(b)).

[Figure 5.8: Evaluation of verb alignment by different hypothesis types for the system modeling conversation structure features (Overall, Fact, Belief, Desire, Intent).]

Evaluation on Entailment Prediction

Figures 5.9(a) and 5.9(b) show the accuracies of entailment prediction for three systems on the development and test data sets. The three systems are a baseline system using models trained from textual entailment (Baseline), an improved system modeling linguistic features in conversation utterances (+F-utterance), and a further improved system incorporating features of conversation structures (+F-structure).

Overall, the system modeling conversation structure shows limited improvement compared to the other two systems. The improvement is more noticeable on the development data, since the model that captures pseudo alignments is learned from that same data set.
However, if we break down the same evaluation results by hypothesis type (also shown in Figure 5.9), we can see that the system modeling conversation structure features increases the prediction accuracy significantly for the intent type of hypotheses (McNemar's test, p < 0.05). This is consistent with what we found when evaluating verb alignments, i.e., the incorporation of pseudo alignments is most effective for hypotheses about intents.

[Figure 5.9: Evaluation of entailment prediction for the system modeling conversation structure features. (a) On development data; (b) on test data. Y-axis: accuracy; series: Baseline, +F-utterance, +F-structure; broken down by Overall, Fact, Belief, Desire, Intent.]

It should also be noted that after incorporating features modeling conversation structures, the whole system is re-trained to maximize the performance over all hypothesis types. As a trade-off, the performance on some subsets of examples is sacrificed (decreased). For example, in Figure 5.9(b), for the fact type the performance of the +F-structure system decreases on the test data compared to the +F-utterance system. For the desire type, the performance increases on the test data but decreases on the development data.

So why, in the end, does the performance on some examples decrease? We further investigated the changes brought into the system by modeling conversation structures. We found that structural modeling creates more connectivity between language constituents that were not connected before. For example:

Premise:
A: He, he plays on Murphy Brown.
Hypothesis: A plays on Murphy Brown.

Speaker A does not appear in the utterance of the conversation, so our previous system would not find an alignment for the hypothesis term A, and would thus predict the entailment to be false, which is a correct prediction. However, after introducing the modeling of conversation structure, we identify plays as the content of the utterance in the conversation segment, and A as the speaker of that utterance, as we show in Figure 5.10. Thus a link between plays and A is established, i.e., y3 ~ u1 ~ sA in Figure 5.10(a). In this case, a correct prediction of the false entailment has to recognize in Figure 5.10(a) that the relationship between y3 and sA is not a verb-subject relation.

[Figure 5.10: An example of measuring the relationship between two terms by their distance (the distance between y3 and sA is 2). (a) The dependency and conversation structures of the conversation segment. (b) The structural representation of the conversation segment: sA = A, u1 = statement, y1 = he, y2 = Murphy Brown, y3 = plays; speaker(u1, sA), content(u1, y3), subj(y3, y1), obj(y3, y2).]

However, our primitive entailment models only use the distance between language constituents to model their semantic relationship, i.e., in the alignment model of Section 3.3.1 we use distance to model the relations between a verb and its arguments (subject or object), and in the inference model of Section 3.3.2 we use distance to model any relation between two terms. As a result, since the distance between y3 and sA in Figure 5.10(a) is 2, the alignment model would recognize A as an argument of y3, and the inference model would use it to infer the hypothesis clause subject(plays, A).
Therefore, it is critical to develop a better approach to modeling the semantic relations between language constituents, in order to improve our models for both alignment classification and inference recognition.

Chapter 6

Enhanced Models for Conversation Entailment

In Section 5.4.3 we pointed out that the current models in our entailment system are very simple. A major inadequacy of these models is that they simply use the distance between two language constituents to model the semantic relationship between them. More specifically, in the alignment model we use distance to model the relationship between a verb and its arguments (subject or object), and in the inference model we use distance to model the relationship between any two terms.

To address this problem, in this chapter we aim at enhancing the entailment models by incorporating more semantics into the modeling of the long distance relationship between language constituents [93]. We first describe two approaches to modeling long distance relationships, and then discuss how these approaches are employed in our entailment models. For the convenience of discussion, we copy Figure 5.5 as Figure 6.1.

[Figure 6.1: A copy of Figure 5.5: the structural representation of a conversation segment and the corresponding hypothesis. (a) Dependency and conversation structures of the conversation segment. (b) Augmented representation of the conversation segment: terms sA, sB; u1 = yes_no_question, u2 = no_answer, u3 = statement, u4 = opinion; y1 = A, y2 = Sleeping with the Enemy, y3 = seen, y4 = have, y5 = A, y6 = that, y7 = is really great, y8 = have heard, y9 = A, y10 = one, y11 = see, y12 = go, y13 = have; clauses speaker(u1, sB), content(u1, y3), obj(y3, y2), subj(y3, y1), aux(y3, y4), speaker(u2, sA), follow(u2, u1), speaker(u3, sA), content(u3, y8), follow(u3, u2), subj(y7, y6), obj(y8, y7), subj(y8, y5), speaker(u4, sB), content(u4, y13), follow(u4, u3), obj(y11, y10), obj(y12, y11), obj(y13, y12), subj(y13, y9). (c) Dependency structure of the hypothesis. (d) Clause representation of the hypothesis: x1 = B, x2 = A, x3 = Sleeping with the Enemy, x4 = watch, x5 = suggests; subj(x4, x2), obj(x4, x3), subj(x5, x1), obj(x5, x4).]

6.1 Modeling Long Distance Relationship

A relationship can exist between any two language constituents in a discourse, even when there is no direct syntactic relation between them. For example, in Figure 6.1(a), the term y9 = you is not the syntactic subject of the term y11 = see (i.e., we do not have subject(y11, y9)). However, if we try to identify the arguments of the verb y11 = see, we can see that y9 = you is its logical subject. We call such a relation between two terms a long distance relation (LDR).

This raises the question of how we can find the logical relation between the two terms (e.g., logic_subject(y11, y9)), given the current representation of the conversation segment (i.e., dependency plus conversation structures).

6.1.1 Implicit Modeling of Long Distance Relationship

Our previous approach is to use the distance between the two terms in the structural representation as the model of their long distance relationship. For example, in Figure 6.1(a), the distance between y11 and y9 is 3. We call this approach the implicit modeling of long distance relationship.
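As a minimal illustration of this distance computation, the sketch below builds an undirected graph from the clauses of a structural representation and counts the number of edges between two terms. The clause format mirrors Figure 6.1(b), while the function itself is an illustrative assumption rather than the system's actual implementation.

from collections import deque, defaultdict

def term_distance(clauses, source, target):
    """Shortest number of edges between two terms, where every relational
    clause pred(a, b) contributes an undirected edge a -- b."""
    graph = defaultdict(set)
    for _pred, a, b in clauses:
        graph[a].add(b)
        graph[b].add(a)

    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # not connected


# A fragment of Figure 6.1(b): "B: You have to go see that one."
clauses = [("subj", "y13", "y9"),   # have -> you
           ("obj",  "y13", "y12"),  # have -> go
           ("obj",  "y12", "y11"),  # go -> see
           ("obj",  "y11", "y10")]  # see -> one
print(term_distance(clauses, "y11", "y9"))   # -> 3, as in the text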
The rationale behind the implicit modeling approach is that the closer two terms are in the dependency and conversation structures, the more likely there is a relationship between them. As a basic approach, we do not distinguish what type that relationship is. For example, the hypothesis terms x4 and x2 in Figure 6.1(d) have the relationship subject(x4, x2). This relationship will be determined to be entailed from y11 and y9 in the premise if x4 and x2 are aligned to y11 and y9, respectively, and the distance between y11 and y9 is small. This decision is made regardless of whether the relationship between y11 and y9 is subject(y11, y9).

The advantage of the implicit modeling approach is that it is easy to implement on top of the dependency and conversation structures. Its limitation, however, is that the distance measure does not capture the semantics of the relation types between language constituents. For example, in Figure 5.10, the distance between the terms y3 and sA is 2, so the algorithm identifies that there is a relationship between them. But since the type of this relationship is not identified, our entailment system would mistakenly use it to infer relations like subject(., .).

6.1.2 Explicit Modeling of Long Distance Relationship

We notice that the identification of relation types such as subject(., .) is very much like identifying the arguments of a verb, e.g., deciding whether an entity is the subject of a verb. In similar language processing tasks such as semantic role labeling [74], previous work has often used the path from one constituent to the other in a syntactic parse tree as a feature to identify the verb-argument relationship. We adopt the same idea here and use the path between two terms in the dependency structure (augmented with conversation structure) to model the long distance relationship.

Specifically, a path from one term to another in a dependency/conversation structure is defined as a series of labels representing the vertices and edges connecting them:

  v_1 e_1 v_2 e_2 ... v_{l-1} e_{l-1} v_l

where v_1, ..., v_l are the labels of the vertices on the path, and e_1, ..., e_{l-1} are the labels of the edges on the path. In our experiment we label each vertex by one of three types: noun (N), verb (V), or utterance (U); and we label each edge by its direction: forward (->) or backward (<-). For example, in Figure 6.1(a), the path from y11 to y9 is

  V -> V -> V <- N

and in Figure 5.10(a) the path from y3 to sA is

  V -> U <- N

Although various labels could be designed to describe the vertices and edges on a path, our criteria for choosing the labeling system are that:

1. The labels are adequately detailed to capture the semantics of different types of relations. For example, it should be differentiated that V -> V -> V <- N models a verb-subject relationship, while V -> U <- N does not.

2. The labels are also abstracted to a certain extent (i.e., not overly detailed), so that the modeling is generalizable. For example, if we described the path from y11 to y9 in Figure 6.1(a) as see -obj-> go -obj-> have <-subj- you, this pattern might never be seen again in other examples.

6.2 Modeling Long Distance Relationship in the Alignment Model

In Chapter 3 we described the mechanism of how the alignment model works in our entailment system. Specifically, in Section 3.3.1 we described the feature sets used to train the alignment models for nouns and verbs.
A verb alignment model classifies, for two verbs x from a hypothesis and y from a premise, whether they are aligned or not. Two important features in the verb alignment model are whether the arguments (i.e., subjects and objects) of x and y are consistent.

In this section we first give a brief review of how the argument consistencies are modeled in our previous system, then propose an enhanced model of the argument consistencies, and finally evaluate both modeling methods and compare their performances.

6.2.1 Implicit Modeling of Long Distance Relationship in the Verb Alignment Model

The previous approach models the argument consistency of two verbs based on implicit modeling of the relationship between a verb and its aligned subject/object. Specifically, given a pair of verb terms (x, y), where x is from the hypothesis and y is from the premise, let sx be the subject of x in the hypothesis, and let sy be the aligned entity of sx in the premise (in case of multiple alignments, sy is the one closest to y). The subject consistency of the verbs (x, y) is then modeled by the long distance relationship between sy and y. For implicit modeling of LDR, this relationship is measured by the distance between sy and y in the (augmented) dependency structure of the premise.

For example, in Figure 6.1, to decide whether the hypothesis term x4 = watch and the premise term y11 = see should be aligned, we first identify the subject of x4 in the hypothesis, which is x2 = A. We then look for x2's alignments in the premise, among which y9 = you is the closest to y11. In Figure 6.1(a), we find that the distance between y11 and y9 is 3. Similarly, the distance between a verb and its aligned object is used as a measure of the object consistency.

With implicit modeling of the long distance relationship in the alignment model, the feature values of the argument consistencies are quantitative. And since all other features used in the alignment model (see Section 3.3.1) are either quantitative or binary, we trained a discriminative binary classification model (e.g., a logistic regression model) to classify verb alignments.

The limitation of such an alignment model is that the implicit modeling of LDR does not capture the semantic relationship between a verb and its aligned subject or object. For example, as we discussed in Section 5.4.3, the implicit alignment model would also identify the term sA in Figure 5.10 as the subject of y3.

6.2.2 Explicit Modeling of Long Distance Relationship in the Verb Alignment Model

In order to model more semantics in the relationship between a verb and its aligned subject/object, we adopt the explicit modeling of long distance relationship. Given a pair of verb terms (x, y), let sx be the subject of x and sy be the aligned entity of sx in the premise closest to y; we use the explicit modeling of the long distance relationship between y and sy as the feature to capture subject consistency. That is, we use a string to describe the path from y to sy, where the string description of a path is defined in Section 6.1.2. For example, in Figure 6.1(a), the path from y11 to y9 is V -> V -> V <- N. Such explicit modeling is used to capture both the subject consistency and the object consistency.

Since the string representation of paths is not quantitative, we cannot plot the data instances with this feature in a measurable feature space.
In order to use a discriminative model such as logistic regression for the classification, the traditional way of quantifying a string feature is to convert it into p binary features, where p is the number of possible values of the original string feature. In our case, however, the number of values of the path feature can be very large, while the size of our data set is relatively small. As a result, this would cause a severe sparse data problem.

Therefore, we use an instance-based classification model (e.g., k-nearest neighbour) instead of the discriminative model. An instance-based model requires only a distance measure between any two particular instances. Suppose that for any feature f_i we have a distance measure between any two of its values, v(f_i) and w(f_i). Then for any two instances a and b with n features f_1 ... f_n, where the feature values of instance a are v_1(f_1) ... v_n(f_n) and the feature values of instance b are w_1(f_1) ... w_n(f_n), we can calculate the distance between a and b as their Euclidean distance:

  dist(a, b) = sqrt( sum_{i=1..n} dist(v_i(f_i), w_i(f_i))^2 )

For binary features such as verb be identification, string equality, and stemmed equality in Section 3.3.1, the distance between two values is whether the two values are the same:

  dist(v_i(f_i), w_i(f_i)) = 1 if v_i(f_i) = w_i(f_i), and 0 otherwise

For quantitative features such as WordNet similarity and distributional similarity in Section 3.3.1, the distance between two values is the absolute value of their difference:

  dist(v_i(f_i), w_i(f_i)) = |v_i(f_i) - w_i(f_i)|

And for string features such as the subject/object consistency (with explicit modeling of LDR), the distance between two values is their minimal string edit distance (Levenshtein distance).

6.2.3 Evaluation of LDR Modelings in Alignment Models

We evaluate two alignment models with different modelings of long distance relationship, one with implicit modeling and one with explicit modeling. The implicit alignment model is the same as the one evaluated in Sections 4.4, 5.2.4, and 5.4.3, and the explicit alignment model is trained as a k-nearest neighbour model (described in Section 6.2.2) on the development data set. We therefore compare their performances on verb alignment only on the test data of conversation entailment.
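The following is a minimal sketch of such a mixed-feature instance distance, combining per-feature distances into the overall Euclidean distance used by a nearest-neighbour classifier. The feature encoding is an illustrative assumption, and binary mismatches are scored here as 0 when the values agree and 1 otherwise (the conventional mismatch indicator), rather than as stated verbatim above.

import math

def levenshtein(s, t):
    """Minimal string edit distance between two label strings."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]


def feature_distance(kind, v, w):
    """Distance between two values of a single feature."""
    if kind == "binary":        # e.g. string equality, stemmed equality
        return 0.0 if v == w else 1.0
    if kind == "quantitative":  # e.g. WordNet or distributional similarity
        return abs(v - w)
    if kind == "string":        # e.g. the explicit LDR path feature
        return float(levenshtein(v, w))
    raise ValueError(kind)


def instance_distance(a, b, kinds):
    """Euclidean combination over all features; `a`, `b`, and `kinds` are
    parallel lists of feature values and feature types."""
    return math.sqrt(sum(feature_distance(k, v, w) ** 2
                         for k, v, w in zip(kinds, a, b)))


# Two alignment candidates whose path features differ:
kinds = ["binary", "quantitative", "string"]
a = [True, 0.8, "V->V->V<-N"]
b = [True, 0.5, "V->U<-N"]
print(round(instance_distance(a, b, kinds), 3))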
This is because that a large portion of verbs in intent hypotheses are aligned to pseudo ut— terance terms in the premise, which are handled by the rule-based pseudo alignment model rather than the verb alignment model described and enhanced in this section. 6.3 Modeling Long Distance Relationship in the Inference Model In Section 3.1.3 we have formulated the inference model as to predict the probability that a clause sj in the hypothesis is-entailed from a set of clauses d1 . . .dm in the premise, given an alignment scheme g between the terms in the hypothesis and the terms in the premise: P(d1d2 . ..dm l: sjld1,d2,.u,dm,3jag) 109 new %////////////////////////%m mm ////////////////////////%m 111‘ (b) Based on struct of verb alignment with different modelings of long distance Evaluation 110 In Section 3.3.2 we have described the feature sets that we used to train our inference models, which are distinguished between the inference of property clauses and the in- ference of relational clauses. The inference of relational clauses involves the modeling of long distance’relationship. In this section we first give a review on the implicit modeling of long distance relationship that is previously used in the relational inference model, and then discuss about how the model can be enhanced by the explicit modeling of LDR. After that, the enhanced model of relational inference is evaluated and compared to the original inference model. 6.3.1 Implicit Modeling of Long Distance Relationship in the Relational Inference Model If the hypothesis clause 8 j is a relational clause, that means it takes two arguments (hypothesis terms). We denote it as sj(2:1,2:2). To predict whether it is entailed from the premise, we first find the counterparts (aligned terms) of $1 and 2:2 in the premise: D'1={'yly€ D,g(:v1,y)=1} 0'2 = {ny E D,9(r2.y)=1} and then get the closest pair of terms (yi‘, yg) from these two sets, i.e., the distance between y; and y; in the (augmented) dependency structure of premise is the smallest among any yl E Di and y2 6 D5. For example, in Figure 6.1, if we want to infer whether the hypothesis clause object(2:5, 2:4) is entailed, we find the alignments for 2:5 = suggests and $4 = watch, which are {u4 = opinion} and {y3 = seen,y11 = see} respectively. In Figure 6.1(a), the distance between u4 and y3 is 4, and the distance between u4 and 911 is 3, so the 111 closet pair of terms between these two sets is u4 and yll. So the inference decision on sj should be determined by the long distance rela- tionship between the premise terms y: and y5, i.e., whether (1) there is a relationship between y; and yg; and (2) whether such relationship is the same as sj, which de— scribes the relationship between hypothesis terms 2:1 and r2. Using implicit modeling of long distance relationship, we predict whether sj is inferred only by the distance between yi‘ and y; The smaller this distance is, the more likely these two terms have a direct relationship. Though such an assumption is reasonable, and the implicit modeling addresses to certain extent the first question above, however, it does not address the second question: whether the relationship between y; and y; is the same as described by sj. 6.3.2 Explicit Modeling of Long Distance Relationship in the Relational Inference Model In order to identify the relationship between yf and y;, we need to capture more semantics in the relationship between the two terms. 
As an enhanced model, we use explicit modeling instead of the implicit one to model the long distance relationship between y_1* and y_2*. In Figure 6.1(a), for example, the explicit modeling of the long distance relationship between u_4 and y_11 is

U ← V ← V ← V

Similar to Section 6.2.2, we use an instance-based model (e.g., k-nearest neighbour) to classify the inference decision on s_j(x_1, x_2). The explicit modeling of the long distance relationship between y_1* and y_2* is used as a feature in the classification model. Such a feature has values in string form, and the distance between two of its values can be calculated as their minimal string edit distance (as discussed in Section 6.2.2).

Additionally, the instance-based classification model enables us to add an additional set of nominal features into the classifier. Below is the list of additional features used in our system (given that the hypothesis clause to be inferred is s_j(x_1, x_2) and the closest pair of terms aligned to x_1 and x_2 in the premise are y_1* and y_2* respectively):

1. The types (noun/verb/utterance) of x_1, x_2, y_1*, and y_2*;
2. The type of relation between x_1 and x_2, for example, object in object(x_1, x_2);
3. The order (i.e., before or after) between x_1 and x_2, and between y_1* and y_2*;
4. The specific type of the hypothesis (fact/belief/desire/intent).

The distance between two values v_i(f_i) and w_i(f_i) of a nominal feature is estimated based on whether the two values differ (similar to binary features in Section 6.2.2):

\mathrm{dist}\big(v_i(f_i), w_i(f_i)\big) = \begin{cases} 0 & \text{if } v_i(f_i) = w_i(f_i) \\ 1 & \text{otherwise} \end{cases}

6.3.3 Evaluation of LDR Modelings in Inference Models

In this section we compare the performance of two inference models, one using implicit modeling of long distance relationship and one using explicit modeling. The explicit inference model is trained from the development data of conversation entailment using the features described in Section 6.3.2, so the evaluation and comparison are conducted only on the test data set.

Figure 6.3(a) shows the prediction results (accuracies) of the two inference models based on the basic representation of conversation segments, and Figure 6.3(b) shows the results based on the structural representation of conversation segments. In both figures we show inference results under different configurations of the alignment model. Additionally, in Figure 6.3(b) we also show the results based on manual annotations of alignments.

[Figure 6.3: Evaluation of inference models with different modelings of long distance relationship (implicit or explicit). (a) Based on the basic representation; (b) based on the structural representation.]

In Figure 6.3(a), when we use the basic representation of conversation segments, the two inference models perform almost the same. This illustrates that when conversation structure is missing from the representation, the explicit modeling of LDR in the inference model offers no significant advantage over implicit modeling.

But when conversation structures are incorporated in the representation of conversation segments, as shown in Figure 6.3(b), the explicit inference model consistently performs better than the implicit model.
The difference is statistically significant when the alignment model is also explicit or annotated (McNemar's test, p < 0.05).

The best performance of our system on the test data, without using manual annotations of alignments, is achieved under the configuration of the structural representation of conversation segments, the explicit alignment model, and the explicit inference model. The accuracy is 58.7%. Compared to the natural baseline that always predicts the majority class (53.1% accuracy on our testing data), our system achieves a significantly better result (z-test, p < 0.05).

[Figure 6.4: Evaluation of inference models with different LDR modelings for different hypothesis types. (a) Based on the structural representation and the explicit alignment model; (b) based on the structural representation and annotated alignments.]

We further break down the evaluation results of two settings in Figure 6.3(b) (one with the structural representation and the explicit alignment model, and one with the structural representation and the annotated alignments) by different types of hypotheses. The results are shown in Figure 6.4. We can see that the explicit inference model performs better than the implicit inference model in almost every subcategory. For both settings in Figure 6.4(a) and Figure 6.4(b), the improvements brought by explicit modeling of LDR are most prominent for the intent type of hypotheses.

It is interesting to see the difference in the system's performance on different types of hypotheses, for example, fact and intent. In Section 6.2.3 we found that fact hypotheses benefit more from explicit LDR modeling in the alignment model than intent hypotheses do. When we evaluate different LDR modelings in the inference model in this section, the findings are the opposite. That means that for fact hypotheses, most of the benefit from incorporating explicit modeling of long distance relationship appears at the alignment stage, while for intent hypotheses, the benefit of explicitly modeling long distance relationship mostly happens at the inference stage.

This observation shows that the effects of different types of modeling may vary for different types of hypotheses, which indicates that hypothesis-type-dependent models may be beneficial. However, since the current amount of training data is relatively small, our initial investigation has not yielded a significant improvement. Nonetheless, this remains a promising direction when larger training data become available.

6.4 Interaction of Entailment Components

In Section 6.2.3 we evaluated different implementations of alignment models, and in Section 6.3.3 we evaluated different implementations of inference models. We conducted both evaluations under different settings of conversation representations: a basic representation of conversation utterances only, and an augmented representation incorporating conversation structures. We have noticed that there is an interaction between these different components of our entailment system. For example, in Section 6.3.3 we found that the effect of explicit modeling of long distance relationship in the inference model is dependent on the incorporation of conversation structure in the clause representation.
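As an aside on the significance figures reported above, the sketch below reproduces the comparison against the majority-class baseline with a simple one-sample proportion z-test (using the reported 58.7% system accuracy, the 53.1% baseline, and the 584 test examples), and shows the McNemar statistic used for the paired model comparisons. The exact test variants are not spelled out in the text, and the McNemar discordant counts are hypothetical, so both are illustrative assumptions rather than the dissertation's own computation.

```python
from math import erf, sqrt

def normal_sf(z):
    """Upper-tail probability of the standard normal distribution."""
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))

# One-sample z-test: 58.7% accuracy vs. the 53.1% majority baseline on 584 examples.
n, p_hat, p0 = 584, 0.587, 0.531
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
print(f"z = {z:.2f}, two-sided p = {2 * normal_sf(abs(z)):.4f}")  # about z = 2.7, p < 0.01

def mcnemar(b, c):
    """McNemar's test with continuity correction; b and c count the test
    examples that exactly one of the two compared systems predicts correctly."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    return chi2, 2 * normal_sf(sqrt(chi2))  # chi-square with 1 degree of freedom

print(mcnemar(52, 24))  # hypothetical discordant counts, for illustration only
```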
In this section we further study the interaction between different entailment components, including different representations of conversation segments and different modelings of long distance relationship in the alignment and inference models. Specifically, we want to study how the change of one component might influence the entailment results under various configurations of the other components.

[Figure 6.5: Effect of different representations of conversation segments on entailment performance. (a) With the implicit inference model; (b) with the explicit inference model.]

6.4.1 The Effect of Conversation Representations

In Section 5.4.3 we evaluated the effect of the structural representation on entailment prediction, with implicit modeling of long distance relationship in both the alignment and inference models. In this section we study how the system's performance is affected by the representation of conversation segments under different long-distance-relationship modelings in the alignment and inference models. Specifically, we want to compare the basic and structural representations under implicit and explicit modelings of long distance relationship in the alignment model and in the inference model.

Figure 6.5(a) shows the comparison of entailment results using the two conversation representations under the setting of implicit LDR modeling in the inference model. Figure 6.5(b) shows the same comparison under the setting of explicit LDR modeling in the inference model. In each of these figures, we conduct the comparison under three settings of the alignment model: one with implicit modeling of LDR, one with explicit modeling of LDR, and one with annotated alignments.

[Figure 6.6: Effect of different conversation representations for different hypothesis types. (a) With the explicit alignment model and the explicit inference model; (b) with annotated alignments and the explicit inference model.]

We can see that for all six configurations of alignment and inference models, the structural representation consistently yields better entailment performance than the basic representation.

In addition, for two of the settings in Figure 6.5(b), namely explicit/explicit and annotated/explicit for the alignment/inference models, the improvement brought by the structural representation over the basic representation is statistically significant (McNemar's test, p < 0.01). Considering that these two configurations demonstrate a bigger advantage of the structural representation than the other configurations, we may conclude that the structural representation has its most prominent advantage over the basic representation when it is used together with downstream components (alignment and inference models) that take shallow semantics into consideration (i.e., explicit modeling of long distance relationship).

We further break down the comparison results under these two configurations (explicit/explicit and annotated/explicit) by different types of hypotheses, as shown in Figure 6.6.
In both Figure 6.6(a) and Figure 6.6(b), the performance difference between the basic representation and the structural representation is not significant for hypotheses of fact, belief, and desire. However, for hypotheses of intent, the structural representation shows a significant advantage over the basic representation (McNemar's test, p < 0.001).

This is consistent with what we found in Section 5.4.3: no matter what entailment models are used (implicit, explicit, or annotated), the improvement brought by the structural representation mainly comes from the intent type of hypotheses. Such an observation is not surprising, since most hypotheses in the other subcategories (especially fact ones) can be inferred directly from the conversation utterances.

6.4.2 The Effect of Alignment Models

Different from the study in Section 6.2.3, where alignment models with different modelings of long distance relationship were evaluated by the alignment results they produce, in this section we study how the system's entailment performance is affected by using different alignment models. Specifically, we want to compare the implicit and explicit alignment models under various settings of conversation representations and inference models.

Figure 6.7(a) compares the entailment results using different alignment models, based on the basic representation of conversation segments. Figure 6.7(b) shows the same comparison based on the structural representation of conversation segments.

[Figure 6.7: Effect of different alignment models on entailment performance. (a) Based on the basic representation; (b) based on the structural representation.]

Different from the comparison results in Section 6.2.3, where the explicit alignment model improved the alignment performance over the implicit model on both the basic and the structural representations of conversation segments, here there is no significant difference between the entailment performances of the two alignment models based on the basic representation of conversation segments (as shown in Figure 6.7(a)). This provides evidence for a phenomenon previously found by other researchers [60]: a better alignment performance does not necessarily transfer to a better inference performance in entailment tasks.

However, when conversation structures are incorporated in the representation of conversation segments, the explicit alignment model makes some improvement over the implicit alignment model (as in Figure 6.7(b)). Though none of these improvements is statistically significant, we may hypothetically extend our observation from Section 6.3.3 to the situation here: in alignment models, the advantage of the explicit modeling of long distance relationship over the implicit modeling is also dependent on the incorporation of conversation structures in the conversation representation.

We further break down the comparison results for the two configurations in Figure 6.7(b) by different types of hypotheses, and the results are shown in Figure 6.8.
We can see that in both Figure 6.8(a) and Figure 6.8(b), the explicit alignment model improves the entailment performance for the fact type of hypotheses compared to the implicit alignment model (in Figure 6.8(a) the improvement is statistically significant by McNemar's test, p < 0.05). This is consistent with what we found in Section 6.2.3: the advantage of explicit modeling of long distance relationship in the alignment model is most noticeable for fact hypotheses.

[Figure 6.8: Effect of different alignment models for different hypothesis types. (a) With the structural representation and the implicit inference model; (b) with the structural representation and the explicit inference model.]

On the other hand, Figure 6.8 also demonstrates why the explicit alignment model did not bring a significant improvement to entailment performance in Figure 6.7: it decreases the entailment performance for hypotheses other than the fact type.

So why does the explicit alignment model improve the alignment performance for all hypothesis types in Section 6.2.3 but decrease the entailment performance for certain subsets here? The cause can be illustrated by the following example:

Premise:
B: Well, don't you think a lot of that is diet too?
A: and, a lot of that is diet. That's true.
Hypothesis: A agrees that a lot of health has to do with diet.

This is a true entailment (assuming that in the premise is resolved to health). However, our current entailment system was not able to identify it correctly, because the verb phrase has to do in the hypothesis has no alignment in the premise. Recognizing that has to do is merely a way of expressing an arbitrary relationship requires the knowledge of paraphrasing. Nonetheless, our implicit alignment model mistakenly aligns the hypothesis term do with the premise term is (both occurrences), which makes the inference model predict a positive entailment (which happens to be correct). Since the explicit alignment model corrects this alignment mistake, the entailment can no longer be recognized. This is another piece of evidence that a better alignment performance does not necessarily mean a better entailment performance.

Chapter 7
Discussions

This thesis provides a first step in the research on conversation entailment. The best configuration of our system achieves 37.5% precision and 52.6% recall on the verb alignment task (which constitute a 43.8% f-measure, as in Figure 6.2(b)), and 58.7% accuracy on the entailment prediction task (as evaluated in Section 6.3.3). Although this is a significant improvement over the baseline system for textual entailment, a better performance is desirable. In this chapter we identify several issues faced by the current system and discuss potential improvements.

7.1 Cross-validation

The size of the data used in our investigation is very small. Currently we have 291 entailment examples for training and 584 entailment examples for testing. This affects both the learning of a reliable model and the evaluation of the model's performance. To better use our limited amount of data, we conducted cross-validation evaluations, utilizing part of the test data set for training.

The methodology for our cross-validation experiments is a modified version of
leave-one-out evaluation. The entailment examples to be evaluated are the 584 examples in the test set, so that the evaluation results can be compared to our previous experiments. However, when evaluating each example in the test set, we use entailment models trained from the 291 development examples together with the 583 examples in the rest of the test set. This gives us 874 examples for each round of model training. We conducted the cross-validation experiment for both the alignment model and the inference model, and we evaluate the results on both the verb alignment task and the entailment prediction task.

[Figure 7.1: Comparing the cross-validation model and the model learned from development data on verb alignment results.]

Figure 7.1 shows a comparison between the evaluation results on verb alignment produced by the alignment model with cross-validation and the evaluation results produced by the alignment model trained from the development data only (Dev-trained, which comes from Figure 6.2(b)). Other configurations are the same for the two evaluations (structural conversation representation and explicit modeling of long distance relationship). Unsurprisingly, for the entire test set and for most of the hypothesis types, the cross-validation model achieves better performance than the model learned from the development data only.

To study cross-validation of the inference model, we conducted multiple experiments based on different alignment results produced by various alignment models (including the above results produced by cross-validation), and the best accuracy on entailment prediction is 58.9%. Compared to the previous result obtained by the inference model learned only from the development data, 58.7%, there is no significant difference. This illustrates that (1) better alignment results do not necessarily lead to better entailment results, which is again consistent with the findings by MacCartney et al. [60] and with our previous experiments in Section 6.4.2; and (2) the performance of the inference model cannot be improved by simply feeding it more training data. Better semantic representation in the models becomes critical.

7.2 Semantics

In Section 6.1.2 we mentioned that there are different levels of modeling the long distance relationship between language constituents using the pattern of the path connecting them. Our current modeling of long distance relationship is on a relatively abstract level. That is, we model the long distance relationship between two language constituents by the types of the nodes (N/V/U) and the directions of the edges (→/←) connecting them in a dependency structure. While such modeling captures some shallow semantics of the relationship between language constituents, it is very general and thus insufficient to differentiate specific semantic relations.

Consider two natural language statements "I have to go see ..." and "I think it's good to see ...". Figure 7.2(a) and Figure 7.2(b) show the dependency structures of these two statements respectively.

[Figure 7.2: The dependency structures for examples of shallow semantic modeling. (a) "I have to go see ..."; (b) "I think it's good to see ...".]

In both figures, the long distance relationship between see and I is modeled as V → V → V ← N according to the current explicit modeling of LDR. Therefore, an alignment model learned from the first example may recognize a verb-subject relationship represented by V → V → V ← N, since I is the logical subject of see in that example.
However, when the alignment model is applied to the second example, such recognition becomes incorrect, because I is not the logical subject of see in "I think it's good to see ...".

A similar problem also occurs in inference models. For example, in the statements "A comes from Michigan" and "A went to Michigan", the relation between A and Michigan is in both cases modeled as N → V ← N. If both statements are associated with the same hypothesis containing a relational clause from(A, Michigan), an inference model learned from the first instance would recognize that N → V ← N entails the relation from(·, ·) (because "A comes from Michigan" entails from(A, Michigan)). But when the model is applied to the second instance, it will make a wrong prediction, because in that case N → V ← N does not entail from(·, ·).

These problems can be resolved by adding more semantic information into the explicit modeling of long distance relationship, i.e., using more detailed patterns to represent it. For example, in Figure 7.2(a), the relationship between see and I can be modeled with the dependency relation of each edge attached to the path pattern, which yields a labeled pattern different from the one connecting see and I in Figure 7.2(b). And the relationship between A and Michigan in "A comes from Michigan" is refined to N → V ← N with the left edge labeled subject and the right edge labeled from, while in "A went to Michigan" the right edge is labeled to. However, such finer-grained models are not likely to generalize well to new examples given our limited amount of data. Again, more training data will play an important role here.

Besides the semantic modeling of long distance relationship, our current system is also insufficient in modeling some types of lexical semantics, e.g., word sense and antonyms. Consider the verbs see and hear: in most cases they should not be aligned together, since they represent different types of actions. However, one of the senses of see in WordNet [64] is "get to know or become aware of", and so is one of the senses of hear. Therefore, without the capability of word sense disambiguation, our system frequently aligns these two terms together.

The importance of antonym modeling can be illustrated by the following example:

Premise:
A: Do yo-, are you on a reg-, regular exercise program right now?
B: Yes, and I hate it.
Hypothesis: B doesn't like her exercise program.

A correct prediction of this true entailment would be to align the term like in the hypothesis to the term hate in the premise, and then to recognize that the antonym of hate (like) combined with a negative modifier (doesn't) has the same meaning as the original term hate. However, the polarity check module in our current system only checks the number of negative modifiers, not the polarities of the verbs themselves. As the verb like has one negative modifier and the verb hate has none, the polarity check module identifies hate and doesn't like as having different polarities, and thus mistakenly predicts the entailment to be false. Lexical semantics has been extensively studied by other researchers [63, 68]. Tools and knowledge bases in this area should be utilized in the entailment system to improve its performance.

7.3 Pragmatics

The most important pragmatic features in conversation entailment are ellipsis, pronoun usage, and conversation implicature.

7.3.1 Ellipsis

In Section 5.1.3 we summarized some unique features frequently seen in conversations. One of them is ellipsis. Ellipsis in conversations can be particularly challenging to both the alignment and the inference models.
For example:

Premise:
A: Did you go to college?
B: I'm going right now.
Hypothesis: B is going to college.

In the conversation utterance of speaker B, the object of the verb going is omitted; it is actually the term college in speaker A's utterance. Such a relationship between the verb going and its omitted object college needs to be recognized by the alignment model in order to align the hypothesis term going to the term going in speaker B's utterance, and needs to be recognized by the inference model in order to infer the relational clause to(going, college).

7.3.2 Pronoun Usage

In Table 4.1 we saw a conversation entailment example with special pronoun usage:

Premise:
A: Sometimes unexpected meetings or a client would come in and would want to see you.
B: Right.
Hypothesis: Sometimes a client wants to see B.

In this example the pronoun you in speaker A's utterance does not refer to speaker B, but rather to speaker A him/herself. In other cases, a pronoun can refer to a general concept, for example:

Premise:
A: Um, matter of fact in the United States we used to have extended families.
B: Uh-huh.
Hypothesis: A used to have extended families.

where the pronoun we in speaker A's utterance refers to the general concept of people in the United States, and does not necessarily involve speaker A him/herself. In both of these cases, the generic or rhetorical usages of pronouns pose special challenges to the correct entailment prediction for the hypotheses.

7.3.3 Conversation Implicature

In Section 4.3.3 we pointed out that conversation entailment is quite a challenging task, as the human annotators could not reach agreement on the entailment decisions for a considerable number of examples. We attribute some of the disagreements to different understandings of conversation implicature among the annotators.

Here we have another example, which did not cause that much trouble for the human annotators but is still challenging to our entailment system due to the difficulty of recognizing conversation implicature.

Premise:
A: While learning aerobics, you can just trust someone else.
Hypothesis: A trusts her aerobics instructor.

In this example, in order to recognize that the hypothesis is a true entailment from the conversation segment, the system has to recognize in the conversation that someone else implies a person who teaches aerobics.

7.4 Knowledge

As discussed in Section 2.1.4, a key factor that affects the performance of an entailment system is the amount of knowledge available to the system. In our entailment system, all of the components (clause representation, alignment model, inference model) contain certain kinds of knowledge. To some extent, the limitations of the current system in semantics and pragmatics (as discussed in Section 7.2 and Section 7.3) are essentially due to a lack of knowledge. In this section we discuss two more kinds of knowledge that are missing from our system.

7.4.1 Paraphrase

In Section 6.4.2 we saw an entailment example that requires the knowledge of paraphrasing. Here is another one:

Premise:
A: My TV viewing started sort of mid-sixties when I was really little.
B: I see.
Hypothesis: At mid-sixties, A was a small child.

Our entailment system fails to recognize that this is a true entailment, because it cannot find an alignment for the hypothesis term child in the premise.
A more knowledgeable entailment system would recognize that was a small child is in fact another way of saying was really little.

The knowledge of paraphrases has been accumulated from large linguistic corpora [55, 78, 86] and has been applied to the textual entailment task by other researchers [27]. However, most of these efforts limit the acquisition and application of paraphrases to binary representations, i.e., paraphrases with two variables (e.g., X prevents Y → X provides protection against Y). As seen in the example above, unary paraphrases are often also useful in entailment recognition (e.g., X was really little → X was a small child). There is recent work on the acquisition of such paraphrases [85]. The application of this type of paraphrase to the entailment task will be interesting to investigate.

7.4.2 World Knowledge

The importance of world knowledge in the conversation entailment task can be demonstrated by the following example:

Premise:
A: I use my credit card a great deal, um, for groceries.
Hypothesis: A does grocery shopping.

The necessary (and sufficient) knowledge needed to recognize this entailment is that a credit card is used for shopping. Unfortunately, such knowledge is missing from our system, so the term shopping in the hypothesis becomes a new entity for which no alignment can be found in the premise. As a result, the system is unable to predict that this is a true entailment.

7.5 Efficiency

Generally speaking, our entailment system is designed for offline processing. The conversation segments and the hypotheses are given in batches, so efficiency has not been a main focus of our development. Here we provide some general discussion of the efficiency issues should our entailment system become part of an online application.

There are three main components in our system: the decomposition model, the alignment model, and the inference model.

The decomposition model works upon the decomposition rules in Appendix A. Its efficiency depends on its input (syntactic parse trees) and the number of decomposition rules. Overall, the complexity of the decomposition model is proportional to the size (number of internal nodes) of the syntactic trees. However, for a particular substructure in the syntactic tree (e.g., S → NP VP), its complexity varies. It is simpler if there is a matching rule in the rule set for the syntactic substructure (e.g., S → NP VP); in this case the processing time is constant for that substructure. When there is no match for the substructure (e.g., S → NP VP PP), as described in Appendix A, the system will search for a rule to reduce this substructure (e.g., use VP → VP PP to reduce S → NP VP PP to S → NP VP). This is a recursive process. In the case that a single span in the syntactic tree is very large (i.e., one parent has many children on the same layer), the process can be very slow. Fortunately, this rarely happens on the conversation data, since utterance lengths are short.

The complexities of the alignment and inference models can be further divided into three parts: feature calculation, model building, and model application. The calculations of the binary features (string equality, stemmed equality, acronym equality, named entity equality, and verb-be identification) are trivial. The calculation of WordNet similarity can be efficient, as long as the relevant probabilities of each WordNet class are pre-calculated and stored.
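As one illustration of this pre-computation, the sketch below loads NLTK's packaged information-content (class probability) table once and reuses it to score WordNet similarity between two verbs. The specific similarity measure (Lin) and tooling are assumptions for illustration, not necessarily what the system itself used.

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
from nltk.corpus.reader.wordnet import WordNetError

IC = wordnet_ic.ic('ic-brown.dat')   # pre-computed WordNet class probabilities

def wordnet_similarity(word1, word2, pos=wn.VERB):
    """Best Lin similarity over all sense pairs of the two words."""
    best = 0.0
    for s1 in wn.synsets(word1, pos):
        for s2 in wn.synsets(word2, pos):
            try:
                best = max(best, s1.lin_similarity(s2, IC))
            except WordNetError:      # sense pair with no common subsumer
                continue
    return best

# e.g., wordnet_similarity("suggest", "see") is cheap once IC is loaded
```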
The calculation of distributional similarity needs to obtain the document count of particular terms in a large text corpus. We use the Lemur retrieval engine (http://www.lemurproject.org/) to handle efficient query search.

The calculations of the LDR features (subject consistency, object consistency, and the feature for the relational inference model) depend on the sizes of the alignments, i.e., how many premise terms each hypothesis term is aligned to. These size statistics depend on the alignment model used and (for the implicit alignment model) the output threshold of the logistic regression model. For the implicit alignment model with threshold 0.7, a hypothesis term is aligned to 1.57 premise terms on average; for the explicit model, a hypothesis term is aligned to 1.55 premise terms on average. In either case, the alignment size results in relatively low complexity. After the premise terms are selected, the system needs to calculate the distance (for implicit LDR models) or the path pattern (for explicit LDR models) between the selected terms. These turn out to be the same problem, with an efficient solution: a breadth-first search on the dependency graph. The complexity of such a search is bounded by the number of vertices in the graph, i.e., the number of terms in the clause representation of a premise. Statistics on our test data show that the average number of terms in a conversation segment is 31.3.

The complexities of model building and model application depend on the statistical model used. For the logistic regression model, model application can be done in constant time, and the main complexity lies in model building. This is an iterative procedure (a quasi-Newton search method). We used the toolkit provided by Weka (http://www.cs.waikato.ac.nz/ml/weka/), which is an implementation of the algorithm given by le Cessie and van Houwelingen [53], known to converge quickly.

For the k-nearest neighbour model, "model building" is just storing the training instances, so the complexity mainly lies in model application, which in our current implementation is proportional to the number of training instances for each test instance. Currently our kNN models are trained from the development set of 291 entailment examples. On average each entailment example has 32.4 premise terms and 5.38 hypothesis terms, so the average number of alignment instances (potential pairings of terms between each hypothesis and premise) is roughly 174 per entailment example, and about 5 × 10^4 for the whole development set. Moreover, the total number of relational clauses in the hypotheses of the development set is 727. So for both the alignment model and the inference model, our current implementations of the kNN models have acceptable efficiency. However, if a larger set of training data becomes available, fast search techniques for k-nearest neighbours (e.g., spatial indices) should be considered.

Chapter 8
Conclusion and Future Work

Currently there are no other data sets or published studies that address the problem of conversation entailment. This thesis is one of the first studies to investigate this problem. In this chapter, we summarize our contributions and future work.

8.1 Contributions

This thesis has made the following contributions.

1. Systems and computational models addressing the conversation entailment problem.

We developed a probabilistic framework for the entailment problem. The framework is based on the representation of dependency structures of language. The overall entailment prediction depends on the entailment relations between substructures. For conversation entailment, we incorporated conversation modeling (e.g., dialogue acts, conversation participants, and conversation discourse) into the dependency structures. The modeling of conversation structures has been shown to be effective and contributed a significant improvement (an absolute difference of 4.8%) in system performance. In particular, it is critical for predicting
The overall entailment prediction depends on the entailment relations between sub- structures. F or conversation entailment, we incorporated conversation modeling (e. g., dialogue acts, conversation participants, and conversation discourse) into the dependency structures. The modeling of conversation structures has been shown effective and has contributed to a significant improvement (an absolute difference of 4.8%) in system performance. Especially, it is critical for predicting 135 the entailment of intent type of hypotheses (with an absolute improvement of 25%). . A systematic investigation on the roles and interactions of different. models and representations in the overall entailment prediction. We experimented with several alignment and inference models in the conversa— tion entailment system, investigating different approaches of shallow semantic modeling. Our studies have shown that the explicit modeling of long distance relationship based on the path between two language constituents is useful in both the alignment and the inference models. It improved the entailment per— formance by 3.9% on the test data compared to the implicit modeling of long distance relationship (based on distance). Specifically, its effect in the alignment model is more prominent for the fact, belief, and desire types of hypotheses, while its effect in the inference model is the most prominent for intent hypothe- ses. In addition, the effect of explicit modeling of long distance relationship is largely dependent on the presence of structural information in conversation representations. Similarly, the modeling of conversation structure is more effec- tive when the computational models incorporate shallow semantic information (using explicit modeling of long distance relationship). . A data corpus of conversation entailment examples to facilitate the initial in- vestigation on conversation entailment. We collected 1096 conversation entailment examples. Each example consists of a segment of conversation discourse and a hypothesis statement. The data set used in this thesis includes 875 examples with at least 75% agreement among five annotators. This data set is made available at http://links.cse.msu. edu : 8000/ lair/ pro 3' ects/ conversat ionentai 1ment_data . html. Although a larger data set is preferable, the small collection of data resulted from this thesis 136 supports initial investigation and evaluation on conversation entailment. 8.2 Future Work Conversation entailment is a challenging problem. This thesis only presents an initial investigation. More systematic and in-depth studies are required to further under- stand the nature of the problem and develop better technologies. Data. The availability of relevant data is always a major issue in language re— lated research. Although our current data enabled us to start an initial investigation on conversation entailment, its small size poses significant limitations on technology development and evaluation. A more systematical approach to collect and create a large set of data is crucial. One possible future direction is to develop innova- tive community-based approach (e.g., through web) for data collection. Annotations based on Mechanical Turk can also be pursued. Semantics and Pragmatics. As discussed in Sections 7.2, 7.3, and 7.4, our seman— tic modeling in the entailment system is pretty shallow. Our clause representation for both the conversation segment and the hypothesis statement is mainly syntactic- driven (based on dependency structure). 
As more techniques in semantic processing (e.g., semantic roles) become available, the representation should capture deeper semantics. The models should also address pragmatics (e.g., conversation implicature) and incorporate more world knowledge.

Applications. Finally, as the technology for conversation entailment is developed, its applications to NLP problems should be explored. Example applications include information extraction, question answering, summarization from conversation scripts, and modeling of conversation participants. These applications may provide new insights into the nature of the conversation entailment problem and its potential solutions.

Appendices

Appendix A
Syntactic Decomposition Rules

This appendix lists the rules used for syntactic decomposition (Section 3.1.1). Each decomposition rule is built upon a grammar rule (e.g., S → NP VP) and has two parts.

The first part selects the head for each syntactic constituent. For example, for S → NP VP we define the head of S to be the head of its VP child. Since VP is the second child of S, we represent this head as h2. The head of VP is in turn obtained recursively from rules spanning VP (e.g., VP → VP NP). The head of a leaf constituent is a term representing that constituent (e.g., for NNP → John we have h1 = John). It is possible that a head is the concatenation of multiple children (e.g., for NP → NNP NNP the head of NP is derived as h1h2), in which case we use a single term to represent the concatenated entity. It is also possible that a constituent has two heads (e.g., for NP → NP CONJP NP we have h1, h3). The difference between these two notions should be noted.

The second part generates property or relational clauses from the syntactic substructure. For example, for VP → ADVP VP we generate a property clause h1(h2), which means the head of the second child (VP) has a property described by the head of the first child (ADVP). For S → NP VP we generate a relational clause subject(h2, h1), which means the subject of the head of the second child (VP) is the head of the first child (NP). It is possible for a single grammar rule to generate multiple clauses.

Grammar rules that are not included in this list can usually be reduced to two or more basic rules. For example, S → NP VP PP can be seen as a combination of S → NP VP and VP → VP PP. So we first apply the rule VP → VP PP to the last two children of S → NP VP PP and reduce it to S → NP VP, which has an applicable rule in our rule set. In this way we can deal with an infinite number of syntactic structures. For example, for the rules

NP → NNP NNP
NP → NNP NNP NNP
NP → NNP NNP NNP NNP

there could be an arbitrary number of NNP nodes in the child list of NP. So we define the following rules:

nnp → NNP
nnp → NNP nnp        (A.1)
NP → nnp

such that the reduction of the rule NP → NNP NNP NNP NNP can be

NP → NNP NNP NNP nnp
NP → NNP NNP nnp
NP → NNP nnp
NP → nnp

As we can see, the grammar NP → NNP ... NNP with an arbitrary number of NNPs can always be reduced to a combination of the three basic rules defined in (A.1) (note that the syntactic constituents here are case-sensitive, with lower-case constituents serving as an intermediate layer that does not occur in an original grammar rule).
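A minimal sketch of this reduction loop is given below. The rule inventory and the right-to-left folding strategy are simplified assumptions for illustration (the real decomposition also computes heads and emits clauses, per Table A.1); it only shows how an unseen span such as S → NP VP PP is folded until a known rule applies.

```python
# Hypothetical rule inventory: (parent, right-hand-side) pairs the system knows.
KNOWN = {("S", ("NP", "VP")), ("VP", ("VP", "PP")), ("NP", ("NNP", "NNP"))}

def reduce_span(parent, children):
    """Fold the rightmost children with known rules until the whole span matches
    a known rule; return the matched rule, or None if no reduction applies."""
    children = list(children)
    while (parent, tuple(children)) not in KNOWN:
        for head, rhs in KNOWN:                      # try to fold a known rule
            k = len(rhs)
            if k < len(children) and tuple(children[-k:]) == rhs:
                children[-k:] = [head]               # e.g., VP PP folds into VP
                break
        else:
            return None                              # no applicable reduction
    return (parent, tuple(children))

# reduce_span("S", ["NP", "VP", "PP"]) -> ("S", ("NP", "VP"))
```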
The full set of decomposition rules that we use is listed in Table A1 Table A1: Rules for syntactic decomposition Grammar rule Head(s) Clause(s) nnp —+ NNP hl nnp —-> NNPS hl nnp ——> FW hl nnp —> NNP nnp hlhg nnp —> FW nnp hlhg np —> nnp hl HP 4 QP h1 np —> NN hl up —-> NNS hl up -—> VBG hl NP ——> np hl NP —> ADJP h1 NP —> DT ADJP h1h2 NP ——+ NP POS [11 NP ——+ PRP hl NP ——> PRPfB hl NP —+ DT hl NP —> EX hl NP —» DT NP h2 Table A. 1: (continued) Grammar rule Head(s) Clause(s) NP —+ NP NP h2 modifier(h2,h1) NP -—> attr NP h2 h1(h2) NP -—> NP attr hl h2(hl) NP —> NP PP hl preposition(h1, h2) NP —+ NP SBAR hl modifier(h1, h2) NP —> NP , NP hl is(h1,h3) NP —> NP PRN hl is(h1, h2) NP —> “ NP h2 NP —> NP ” hl NP —+ NP , hl NP —> NP . hl NP ——> -LRB- NP -RRB- h2 npcc ——+ NP CONJP NP h1, h3 npcc —-+ NP CC PRN NP h1, h4 npcc —-> NP , npcc h1, h3 NP —> npcc hl NP —> VBG NP hl, hg modifier(h2,h1), object(h1, hg) NP —+ NP S hl subject(h2,h1) NP —+ NP : NP hl is(h1,h3) NP —> ADVP NP hg h1(h2) NP —> ADVP npcc hg h1(h2) NP —+ NP ADVP hl h2(h1) NP —> ADVP hl NP ——> NP : NP : hl is(h1,h3) Table A. 1: (continued) Grammar rule Head(s) Clause(s) NP ——+ CC NP CC NP h2, h4 NP -—> PDT NIP h2 h1(l12) NP -—* NX hl DT -> ADVP DT h1h2 attr -—> ADJP hl attr —-> VP hl ADJP —> JJ hl ADJP —+ JJR hl ADJP —> JJS hl ADJP —+ ADVP ADJP hlhg ADJP —-> “ ADJP ” h2 ADJP —> -LRB— ADJP -RRB- h2 ADJP —> ADJP , hl ADJP —-) ADJP CC ADJP h1,h3 ADJP —+ ADJP PRN hl, hg ADJP —> ADJP PP h1h2 ADJP —> ADJP S hlhg ADJP —+ NP ADJP h1h2 ADJP —+ ADJP SBAR hlhg ADJP —> RB hl ADJP —> UCP hl ADVP —> RP hl ADVP —+ RB hl ADVP —> RBR hl Table A. 1: (continued) Grammar rule Head(s) Clause(s) ADVP -—> RBS hl ADVP —> ADVP ADVP h1, h2 ADVP —+ NP RB h1h2 ADVP —+ ADVP , hl ADVP —> IN hl ADVP —-> ADVP CONJP ADVP h1, h3 ADVP ——+ ADVP PP h1h2 QP —> CD hl QP ——> CD QP hlhg QP —+ QP TO QP h1h2h3 QP —> $ QP h1h2 QP —-> JJ IN QP h1h2h3 QP —+ IN JJS QP h1h2h3 QP —-+ RB QP h1h2 QP -—> OF CONJP QP h1, h3 CONJP —> CC hl CONJP —> CC ADVP h1h2 CONJP —i CC , ADVP h1h3 PP —> IN NP hlhg PP —-> TO NP h1h2 PP —+ IN S h1h2 PP ——> IN SBAR h1h2 PP —» vb NP h1h2 PP —-> vb PP h1h2 145 Table A. 1: (continued) Grammar rule Head(s) Clause(s) PP -—> vb SBAR ’1th PP —> : PP : h2 PP -—> PP , hl PP —+ ADVP PP hlhg PP —+ ADJP PP hlhg PP —> IN PP h1h2 PP —> IN ADJP h1h2 PP ——> IN ADVP h1h2 prn ——+ S hl prn —+ SBARQ hl prn —’ Pm , hl pm --> , prn h2 prn ——> . prn h2 PRN ——+ prn hl PRN —> -LRB- NP —RRB- h2 PRN —+ -LRB- CC NP -RRB- h2h3 vb ——+ VB hl vb ——> VBD hl vb ——+ VBN hl vb -> VBZ hl vb —> VBP hl vb —> VBG hl vb ——> BES hl vb ——r HVS h1 Table A1: (continued) Grammar rule Head(s) Clause(s) VP —> vb hl VP —> vb VP hlhg VP —> MD VP hlhg VP —* VP NP hl object(h1, hg) VP —+ VP : NP hl object(h1,h3) VP —-* TO VP h2 VP -+ ADVP VP h2 hl(hg) VP ——> VP ADVP hl 122(111) VP -—* VP PRT hl hg(h1) VP ——> VP PP hl preposition(h1, hg) VP —> PP VP h2 preposition(h2, hl) VP —-> VP S hl adverbial(h1, hg) VP —> VP SBAR hl object(h1, hg), adverbial(h1, hg) VP —+ vb ADJP hlhg VP —+ VP , hl VP —> , VP h2 vpcc ——> VP CONJP VP h1, h3 vpcc ——i VP , vpcc h1, h3 vpcc ——> VP CONJP vpcc h1, h3 VP —> vpcc hl S -—* S , hl S —-> S . hl S —-> “ S hg S —> S ” hl Table A. 
1: (continued) Grammar rule Head(s) Clause(s) S —> VP hl S ——> NP VP h2 subject(h2,h1) S -> PP S h2 preposition(hg,h1) S —i ADVP S 112 h1(h2) S —+ SBAR , S h3 adverbial(h3,h1) S ——+ S CC S h1, h3 S —> CC S h2 S ——> CC , S h3 S -+ PRN S h2 S -—+ NP ADJP h1h2 h2(h1) S —-> S S h1, h2 S —+ NP hl S —> CONJP S Ill/12 SBAR ——> S hl SBAR —> CC S 12.2 SBAR —> WHNP S 17.2 SBAR —+ WHADVP S h2 SBAR —> WHPP S h2 SBAR —-> IN S h1h2 SBAR —+ RB IN S h1h2h3 SBARQ —> WHNP SQ . hg subject(h2,h1) SBARQ —+ WHADVP SQ . hg subject(h2,h1) WHNP —> WP hl WHNP —) WDT hl Table A.1: (continued) Grammar rule Head(s) Clause(s) WHNP —» WHNP PP h1h2 WHADVP—> W RB hl SQ —+ S h1 PRT —> RP hl PRT ——> RB hl 149 Appendix B List of Dialogue Acts Table B.1 lists the 69 dialogue act. labels used by the annotation system of Switchboard dialogue corpus [38]. Table B.1: The dialogue act labels used by Switchboard annotation system q question 5 statement b backchannel / backwards-lookin g f forward-looking a agreements '/. indeterminate, interrupted, or contains just a floor holder (“u unrelated response (first utterance is not response to previous q) * comment (followed by *[fcomment. . .]] after transcription to ex- plain) + continued from previous by same speaker ©,o@,+@ incorrect transcription (can add comment to specify problem fur- ther) 150 Table B. 1: (continued) “h aap ad aa ba bc bd collaborative completion about-communication declarative question (question asked like a structural statement) [on statements] elaborated reply to yes-no-question tag question (question asked like a structural statement with a ques- tion tag at end) hold (often but not always after a question) (Let me think) (question in response to a question) mimic other quotation repeat self about-task accept-part action-directive (Go ahead. We could go back to television shows) accept (0k. I agree) maybe reject (no) reject-part default agreement or continuer (uh-huh, right, yeah) repeat-phrase assessment/ appreciation (I can imagine) correct-misspeaking downplaying-response-to-sympathy/compliments (That’s all right. That happens) 151 Table B. 1: (continued) bf bh bk br br‘m br“c by cc co fa fc fe fo fP ft fw fx na nd ng reformulate/summarize; paraphrase/summary of other’s utterance (as opposed to a mimic) rhetorical question continuer (Oh really?) acknowledge-answer (Oh, okay) signal—non—understanding (request for repeat) signal-non—understanding via mimic non-understanding due to problems with phone line sympathetic comment (I’m sorry to hear about that) commit offer apology (Apologies) (this is not the I ’m sorry of sympathy which is by) conventional-closing exclamation (Ouch) other-forward-function conventional-opening thanks (Thank you) welcome (You’re welcome) explicit-performative (you ’re filed) a descriptive/narrative statement which acts as an affirmative an- swer to a question answer dispreferred (Well...) a descriptive/ narrative statement which acts as a negative answer to a question no or variations (only) 152 Table B. 
1: (continued) no my 00 qh qo qr qrr qw qy sd SV t1 t3 a response to a question that is neither affirmative nor negative (often I don’t know) yes or variations (only) other open-option (We could have lamb or chicken) rhetorical question open ended question alternative (or) question an or-question clause tacked onto a yes—no-question wh-question yes-no-question descriptive and/ or narrative (listener has no basis to dispute) viewpoint, from personal opinions to proposed general facts (listener could have basis to dispute) self-talk third-party-talk nonspeech The tags in Table B.l were used in combination in annotating the Switchboard conversations. Thus a total number of 226 combined labels were created. After that, they removed the tag combinations that occurred infrequently, removed the secondary carat-dimensions (“2, “g, “in, “r, e, q, “d, but with some exceptions [38]), and grouped together some tags that had very little training data. This resulted in 42 classes of dialogue acts. The mapping between the dialogue act classes and the original A A 153 tags are in Table 8.2. These 42 acts were later summarized as a comprehensive list by Stolcke et al. [84]. In this thesis we also use the same set as the tagging system of dialogue acts. Table B2: The dialogue acts used in this thesis Dialogue act Tag Example Statement-non-opinion sd Me, I’m in the legal department. Acknowledge (Backchannel) b Uh-huh. Statement-opinion sv I think it’s great. Agree/ Accept aa That’s exactly it. Abandoned or Turn-Exit 7. - So, - Appreciation ba 1 can imagine. Yes—No-Question qy Do you have to have any special train- ing? Non-verbal x [Laughter], [Throat-clearing] Yes answers ny Yes. Conventional—closing f c Well, it’s been nice talking to you. Wh-Question qw Well, how old are you? No answers nn N 0. Response Acknowledgement bk Oh, okay. Hedge h I don’t know if I’m making any sense or not. Declarative Yes-No-Question qy“d So you can afford to get a house? Other 0 130 be Well give me a break, you know. by in Backchannel in question form bh Is that right? Quotation “q You can’t be pregnant and have cats. 154 Table B.2: (continued) Dialogue act Tag Example Summarize/reformulate bf Oh, you mean you switched schools for the kids. Affirmative non-yes answers na ny"e It is. Action-directive ad Why don’t you go first? Collaborative Completion "2 Who aren’t contributing. Repeat-phrase b“m Oh, fajitas. Open-Question go How about you? Rhetorical-Questions qh Who would steal a newspaper? Hold before answer/ agreement “h I’m drawing a blank. Reject ar Well, no. Negative non-no answers ng nn“e Uh, not a whole lot. Signal-non-understanding br Excuse me? Other answers no I don’t know. Conventional-opening fp How are you? Or-Clause qrr or is it more of a company? Dispreferred answers arp nd Well, not so much that. 3rd—party~talk t3 My goodness, Diane, get down from there. Offers, Options Commits 00 cc co I’ll have to check that out. Self-talk 131 What’s the word I’m looking for. Downplayer bd That’s all right. Maybe/Accept-part aap am Something like that. Tag-Question “g Right? Declarative Wh-Question qw‘d You are what kind of buff? Table 82: (continued) Dialogue act Tag Example Apology fa I’m sorry. Thanking ft Hey thanks a lot. 156 Bibliography 157 [1] E. Akhmatova. Textual entailment resolution via atomic propositions. In Pro- ceedings of the PASCAL RTE Challenge Workshop, 2005. [2] J. Allen. Natural language understanding. The Benjamin/ Cummings Publishing Company, Inc., Redwood City, CA, USA, 1995. [3] J. Allen and M. 
Core. Draft of DAMSL: Dialog Act Markup in Several Layers, 1997. ‘ [4] A. Anderson, M. Bader, E. Bard, E. Boyle, G. M. Doherty, S. Garrod, S. Isard, J. Kowtko, J. McAllister, J. Miller, C. Sotillo, H. S. Thompson, and R. Weinert. The hero map task corpus. Language and Speech, 34:351—366, 1991. EN. [5] J. L. Austin. How to Do Things with Words. Harvard University Press, Cam- bridge, MA, 1962. [6] L. R. Bah], F. Jelinek, and R. L. Mercer. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(2):179—190, March 1983. [7] C. F. Baker, C. J. Fillmore, and J. B. Lowe. The berkeley framenet project. In Proceedings of the 17th international conference on Computational linguis— tics, pages 86—90, Morristown, NJ, USA, 1998. Association for Computational Linguistics. [8] R. Bar-Haim, ll. Dagan, B. Dolan, L. Ferro, D. Giampiccolo, B. Magnini, and I. Szpektor. The second pascal recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Tertual Entailment, Venice, Italy, 2006. [9] L. E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains. Annals of Mathematical Statistics, 41(1):164—171, 1970. [10] L. Bentivogli, I. Dagan, H. T. Dang, D. Giampiccolo, and B. Magnini. The fifth pascal recognizing textual entailment challenge. In Proceedings of the Second T ext Analysis Conference (TAC 2009), Gaithersburg, Maryland, USA, 2009. [11] T. Bocklet, A. Maier, and E. N6th. Age determination of children in preschool and primary school age with gmm-based supervectors and support vector ma- chines/regression. In TSD ’08: Proceedings of the 11th international conference on Text, Speech and Dialogue, pages 253—260, Berlin, Heidelberg, 2008. Springer- Verlag. [12] J. Bos and K. Markert. Recognising textual entailment with logical inference. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 628-635, Vancouver, British Columbia, Canada, October 2005. Association for Computational Lin- guistics. 158 [13] C. Boulis and M. Ostendorf. A quantitative analysis of lexical differences between genders in telephone conversations. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (A CL ’05), pages 435—442, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. [14] C. Brockett. Aligning the rte 2006 corpus, 2007. Technical Report MSR—TR— 2007—77. [15] G. Carenini, R. T. Ng, and X. Zhou. Summarizing email conversations with clue words. In WWW ’07: Proceedings of the 16th international conference on World Wide Web, pages 91—100, New York, NY, USA, 2007. ACM. [16] J. C. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, I. McCowan, W. M. Post, D. Reidsma, and P. Wellner. The ami meeting corpus: A pre-announcement. In S. Renals and S. Bengio, editors, Machine Learning for Multimodal Interaction, Second International Workshop, Edinburgh, UK, volume 3869 of Lecture Notes in Computer Science, pages 28—39, Berlin, 2006. Springer Verlag. [17] F. Chang, G. S. Dell, and K. Bock. Becoming syntactic. Psychological Review, 113(2):234-272, April 2006. [18] Y.-W. Chen and C.-J. Lin. Combining svms with various feature selection strate- gies. In I. Guyon, M. Nikravesh, S. Gunn, and L. 
Zadeh, editors, Feature Ea:- traction, volume 207 of Studies in Fuzziness and Soft Computing, pages 315—324. Springer Berlin / Heidelberg. [19] K. W. Church. A stochastic parts program and noun phrase parser for un- restricted text. In Proceedings of the Second Conference on Applied Natural Language Processing, pages 136—143, Austin, Texas, USA, February 1988. Asso- ciation for Computational Linguistics. [20] J. Coates. Language and gender: a reader. Wiley-Blackwell, 1998. [21] S. Cohen. A computerized scale for monitoring levels of agreement during a conversation. In Proceedings of the 26th Penn Linguistics Colloquium, 2002. [22] I. Dagan, O. Glickman, and B. Magnini. The pascal recognising textual en- tailment challenge. In PASCAL Challenges Workshop on Recognising Textual Entailment, Southampton, UK, 11 - 13 April 2005. [23] C. C. David, D. Miller, and K. Walker. The fisher corpus: a resource for the next generations of speech-to—text. In Proceedings 4th International Conference on Language Resources and Evaluation, pages 69—71, 2004. [24] M.-C. de Marneffe, B. MacCartney, and C. D. Manning. Generating typed de- pendency parses from phrase structure parses. In LREC, 2006. 159 I25] [25] I27] [28] I29] [30] BM [32] I33] I34] [35] R. de Salvo Braz, R. Girju, V. Punyakanok, D. Roth, and M. Sammons. An inference model for semantic entailment in natural language. In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI), 2005. E. Dermataso and G. Kokkinakis. Automatic stochastic tagging of natural lan- guage texts. Computational Linguistics, 21(2):137—163, June 1999. G. Dinu and R. Wang. Inference rules and their application to recognizing textual entailment. In Proceedings of the 12th Conference of the European Chapter of the ACL (EA CL 2009), pages 211—219, Athens, Greece, March 2009. Association for Computational Linguistics. G. Doddington, A. Mitchell, M. Przybocki, and L. Ramshaw. The automatic con- tent extraction (ace) program—tasks, data, and evaluation. In Proceedings of the 4th International Conference on Language Resources and Evaluation {LREC), 2004. P. Eckert and S. McConnell-Ginet. Language and gender. Cambridge University Press, 2003. C. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998. V. S. Ferreira and K. Beck. The functions of struCtural priming. Language and Cognitive Processes, 21(7—8):1011—1029, November 2006. A. Fowler, B. Hauser, D. Hodges, I. Niles, A. Novischi, and J. Stephan. Apply— ing cogex to recognize textual entailment. In Proceedings of the PASCAL RTE Challenge Workshop, 2005. M. Galley. A skip-chain conditional random field for ranking meeting utterances by importance. In EMNLP ’06: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 364—372, Morristown, NJ, USA, 2006. Association for Computational Linguistics. M. Galley, K. McKeown, J. Hirschberg, and E. Shriberg. Identifying agree— ment and disagreement in conversational speech: Use of bayesian networks to model pragmatic dependencies. In Proceedings of the 42nd Meeting of the Asso- ciation for Computational Linguistics {ACL’04), Main Volume, pages 669-676, Barcelona, Spain, July 2004. N. Garera and D. Yarowsky. Modeling latent biographic attributes in conversa- tional genres. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 710—718, Suntec, Singapore, August 2009. 
[36] D. Giampiccolo, B. Magnini, I. Dagan, and B. Dolan. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1-9, Prague, June 2007. Association for Computational Linguistics.
[37] D. Giampiccolo, H. T. Dang, B. Magnini, I. Dagan, E. Cabrio, and B. Dolan. The fourth PASCAL recognizing textual entailment challenge. In Proceedings of the First Text Analysis Conference (TAC 2008), Gaithersburg, Maryland, USA, 2008.
[38] J. J. Godfrey and E. Holliman. Switchboard-1 Release 2. Linguistic Data Consortium, Philadelphia, 1997.
[39] D. Graff. The AQUAINT Corpus of English News Text. Linguistic Data Consortium, Philadelphia, 2002.
[40] A. Haghighi, A. Ng, and C. D. Manning. Robust textual inference via graph matching. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 387-394, Vancouver, British Columbia, Canada, October 2005. Association for Computational Linguistics.
[41] Z. S. Harris. Distributional structure. In J. J. Katz, editor, The Philosophy of Linguistics, pages 26-47. Oxford University Press, Oxford, 1985.
[42] V. Hatzivassiloglou and K. R. McKeown. Predicting the semantic orientation of adjectives. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, pages 174-181, Madrid, Spain, July 1997. Association for Computational Linguistics.
[43] A. Hickl. Using discourse commitments to recognize textual entailment. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 337-344, Manchester, UK, August 2008. Coling 2008 Organizing Committee.
[44] A. Hickl and J. Bensley. A discourse commitment-based framework for recognizing textual entailment. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 171-176, Prague, June 2007. Association for Computational Linguistics.
[45] J. Hirschberg and D. Litman. Empirical studies on the disambiguation of cue phrases. Computational Linguistics, 19(3):501-530, September 1993.
[46] J. R. Hobbs, M. E. Stickel, D. E. Appelt, and P. Martin. Interpretation as abduction. Artificial Intelligence, 63(1-2):69-142, October 1993.
[47] A. Iftene and A. Balahur-Dobrescu. Hypothesis transformation and semantic variability rules used in recognizing textual entailment. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 125-130, Prague, June 2007. Association for Computational Linguistics.
[48] V. Jijkoun and M. de Rijke. Recognizing textual entailment using lexical similarity. In Proceedings of the PASCAL Challenge Workshop on Recognising Textual Entailment, pages 73-76, 2005.
[49] H. Jing, N. Kambhatla, and S. Roukos. Extracting social networks and biographical facts from conversational speech transcripts. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 1040-1047, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
[50] P. Kingsbury and M. Palmer. From TreeBank to PropBank. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC-2002), Las Palmas, Canary Islands, Spain, 2002.
[51] K. Kipper, H. T. Dang, and M. Palmer. Class-based construction of a verb lexicon. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 691-696. AAAI Press / The MIT Press, 2000.
[52] D. Klein and C. D. Manning. Accurate unlexicalized parsing. In ACL '03: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 423-430, Morristown, NJ, USA, 2003. Association for Computational Linguistics.
[53] S. le Cessie and J. van Houwelingen. Ridge estimators in logistic regression. Applied Statistics, 41(1):191-201, 1992.
[54] D. Lin. An information-theoretic definition of similarity. In ICML '98: Proceedings of the Fifteenth International Conference on Machine Learning, pages 296-304, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.
[55] D. Lin and P. Pantel. DIRT - discovery of inference rules from text. In Knowledge Discovery and Data Mining, pages 323-328, 2001.
[56] Q. Lu and L. Getoor. Link-based classification. In Proceedings of the Twentieth International Conference on Machine Learning, Washington, DC, August 2003.
[57] R. K. S. Macaulay. Talk that Counts: Age, Gender, and Social Class Differences in Discourse. Oxford University Press US, 2005.
[58] B. MacCartney and C. D. Manning. Modeling semantic containment and exclusion in natural language inference. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 521-528, Manchester, UK, August 2008. Coling 2008 Organizing Committee.
[59] B. MacCartney, T. Grenager, M.-C. de Marneffe, D. Cer, and C. D. Manning. Learning to recognize features of valid textual entailments. In Proceedings of HLT-NAACL, 2006.
[60] B. MacCartney, M. Galley, and C. D. Manning. A phrase-based alignment model for natural language inference. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 802-811, Honolulu, Hawaii, October 2008. Association for Computational Linguistics.
[61] M. P. Marcus, B. Santorini, M. A. Marcinkiewicz, and A. Taylor. Treebank-3. Linguistic Data Consortium, Philadelphia, 1999.
[62] S. Maskey and J. Hirschberg. Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization. In INTERSPEECH-2005, pages 621-624, 2005.
[63] D. McCarthy. Word sense disambiguation: An overview. Language and Linguistics Compass, 3(2):537-558, March 2009.
[64] G. A. Miller. WordNet: a lexical database for English. Commun. ACM, 38(11):39-41, 1995.
[65] D. Moldovan, C. Clark, S. Harabagiu, and S. Maiorano. COGEX: a logic prover for question answering. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 87-93, Morristown, NJ, USA, 2003. Association for Computational Linguistics.
[66] G. Murray and G. Carenini. Summarizing spoken and written conversations. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 773-782, Honolulu, Hawaii, October 2008. Association for Computational Linguistics.
[67] G. Murray, S. Renals, and J. Carletta. Extractive summarization of meeting recordings. In Proceedings of the 9th European Conference on Speech Communication and Technology, pages 593-596, 2005.
[68] R. Navigli. Word sense disambiguation: A survey. ACM Comput. Surv., 41(2):1-69, 2009.
[69] J. Neville and D. Jensen. Iterative classification in relational data. In AAAI Workshop on Learning Statistical Models from Relational Data, 2000.
[70] S. Oviatt. Predicting spoken disfluencies during human-computer interaction. Computer Speech and Language, 9(1):19-35, 1995.
[71] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318, Morristown, NJ, USA, 2002. Association for Computational Linguistics.
[72] M. J. Pickering and S. Garrod. Toward a mechanistic psychology of dialogue. Behavioral and Brain Sciences, 27(2):169-190, 2004.
[73] A. Pomerantz. Agreeing and disagreeing with assessments: Some features of preferred/dispreferred turn shapes. In J. M. Atkinson and J. C. Heritage, editors, Structures of Social Action, pages 57-101. 1984.
[74] S. S. Pradhan, W. Ward, and J. H. Martin. Towards robust semantic role labeling. Computational Linguistics, 34(2):289-310, June 2008.
[75] L. Rabiner and B. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4-16, January 1986.
[76] R. Raina, A. Y. Ng, and C. D. Manning. Robust textual inference via learning and abductive reasoning. In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI), 2005.
[77] D. Reitter and J. D. Moore. Predicting success in dialogue. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 808-815, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
[78] S. Sekine. Automatic paraphrase discovery based on context and keywords between NE pairs. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), pages 80-87, 2005.
[79] R. Snow, B. O'Connor, D. Jurafsky, and A. Ng. Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 254-263, Honolulu, Hawaii, October 2008. Association for Computational Linguistics.
[80] S. Somasundaran, J. Ruppenhofer, and J. Wiebe. Detecting arguing and sentiment in meetings. Antwerp, September 2007.
[81] S. Somasundaran, J. Ruppenhofer, and J. Wiebe. Discourse level opinion relations: An annotation study. In Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, pages 129-137, Columbus, Ohio, June 2008. Association for Computational Linguistics.
[82] S. Somasundaran, J. Wiebe, and J. Ruppenhofer. Discourse level opinion interpretation. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 801-808, Manchester, UK, August 2008. Coling 2008 Organizing Committee.
[83] S. Somasundaran, G. Namata, J. Wiebe, and L. Getoor. Supervised and unsupervised methods in employing discourse relations for improving opinion polarity classification. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 170-179, Singapore, August 2009. Association for Computational Linguistics.
[84] A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. V. Ess-Dykema, and M. Meteer. Dialog act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339-373, 2000.
[85] I. Szpektor and I. Dagan. Learning entailment rules for unary templates. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 849-856, Manchester, UK, August 2008. Coling 2008 Organizing Committee.
[86] I. Szpektor, H. Tanev, I. Dagan, and B. Coppola. Scaling web-based acquisition of entailment relations. In D. Lin and D. Wu, editors, Proceedings of EMNLP 2004, pages 41-48, Barcelona, Spain, July 2004. Association for Computational Linguistics.
[87] M. Tatu and D. Moldovan. A semantic approach to recognizing textual entailment. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 371-378, Vancouver, British Columbia, Canada, October 2005. Association for Computational Linguistics.
[88] M. Tatu and D. Moldovan. COGEX at RTE 3. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 22-27, Prague, June 2007. Association for Computational Linguistics.
[89] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260-269, April 1967.
[90] R. Wang and G. Neumann. An accuracy-oriented divide-and-conquer strategy for recognizing textual entailment. In Proceedings of the First Text Analysis Conference (TAC 2008), Gaithersburg, Maryland, USA, 2008.
[91] T. Wilson and J. Wiebe. Annotating attributions and private states. In Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky, pages 53-60, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.
[92] C. Zhang and J. Chai. What do we know about conversation participants: Experiments on conversation entailment. In Proceedings of the SIGDIAL 2009 Conference, pages 206-215, 2009.
[93] C. Zhang and J. Chai. An investigation of semantic representation in conversation entailment. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 2010.