... . 0.6. “111““. h; .. : _ 0 . .... . . . .. .....2. 0.0.0... 00.. .0... .03. ‘5. +7... 014:...5-4ffl 1.05.9250 cfii 26.00%: .. .1. ...- . 0 0 3.6.6.. .0 ... . .. - 8. .00... ... .. : .... . . .0. 0. . . .. 0 . ._ -. ... ... ..- .... .... .0 J. ... . .—...6.. ..0. .mef...i..Mu.'t.000m. can»... . 0 6 .0. 01. . 0‘0. 6.01:: a. a . 0 0 0 .0. ... Q 0. .0.-I: . .00 . 00. .. . H. 0 . ..- .00 .0000. '0 u. 0 .0 .. 0 ..0. ...00. .0. 0 )0 9.0 ...-0. . . ...: . 0. 211.1.- ..6 . ...? .... .. .0 0.. . 060 0.. . 1. 00 0 0 . - ... 00...... ... mung? 3.11.03”. M0 ..0. 1.304:- 00...... ... .0. £0... ”36......30. .. ...» ...." .06 ..0. .0....0. .u... ...-(M. hum-..- .. an." .0.............ur.s... ..0... ...-....6-L L... ..0 . o ... 0.2.3:...1 . J 2.2.18.2“... . ..0... . .. .0.-.1101"... . . ....... . .... . 0. 6 .- .... .....9 : fit.‘:«.h..§.1..t...3 ......0.. 0...... 0.; .. . .....1u...0 02% “1.0.0:... .0» 6.0651,. 3... ..1.6 306 ......0‘. ......6 13...... . .... -0... 0 0 . 0... .. 3.1 o. . ...: .... ...: .. ... 1...... .l...’ .... 1.0.2.696! ..0 . .....2 ... 0.. .. ... .. (2210”. 005.15.12.31: 0.9.. . ..0.- 0341.. . . $00020. .. . . u. 00...? 9.0.0.“? it” . 600 - . .J. .001...-0.-O0.0‘.0 . . ...: 0.0.. 0. 0... .903... ...-.20 ..9. .. 6 . 2 ...... ... . . .... . . .36». ... ..0 . .0... . 0.1!. ... .00 00 0- 00.)0000»l’-. 0-..0. 1....,..'.r0.00‘03.0 13.009??? 0.3 . I'm-.00 .50....“ w... N.0u7.l0u...l I. $.00 000003.016n9 .4 .33.???” 3.3.0.600. 00.0.0 ....0 . . .60 006 .. 60.05”»...060 . 0 :30... 0 00 00.0 . . .6. .- .. 0... 300.006.... ... 000 . .0 .06.. . ; . . .... ...... .P 0.0 60. . . 0. L 0.0.... 0-. . . 0. 0-. ...-00 ..0 0. 0 0. ...0‘360100 0‘0. 0 . . 1.1:... 3......222121 0030.09.930h...0|..0.10.6.:.0.600..:.$-.0 0 . p.10. 6! ... . . 00 . ...013 80.6.1... .. 01.03.01 . 6o ..0. . . . .6: ...: ...... s . .6: -6 .. 6 .. . .. . . c. 20... 1 00.2.2. . ... . .1.0..|..0- 01653th5...«”0\.§..00. .n. 0 0ng0 _ 0a.. . .01 .l' 0.1.0. 9.0.0. O“! 0. . O0. 0 . 0‘. .0001. .0000 0.0060 ...-0 60... .000 .1... . .0... ..r0- 16-. ..0. .0 00.. Q 900 .00.. . . . 6 00 . . . .. . 1.000... .6603. . . .. 0. 6- .10 ... .0 0.. .30 s .0.-1.. 0 .0. ..0 . 00. 1. .‘0060 060. ..’.006 A .0) P . 0... a. ... 0 .10 1.4.0.0001 ..0. 0.... J V (.0... - - . a .0.. ...".100...0... 01... . . L $0.11.... ..0 0. . 6:0 . .0.-.19. . . . 1344-10.. 1. ... .o.0.. . ...... I ...... 1... . 0.....0. 6. . . . ..0. . 0000.: _ .. . . ....u “0.8.0.3515. ...). nut“ .03.. . lOu‘JoS...n0Umk£ . 00 ..l 169 T300316. . . . ...3 .010: . . ,. ... .00 .26.“..3 ... 0. .. .».. . . ..0 130 000....0. 000.. ..0. ..0: ...: .... . .. .. ..0. .9 . 1...: 601. 0.00““ .0 .u-07 .1} - 0. 9 0. 00 .6 00.. . 0 . . . . ... . ... 6 Q- . . . 9 100.0 n .a . . .00 .....0 . . -. . . .0 ‘2 . 0.... . prawn. L 90v“- -1010 a u .. n. ”“1003! }u“.h’v"'.’. 9'0 .0.-HO.CW06600“W0-t1.h.-3 0 < .. 5. Joann-l "Nnuho... 000100. u ..u0h...’ . . ..‘0 .n . 0.004 Q, 0 0.. .uuwflotu. - 6 0H . .nshv. ('9 0 n 0. . n » 0 .. v 0. Ldt0. .. 01.1...“ Q.» “mm: .0 .é—‘Uofin .0“ . .9 ..0 . . . 0 '0: ..3» o 00 09“ 00 . .0 0-,...00-1. ‘0 00 0.0 . .0. ...00 ... V 3 0 . 00. .1. 0 0000 ..0 . . .. . .. .0 6' .. fig; .. 9 l . .0 .00. .f. 0 .0 0 .00. ... . .. a 0 0. 00.0 \3. .. ’ 0. 0. .6 .30. 0 .0. 0 00. 0 ..0 00”“ 10%» Q. 0. . ? um... .h 0‘. 0...?(9. 9.01.9.0 {91.043". . ..I 60 0 0.3.650."- 000000” m.”- 0£éwth09h0006r060 6115013. 1.30. “000000.. .16 "HR-40.00... ..0. . . . 30.0.0 .m.....~ 0 :00. 1.01.A0:... 0, .3 0 .00.? .0 .. . ”.... ......1- . .00. .. 0 "01.. ..u ”.0. 000.810.00. . .1. O .00. 9.30.? J 0..” ..0... .90 0 .. 0 0. 30 .0 .Q0c. . 0.00000 0060...... 00.000 000666... 0. 0 I . 6 . 7 0. - . . . 0. . .13.. .. .0 n - .0. 0.0. .. ...0.. -. . 1... .1 . t o .0.. ..0 06‘ L0. 0 . . 6.10%.” 0. a...” 6... .01.. . W100“. 6a.... 003.00... =1 0‘. «0000.. 0:21:63 .1..00| 0.010"- 0 0 $3904. "u. 009v.0.0..0wn.... . I. 6W00.0u. ”-000 ”6.0”... ..0-"00 ..0” ”.0.-00.0 r... :9. 0.40.00. «row-67.. .0350. 0 0 . ..0“ x .6 . . . r. . a . 1...”. ..0. 0-.60... . o 3..... 000v.1.50»u0.... 00”! 00.1.6.0 .6 ‘QEXQQS . .vruuflr: '03. a. 0..3.0 an 9-0 {0,0159 .W’ 0- ...-00 0 ‘3 60600.1 . 00 0.00,.- .0... 600000.011}. ’3 .0.... 0. ..0 0 ...0. .. .... . . .0 -.- .0. o .. ... ...0 r ... . ..a n 00.0... 0... 0... 9:00-03 50-"...6| .00 0‘: 001.0. ...-0.0 .Q. .6: 0... Al|0Q60 0.0. .0‘..‘0 . . ... 91.0066 . ~00". n0. 6.6 30.9.0.4 | 0. . .0. . ..0 . ..0. .00 .u . . ... v .10 . 00...... .00.? ..0. .- . ...... 0 . 0 ..60 0“. 09 31.1.0.0..- 0130"” fiJQ-i... 0..“ ....0000 ..u. 9.. 4.0. L 0‘”! , . 0. . 0. 30000000.- ...: .- 80.0 E a... . . I. ... . ...! ...L.........180-. .0 0. .. 0... .0. 56.000.03.001. ‘06 . . -- .JW...” 00 0.. .10.... .0. .0011. . 0..9... _. 6.. ..1. .1-1.0.0...|.00-...0.. . . . .. ...-6.3.1. .. . . 60 19.619.30.00 69000- .10- I . 40 N0 600. .01100 9'..Q.,0.0..0.0 ...0 . . . 1-0.. .. ...-0.00.600 0 0.. ..061001. 0. . 00... .l 0608.. . 3.“.JIQI .0...j-Q _ “I. .1. ..0. 0 ..0... :0; 300.0 _ 0:: 0:100”; $0. 0-. 0 0 90.9.5." ..0 .011... 00. 00 ..0» .6. .0 .0 ..0} .000“! 9.. 0*“ . .100 16’ 6. l . n . . I o . . . 0 3.0.1.0.... 4 ...oI-nulfluuqut... ...-3003090.! 0 .060un..fl .. .... . . . Lamontnaounu.“ gran“... 0510.313“ . . r. “...—)0}! . 0V.0|0.004...6.0. 91:20» .-...t...... v .. 6 - o 09: .. L 31:91 0.... ... . 00.00 000.510 l.“1u00.n01.o...09”n4~.nt. .0 0- I. 00‘ . 6.0-0 .0 .. . . .06 I. 0 000 , ... . .0 0 .0. 0. 5.0 600.01.01.09 0.0.01. ..6. .\0 .11... ...?!“H...00..00..0 .0. .116331000 .0 .0.‘ ... 0.. 0.6.6 .1. . . a. 0.. .0.; 006.0 .. .0.. ..0. ..- 9.3.1. 6 ~1.......»Go .h....6 . 9.1.00. 1.! .0 00.. .0 Q o .0. . ..0. - .0....01 .. ... .0 ..h’ .0116“ .0.... _. ...6 «p.14 .u .aur..rr?vfi 0.25.10.10.31”... 04".". .0. 7.0.0. .. 1...! 3.10.0155. . J- 01. “MN-0 .41 ...“.1 . 1.9.0. 00 6.. r0000 . . . 00. . 0600 . 201‘ 0|. ..0 0. 10.0.0; 0 0. 0 .0 . '.“H’ I? '6‘!’.\ angst. 0.!”0 91.9-0HLU : 0"...O0QH 0‘0 ...”.60'0. 050%.... ...“...‘0030.’ 0 00 ..M.‘060 I-looay’nlf 6. I JWQNTQ ..0 0 .. .- .6 . .001 - . - 00000.. 0-0 . I ...0 .4 ....NJ. ”Viv!“ - 31. ... ..0-00 W309»... $0$10600PJ 0.0.1 2 . -. ... ...... .1 1.6-.an ..0u6.u . . . . ”0.0 . . ..ql .u ... “Burl... 1.5 Nfi 3%. . 68d... (“0”..an . , ..1 00 . I ..3 6.. u. :30...) ...... 1.0.. F1 .10...- .... .901... ...00 106.02.. -3005. . . . .. . . . . . . ..... . 0.6.... . .. :25 .0 ...“. ..6060 0.00 . J .36....r32011. 1.0- .411': 01037.. 9.0.1.016 . 8. 00.. 1 ...6. 0...}...— -. :0.- 0 0 .0009. . . . . . . : . , . . . . 5 . 16...... ......616 .. ..0. .0. 1.3.... 0:00.06— 0... .... "Rh; 3.093300% . ......” 506:...21PWf-‘l £051.06. -. . 11.0w! 0 60... .... 3.330.300.3033. . .01....3. . 4.. . ... .6... 3.33.0 .18“. 16...: . . . ... ...: .. I. . ...0 0.00..1. 1%....06 0.... ..gvcflmmh‘. #340. .- 000.10?- ....JJ 14- 0.0.0640... .Iq. 0m 0.6.6... "6.. 0. :09 ”6. A700... ..0. .3970. . .3..- .l . ... I. ... . 0 . ..0 ... ..I. “...! J . .- 1r... . 00:.“ .0.-...u0 0. i .....0. .34 . .0. 0 0. .0 a is. . . ...,ri Q V000319 .007 0. ‘00. o . 0 . .0000 .0) . . 00. 0 . . :6 00.090 .0 .0 .0: 00.0010. “all 0. :1 Li ... 0 .took’0 .21.. .0.-... ..0! b-TJJS. ... .n. 000.0900. .0 0. .-6.... 00,00 ... .06 0.0. .00 . . . . 1.8.0.0-...26.0£Hvbiworuto600101vu~9 . ..hvo. 010.. . h. 0 0.0.". .....u 060. . 0 0 a. 0.0.3 ... 0. 0.! . .610 .6 6.00. 3. ..0 ..0-... '0... 6.1.. . . 30 .- . ..1. . ..0. .00... ...0 00190900360. 0313.31.65 .l.. 0000.09.60.00.“ .....l. 0- 0. 603.0 ...... .- 6-000 .0 0 ..0-.109 06.. . . .. . . . .. . . . . . . .. . . ..0.0... 0 ...-.0 6:00.702. .1. .0. ..0 0 8.0.10.0) . 0.0.0.3..000. .0. a. 0669.00.06.00 0 0“: 30.3.0-3“ 0...}; 0 06.300.02.00. 300000.006 9:0 . .. . . . . . . . . . . . .. .. .0 . .6 .0117. 0.!2 . .0.. . . .0 .0.. 6. 6- .. 0.20. 16.0. 0'. 0.00.1813 30.01-01.11 0... 0.0. 109.14. 1......36 6300.31.31. .0.-.60....1... -.0 . . . _ . .0. . 01169.... ...... 06)}?! ...... . .10 .5006. .. n .00.)...0..0.... 0%...21000L2 0.- 01 0.. . .6. .080 .. 07.00.. . 91351.109‘.. 0- 7“ ... . 06.: {.0.-.1321... .....z... . .....- -.. .. 30.. . . 0 0.. 1-. 0.. an... 009$. .... 1.00.00.63.00 0- 0. “-0.3; 360.014.0003"... .10. 01.01.001.009... v: ...006.|...I1000... ...-..0: 30.0.0... ..3-...; . .-|. . ..6..|........ ...3...) .. 0 .1 . ..0. '0 .0 0.00.03. P0 (0060“ 0 90.00 . 1 do .000, 000 3.0.. 0.0; - P6090 ..0» 0". 0000 .0, .0103: . . . . . . .0’ ... .60“ I 8.0. 0.... ...0-r'l10 30.333 ..0-.00. . .616”... .... .1000300: .23.. ..000 «0. .31 0|.92.....0.0121..6)a’0| ... .. or 2.1... o. .6 .32: .u ...r’..?. . 9.26....) .6: . .8... . . . . , _ .0.... . . 40.. f .. 0.00.1000. 0.0.... '66.” 060 5.0.0.0.... . .....0. .- 0”00 .l 1.0 «00 ...-00.0. 0 06-30-10. 01... 081.001 ’19. 6 0. 0- 06 0 . ..0. .. . ya. .0000 6 56104.0... .. . 0 o . .0. .. ..00 . 00. . . . . .. . .. .. ... c . 60.0... f .0.-..0 .0.- . .413“). ...-6.81.... $05.6.”- id"! 00.0...06... 0. 09.5.... _ .... . 1090.15.01.91. -0.. ....E- .1. .... 0...... .... .. 2...... 9:92.33: 0...)... z . . . . . 2.6.... :5. . I 1.0. “000. 1.0 0910.0. .09” 0 $00. MVP”. 0. .0.. ..00 .. .-.. . ..3- 0. 066 6.. 0.0 0 0.- . .200. 0603.60.60.35», ‘. 0. . 0.6 . . . ~30» - . .... . . . .90... . . . . . . . . .. . .. ...-.0. .. ..A . o 1 ..0. . 0. 6 0 .9- ... 0- . , . 00.. - J... I . .. .. . 0- 3.. . .. g 10 00“ gr. 0 03.num 3..-0%“. 9‘ ”“00 00? 1 ‘0“‘050 COLXNW‘. .. .99900d90‘nh00‘03‘90u 0.00””. «run-“0.00116iflorl . 0 0.000.- 66.’.. n ...}...H‘u-r“. .0'6QW“ .00.“- “0000 .0 1&2“. .6006 110.50.10.00- Q...‘ .29“. Q. O.H..-...wnu. . 0 60.600 .0 . .0. . 9.00"“ 0. . 9 ...-000. 6’ .89 6 0. 0.." .0060. ..0-00 . . . . 1 .6. 0 0. :- v 0- . is 0000., .. .000 -. 0. .. . 0 . ..0. . .l . 0. A . ....o. - 19110.00}. 4001.. 0 ..0 ”009.60 90.0“" - ~010|0000000000090ufl C. 0 ‘00... 0.1 0 90 01 ...‘lu-u ”00.10“". Vanni”. .0000..0.0.¢ku0 .00“ ....»09 o «.0200 0' .0. 0-“ 0000. ..30...—n5.0.6 ...0 . .. ..0-1...... .0 n? 60.0.. .3. .00 n . .. ..0... .0 o. ,- a. .3000. 3.5.06.3»... 1.891100». . 3.3.10.0 .0.... .... 91.0.0. .6 :6... 6 66 .... 0.6 . .. .0.-0.. . ...: .0 31.13}... . . . . .... 0... .20.? ..0 . .3 Z .0810... 460..9"1”1’10k6{w.1100,-0‘0. ...-1 50,300.. ...|...000 ...0 10.. 3.. 001.0 .0 b.3026...- 30 . ..0- .10....03 6 .0 .0. o 0 0 0‘6 130.... . .0. .0 . 0. . 6 3A.. .0.. ... 0. I00 .3006. 01» .1 ‘0. .09 00.00. 00.1.... 66. .0..6 0.0001... ...-0.. ..l..:.- ...-.0060. 6.. ...0. 00. .... ...00I0. . 6000...»? 0... . r86..100..:..0..l 0 ... . , . . . 1000060003.. «0100» 6. 0.0.... .6. «.000 00. .. .6. 03.0 ...... 0.... )0 00 916.00.. . - 00’ . 0.00 0.2.02 20......0. ..00 .0600- ...-00.. . . . . . 6.5- .0 :9). - 0' . .0 £00.00" .0 .30.! 06001... 100000 016000.00 30.6. 066 '00- £7. . b . 0-1 [Qt-00.0. 0.- . . . . . . 0 - 6.01.00 .06 0.1.. 1.“ 000- 9 0000.000.’ 0‘0“. .0 9,063.. 0. 00. .. ...3-06.600 ...-0.1.1. .0.... 06.0- .‘v .230 .- 90,1 0 0.. ‘09. o. 00 .00 .0.. 010000.Q0T-. ... 08.’00~. .10.. ...-“’0 o .966..-.0.| .00 .90 6 p . 0. .0. 9 ..0..- ..6... 00.1.. .. .9105. 0.. ...l0.&...la«.|~ . . .0030. .0 .0 1. :0,- ’0’. o - . . . ..013601. 0.0-00. - ‘10. p. 6.076. 0 .9. 09.56.... 102.0321“ . . $3001.10. .0.. 00-8. . . . . . - .. ... ..n . - ... 00.112.131.13 1...!- ......... 0.... 12.63-63.00! 00060 ..3-000009100. .. 6.)...‘0500 0 . . . . .. 0.0.16... ..0.-. . .0160}. ..0! I .... :- 3.1.2219 01.6.0010. .00 900.06....11-90 00 ... l0 . u. 1.06010060020: .3. .1400. 0.0.. 0.00.30.19.01... ... ...-0 00 00!»??- . . . . . . . . ...6 l. . .0 ...3 00.1.06. .0. 0....0... ..0-0.. ...-$041.00. 0 . .0.-06‘... . ...... .0.... 00.0 600.0 0 66.70. 0000.0. .000 .L 16.0.... {30000.\~.0I.6.000.. ’0... 0 . . .0. (...-..0. 6) ..0-.0. 0 6:...00-6.’ 0... 0 0 ... . ..0. . 7 1.0.1.0.... . . .. 6....‘60. 0 6..) I ..0 .41.: 0.6 . . .0. 1.9.0. 3.0. Z 09.6.0.0."n «......6-.006 .30.. . . . .16., 00.. 06 .0.. .0603... 0-1.. 003910... .01 9.0600 030' ...] 03.7.0.0 01.9363 n u . w . ...“ M50698”. .100! m0...a.».61.01..6.§0.. .. 0. 9100.100“? . 0.90.... ,0 .0. ._ . .. . . . .. .. 0. 0.....00. . 60.0. ..0. 0. 1.0.0... ..00..... 00.1000.- 0...‘ ...1. 09.0.5 M0..‘0uvlnm3.t.f0~ 03$ 0. .0 {..0 a 00.... .2006. ..n.“ . ..1... 610‘: 9.1.0.216. 09.1009. . 10.90.. - ..3-.... 0. .0. . ...-.6 . . . . . . . . . .. . . (..1...06.000.. ... .20.... 2...!1‘f. 6- . 0011.03.05 ...0 03.0 9.151.063.1061: I: K v 1. r0: -9.). 000.... .5} 0.00.0 ..0 .110... .... ...3. .30.?!0500053. . .0....30 ... .. . t , . . . .. . . . . . . . . .000 . a6. . .1000]... 0.. . .0 . 0.9.8.5160. 613000.. 5?. C .033 '0 _,- .6: .0.“..46006100-0. .0....0....J 0 6v... ..1-u0.-.~..000..... 0.0.3..“ .l..0.0.00. .000. 30100.0...)17199 . . . ..0-3... .1. . . : . . .. ..0. .0.. .00... 06...? . 0:00.19 v...-00.-....;|| y. . a; O. 1190...... o. 0 300.10.;0u 06 v .. . 0 1%‘96 0.. 1000.00.00 01 0.0.0 1-00 00000... . .190...- . 6. ...-.... {.00. 0|. .0 0|. . 0.. .60.. 0 0- 00 0 0.1.6.060. . . . - . .. . . .0 . .3 0.00 00 ..0 00- 0-51 6 0.0 0.000 0.00.. 010.030.31.0013 9' . 16......- $3.71... .......fi.110. .00. Wm.”6060..0L .101239,-u00|0.......0_0.1..:. 360.0... 0. . ...... 6 600.103... 302.010.220. . .396 . 9... . .. . .. 2.0.. . .0. ..0-21......- . . .. .0...- .u! 0 0.. 1. . ......1...’ .... ... ..0... .0. . . ..0: . 0. .0. 1... ... ..b. .00.. .1061. 00.. 00.01.. ...: I 123': 06.0.0 _ . .0 6.0.0.; .0 .80.. 00......0fi .. .71 .0.. .010. ..- .:L. .1 30:010.) .06 . ... ...... 0. .0... .... 2.1.3.7322... .. .-.0 .. o . 6.. .. . ....090031005 0 0.0- 011... . ...1 o . . .3: . .0- . ...3... . . _ .. . ...... .- .. . .- . .... .0.-6...! ...... .6 .00. 3.05 ... 009.. . . 0. 003.007.030.1303rqé .l. 00:020 0&0 “90.1.70300’0310000‘r'. .30....00003039110100 . 06... 6’..60.y000.- 00169-000000. .21.}; 0.0.1 . 0. .... . ...\.-.000. 0 3000. r: ..006 .46000000‘... . . .. .. ... .0-0. .0.. u . .00. ..0. 01600.... 0110.10... D 0.1.0 00- 0. 10.00, ”nu-010.00,: 0.0 . 00”. 00 .6010.o1..0 0 00041. 0000.06.00 .. .0 I. 0. 60009000 - |00 0 00.00..- 0 . 050-6. 0. 30.0.0 ..0' - .I. 60. 06 . .0? 00‘ 0.9. . .00. 0. 10-00-0091 00. 6|. 0.. 0.. 0.. . 0 :01 66. .... - ..0 00 .00 6. o. 0 . u‘ 01 . 300.... t 0. I. .00 00.0. ..I- . ...).10. 1,100.305’ .00 0 0 “N. 61.01.31.110... 11. .-.0.. {M030 1... 3.1.3.3....” .1... 0.1.0.0- ...031} . 13 ..0. 02:01.0. 0.0.0.... ......0... 2... ..0-1.1.9.... 0.20 1.35.8.1 ... ...3. 0.. 31.9.1130... 6...: 3.... . ..0: .00....» ..0- - 0303.0 . . . . .. . . . .0. 52.131.00.2100.‘ 06. 610.0...1660 ..0. «0% . L0. .036.- .09 . 2... 0,. . ... 0.. - ... 0. {30.0. 1.60.. .09. 1. 9.0. 0". .101 .0. ...1-0...0..:0¢ .0-... 30.0.3 ...-.0... 0 300.10.. 0... ...3 .. 0610.626 .0.. 60‘ .3611 . 3.6.. .00. ... . .. .... P . . . . . . . . . .0. . . 00.3.0110.” Jug-«[460... .11 .... .- 0.0.3.0310 .0. ... ...:H‘ 0 0.0.. ...!- - .5136..06400 000.... 0 05 o .0192... 00’ 0 .....0.....0 . .. “V ...-1.1.6.6000 ....L 0.0....» 6.. .0 0.. 60.00. 0.. .. . 90......0. ......1. 1. . .10 0.01. 0 6......1-0 -0.0..0 . ..3. .... 0.0.30 0 ..0 ..0.» c 10.00. 0.0". 0 ..‘o. .16.!‘310 90.000.00.100. I19... 0. ..0 .60 0.. . .0. .35.. 10000.6. . "1.... 0...} .000 ..0-0 .16." 0 ..0-0002.330 £060 .61. .. 021.3350, 0 0600 ..0. .... I. . !30.|-.»I.0| 0. . ... 66.10.400.100... ...-60.0 0.6 :11...- - .. . .. 6.. . 0.00. 0000 .. 0 09.. ‘00-: 600 90!.0: ..0-«000000.. .00 0 190.010“ .0. 9.....000. - s 0‘ 000 90.3.5309. .. ... . v k. . 061.65.": .. 0 01.00.00.109 .009 0 ..0. . ...-..0 .. ..0 . 0.... .. 6. 0. : 0. . -. 6-. 0. .. 0 ..0... . .0.- ...... ..0 0.. . . ..0. 09...... ... .. ..0 . 1169.....11850. 00'... -.0. 0 .9 00. 000. 00.‘00 ... Hf“ 036‘.“ 0040000- |0 0'. #0.; ”#106 “.6... .0.”- .."JI 0.00 1-00. 0.0-“. '0 0.0001 0.. 0600".- 0§:00 L. 0.1,... .0 (00 N0 ..0”. Q0“. ”6-4., . n .. . 0-9Qm0l0 0 0 .000.‘-.or.. .. .0 . .9090”: .... . . . 0 ...00900.10 ..‘0 0 ....0 .. \- .0KLQ 0 ~06 ..9”...Q0..Q.. 000.60 9.0.. ..0. 0? 010,140,010... 000.«...0...-000..01=.’..01 .050... ... .0 0.3.... H 00 ..00...‘.000.o.0. 00000.10... ...-:7 .Q6 .‘h-on'...’ .. 6(1. 3.0.0010 0-00 0. 1.0.0.- .0... 000. 0.9 .. .0 3:10.. .00 .. ..0: 0. 0.311.200 ...-...). 9.31.20 .00..» . . ..0 00.00....-.0...0.... .0. .00.? .I... . 0. .20.. ....I..~..0. . .00. . 0. .0.: -001 .00 .00 0...” 00000. .000. ..u. 0.00 .....0“. . .1. .56. .600. .0. 0 .0 . - ., 0.0 -. 10600: 6.. ..601.r 66.000... . ....0. 3 .10- . .1 0.000. ..0-1.. .3. 60 . .00 ...).u’o .... 2. h 6.. . . I. .60“...0£. I 0 0.0.0 0‘.’-|-.1.0.0 w 006 7 , 1-00.. 00 0 .0 0 .00.. . .1 .00.. ..0-0. .10.... 6.. 0 10‘ 0... . o. . I. .60 0000 0 0.00.2000 0.6.. 9.000, 00. . 0 . 9. ..6 . 0. .6 9.000.. .1700 0.000 .. 00000. . . .0 . 0. ..0. 0' ._ 006 ... 0.0.6 ..7 l . . i... .00 000 .- .NI. ..0 .05 ..0. ....0. . any-0:. . .00..r0..|6.......0luu0.0.13.09 .000 L36 ... «0. . . . .0- . 00.007...“ .300. 60“.. .91....0 ....n 0.. $.60. . on 6.0.0.??? 0 . .0 u .. .... 0 .00 0 .0 0.0.1.... . 0.13.50... ... .- :30.- 210080. .0.-060. n. 00:“..0 ... .... 0. . . ...: .. .wuuv... .. 0.]: 200.0. .0 ”0-4” 0.60.13.60.1-9 I... . . .. ...! .0 1:3! .. . ... ...6 n ..0 0.0.1.0..0...-..0..0-.. . . .... 0 ...-I. .. 35600.13... .6-1........\- 0. 00000 0.0““: ... . 0.6. 30 . - .. .0.: 3..... ...-.0360. . .000 .2‘ .9 .. .0 03.. ... . 0 0- 000001. 0 ...00. 0. . . 3‘70 6001.000 0.... ..0-0.. ..0 .00 00 500.30.. 0-. ..3. n . o ...0 . .631... . 00. .. :40 .010 0 001 0. 16.1.76 or ...0 .0 ..0 6.... 70‘. ... .....0... 1.06 5| . «daft-0.60... .t. 000060........J... 60.000. . 0 .. ..0-0.6.5.1300on ... .0. .0... 3...- .... 6.0.1.90 . 30.9.5.2...20. 0.. 0 0.0.0.6. 3.020. ...3, 3.9.0.2192. .. - ....0 ... 6.... . 1.1. 0 .03“ l: 2 9.0.6.0. . . 0636;196:010. .0”. 3..-...: . L. 1 6.. .1... ....S. 0...! 0 . .... 0.9.23.1. . _ .0... ... ... 06.0.0.1! . .. .- . ... 5.0.5 .. 3631... ...-0.0.0:»... 1.0 .. - ..0..I.-..0 .81....£.-.. . 0.. 0. . 3... xs......7...l¢... ..0- ....-. .00 ... ... 2...... .. 3.... .. ...v......0.0...6. 0. 0.0 .2211. . Ghana; . ..0 ..I .0... L ..0 5-0.. 0... .60. 0. 1. 0.203.... .....6. . ..3-0 ... 9.2 “g. .0. ...... 3.3.01.3! 0|. .. .. .. 30.2.1... 7.66.0.1 . . . ... .. ..6. .6.- .. 6 . . 6.20 . .6- . . 0H 0 .-:0.rl 4000- 40. 00.. 0.0. :0 .0 060.33.. .Q. ."9 ..0 .01.... .0..“90000. .”0.0.‘.0. $00001... 000‘. ..na 6... . . 20.6 00.0. ....V 6.100002... ..0.0.. .... .0 ... ...01‘0301’09 ”000:.060000... .firiv- 010i}; 0..- .600. .3 .... 60.1.6.6. .. 0- ..0 . I03...6.0 0. 63. 0 > .... 12005:... .. 3.....0 0 . - 0 .0. . . ... . 2 .-.. .0....0..- ... .0. 0 00 . ......1. . ..0? . L. 6 .0 70.000.300.630 03.0.01... . ..0 .11 10.3.1900 .0015 . .06... 0J0...0.¢0. 2090.010... 0.. 1....»‘00. 0o «000......(000. .0000... .0 0...: 00.. .. 6.0.0.3639. 6. .....30r’...06.003..0.. "000. ..0. 0.001.000. .10.“ 1.0.6...) ..3...0...000..0-.90.-.1.11< 0. .000- . . . ...}. ... 0-9"... .. . .110... .... .0- 1.01....5 I: 0061.00.00.360flha 0.. .113”, 0.003. L030) 120“...- .. 1.0.0.. 0M3... 6 . 0. O. 6.060.135. 4.0. 7.00. 3.00000.- . . .... . 0.1.7.0 0.1.00.- .0 . . 0 0-6 . .0. 0.00. 9. . 0.60.0.2: . 46. .. ... 3... 3.1.0.. 60. . 0... ... 1|.»0000na000..0| . .0": ““0350oudtowlga.’ 0....I‘nga - . .0 J. 3000. . 6 O3, .30.! 0.6.. .0911... 900000. 0. V “00 .0. 0.0. 0 0‘0. 50.0 ...0 0 0103.3 0 . 0. .0.-00.: . .... 0 0. 60.0061... 0-0 . 0 00 0 0060 ..0. Q ..0 «1 ‘0 .00 . . . - hr .1136...” ... 6.. .- ““30, 06.0”.“1.w0..’l60- I . .. .1 0 .111”. ..u. Q... . 01...“; .91. 10‘ L06.- 00.. ..0... .00.»!06». an!” .0. In: «0.0 - 00410. . ..001000000. .1. .0 .306 . .901....._... 00 0....”0750501096-000033 g? 10010110. “-0210. . ..‘30..0 7000.21. .. ... .5. 0- 10:09.10. .... 06.0002L 9 v. 0024-, 01.0..1..-...06-. .350. .0 ..0... ..0001: 0. .. ... ....- .. .10.» | 0000-00-31.... 00:0 .. . . . . -0....906 ...3: 0 00 . .0.. 17.. 00.06. ... 6.1.6.90 . “0.00009 ...-00. s... {5.0.9.00 ha0‘0.r.1011§g 06.00. .1.- .0- .0- 4. -. "00000 1. .0... 13.00.260.01. If: .. l- . .000. l. -.. 6.1000. 5000.... 69-630 .0 . .0 . ...-000 .00 ...-.100. ....0. 001100. 0. 2.01031 106000.010... . . 0. . .0. 0.. 0. .0...0. . q 2...... . 0.0.00.0..00 .0. l . . ,. 000.200.00.02 0.40 l. 0.00 9. . 6 0.9.10. 2.3.0........0..... ..) 0.; . 2 .00.... .... .0.... . . 206.0... . 4.. 3.....09..V|-1|.1 ....0.60-. .... .0. ... ...... . ..0 12.6.68 3.0. 0 . $0.. ...0 ... 0.... 00.- .. ...! 10.1.0... 0. . . .0.. Q ...!016JQ1I 0:0- ...\ .0. .. 00. 00.00.. 30.6.00 1 - 9.00.66. v.00; 00300902001600.1300 .6 .100. .. . . , . . ... . . I 0 0 - 0.0-0.0 . . ... . ..0 6 L .00 ‘0. 50. -. . 0 16.0.0 .000- 60.. o ..0-0. 0 ...00 .. . .0. -.Q0 ..0. .0....) 0 001‘ 0.00 0 .‘6- 0060.... . . . . . . . . 100... .0-00. ...-0 .01.»... . - ...0 0 . . r... .. . ...-t... .0..r.r-..l 9.50 ..... ..0 0. ......I... . . .032... 03. . .. . . . .. . ... . ......0 6.1.0.5.... . Vii-1.3.0.1. 0.. . . .0 v 025. ..3... .|.U:.t.0.0...0‘10500........0. .000. .00. 1).. 0.01.0600.‘ .0....1... .00.]..0u0f . . . . . . .00.. . 1 o .0 26. .00 9. . . . . 1... . ..0-1. .- ... . ... 0L0 . ... .0.-.009. ..F.$0 0.36... . . .10.... ... 6.. .0 . . 0.6. .. - ...6 ...-0C4... 00.\0. . 0 20.0.- 0 Jon-.0 ...-..0.— 1.1.3.0..0.00).,6011’0‘.6...0£ 0.00.0 H3 .. .00 Q... inn“... ...3- ..I. . ..u - . 0 . . .wn‘ . 1L... I. 30“.: . ...-n0. . .00 0.0. 3.10 .60.. 0‘ “10.16.60... 60 0.0 .000». 0.30. 0 2.0.1.1.... 0. 0 00. 0 WI].0.0003..D0| .....0.0c..v.| ’00... .0 .0- . . . ...!0 .0 . . 0. ...-.1. .60 .000 0D .0000. 0.70.0000 .1.0 .0600. 0.00 v .Q. .0 - 0 . .06.... .00. 6.6.0.: 0 .0” .0... 1 t .. . I. 6.. .. . ... .. . ...... 6 0... 9 ... 026?... 0.0.» ..0unh.-..1002?1. 61.5.2.0; .JO) 003... . ..| . 1 I ..- .... . .20 .. ..6- .30 v . 0-... 1L 00.0.0...6 0. .d .J'. . . 60000 .36 0.: . 06.3.3235- ...0 6.0.0.0..‘030. . 0. \l. 0 .60. 0‘ 08.7.3... 0.1 0 10. 6.... .0. 3.100013%. . . ... 0.....60.60..00K.6 . .022... . .1236- 0.. 06 . ..0600. 10.0.6. ... 1.....0. 36.00.10.9051 6.-01.o?.60......11.-...!. . .. ...... :0 .1.... 2.0.. 30.6.0330.- .. ...6 00. -.. ...0- 0 0 ...fi- .00.... 100. ...3}. .0. 0.. 1.0. ..0 0 :0 .. 3.1.. ....04..k66.-..-... 1.3:... . . .61....- . . 6- .. 60 91.3336. 0.7060 .....- 0020-... .. ... .0... .. ,- -:- 1.00.3.0. 0 .2 00.19 .... ... ...... .0130? .10? ..3 $0.. 0.006...- .000-7. . 9Q . . .0r 3|... 0.. 06.060. .0 46000 . . .0. . .0 0.100.000. .hn.’ ...6: . ... .0. 1.00.311. 21.60.... . 300.. .00! 2.10.0- ..6.6(.0|O.... 0- ... 1 . 1.0. ..0... ‘0I . (0 .-0\ 0611000 .00.: 0.0 0 . 0- 00.01 .. 60 0.00 0; ..0. . ..0 .06... .0 .0 6 u I 66.100.107.019. I. ..0. ...‘0. ’0-0 ...3 0 0....Jo....o-.\60. 10 64 .10... .0’ o 00.6. 31.660 06. .64.. 0. .1 0. .0. - . , ... .. . . H . 20.0.52... ... .61 ... 0- .461... . a . «20.0.0.- 0. . ... 0~0-0... .6. .799... 00.0...0 . 4 .. .0060. 4.. 0. v .. ...1 .01.! ..0. ..0. o. -..3...- 0...... . . ....0. 0010.00 ..00n0; 60.1.0 3. ..0. 2|. ...L.. ..00. .0. Tu. ..0. 3..-.01.... .. t0. . 0. 00.0.00 00..- 0' 0.... - ... 0. 9.10... .036 0.....- ... .. 9.0-... ...-.... o. .. 6... 6'0... 0 .. .. M0.62.0.6... .00. 0. 6.. 0....0- 0.1.09. 3.1100....m.01002.......0-60..0..6 .... 0.001... .‘.l.0 Q-..' .0- 0 9.600066 . .0 .... ..0 .. ... .. 0 0.02.... - 1.60.0 . f . .6 .60. .... . 0. .0.-.1096 . . . 20.3.0001. :00...1..10...0-..60000 0 ..0000-0. 1‘. 0 (F6. 0 .3v.......v.1. .125... .. 2... ........|.-. .2 0 6.. :- 0 .. 0.0.!- 329-05.... . . . . .. .0.-9... 110.... 1.0.6.... 0.06: .. f- . 091.200.130.- .90. . ~. 0.. 000.. I0 . 90 609.0000 ‘0 0" . 01 0 0. ...0. - . . 0000000 .Q ..Q 0 .. . :00. . . . .... .9. ..IQ6000. I ‘0... 1 . 0- .0000 . ...0- 0.6.0.231... o. n .....0l0 .0.-.1...!0 . . ....v... . 6. 0.....- .03... .l. . . . .....1. “.0 0 ”110.302000...0 .0006}. ’1... . . -. . 00.. - 04 a .09.. 0 . 0.060} . 0-00. 0.0.... ... lv‘lldufiuso I: ...Q .10 6. 1000-0. .'?06 —.0 000‘ 0;... . . .66 0 0. .0 ...0'0 . l 11 1.01660 . . . . . ._ .. . . . . , 00.6. ..N .. .0.| . 00. 00 0-3.»..6009. .3100319 00.! 8.0.9 .00 ..3 5:60.139“? ‘8 ...... 301.000 ..0 .06.-.. . . . . . . . . . . . . . _ . . . . . . 6.. .. ..0. 0. 0 .v ...: . .. 00. 0 1. . ...3-0.9 .0.-.0... . .0.. ..0-6.0.0... 60-00.:‘000-1iofinuu 619000-1113. £91 . .2...0. ...-10.0 .60... .. . . . . . . .. .. . . . . . . . . . . . .0....8. .. ...0. 69“.}; . .03.... ...: . .10....00... ... ... ... 0. ...‘PW-A..600.:.7 VI ..1 6 6 n-r . . - .. 01. 1 19-. . . . .00 ... . . -. . . . . .. . . . . . . . . . . .. . . 01 . .00 006’ 1.00.0.9.- .. 00 0‘ .0010... .0 . 0 1.00.! ‘0...- 0 .0 .10-01. 6.0.."r II. .000! 1000...! ..3-‘0‘ ‘ . 0-. 0 .. ..0- v. 6.3.00. 51-»...- ...-0...).-. ..0 . 0! 0.. 00.6. .60. . 2.! ...-0 .013 .. .12.... 3.2.20.6... .. ‘61; 0.53. 1.133.306. ...000 .. . _ .1 .|0.0.| ...600 0. . .0. ...1. ...0.'....0 . 6.15 .6. .2063... \9001003510.00%\|2.’190 thv ‘0‘... Q03 .0 3.0.0. . .1. 1-... 0.. .304 . .... - .91. 6 ..0. 066. . .... .0.. .0 0-00 . 0. 600.3 .060 1. . .. . 10.0.. ..0 .0 .... .... 0.. . 21-3-17 3... .0. . .0 .0 .0 3.0.“. 3.10. .0.. 0:. 5:0 . ’6 1. 0. L 0.1.. v ... 1.. . .00: 60130310300000". 90.100.301.00. ,.I 32.2.1906! .. . ... ....J 01......&.-0o:.i.0.|c ...-...... .00. ...! .90.. ..91335 .. ...-..0 ..0. .0 020-? .1000. . ... .. .64.. ...-...}...60 .. ..C .....1..0...0..l... .... . .00 -. . 03.6....119760... ...... 10.0 . ..0-0 00.6. .0 :00. 6.6 30......9000.0 . n 00 .: 0 0. 1.13.0- -...-.- - ..00- 0 66.. n . Q ..0; .. 1.6.1.0.... 0.1010. 0 . 0 0... .0.. .00... .0. .0.-0L: 0. . 63.6! 0.00- 200. . .0. 9.... 0.. . ......9 000.00. .00.. .0 02...: .. .0. .0.. 0.01M.... 0. 400.. .. . ...... 00. K... .... ...0. - 00.1.6.6. 0.7 .. 01 .. A 1 .0 I .0 0.. 0. .0. 0 0 26.0 19’ ... ..0- ..00. ... .6 . a. 060 0c. 19......0 . 1... . . . .I.‘ 9.6.20.1! v. 60.1.. ..0 . . .1263 ...-96061.11... .... .... .. ... v.- ....b1.06.!0..0..|- 006. . .. . . ... 0: .1D6.. . 0.3.0.00217: - ..3. 0 ‘60... 000 0... I. ..6 0 .....6 0Q .0. 06 50.1. . . . . . . . . . . .. .. . 0 6000 ..0.- .. . . 6 60 .6... .... 6 .0. .0 . . ..0-00190000....t0 301.001.000.000 ...: . . . .. . .. . . . .. .. 0.001.013.0109. . 3.3.110... ..-.... .0.... “0.000. 000...} 0 - 0.... 33!, I 6 0 . 0 000.4. 0.000 . .0 8.0 .. -00. ..00 0 . . .... .9 :60 .0. ..00600. 00-0 .9. . . ..0 0... . 0. . .060 0.0.70.0 .. ... . . 1.0.21.0...) 30.01.13: ... ... 00 .0. .... .0. . .. 0600.0. 3.0021... ..0096‘.J.|... 0 60.. .0.-0.6.1.0300 Q1109 0. 6-. - ... ..Q-uor‘ 0 1.1. .00! ..0. .0 ...IQPD-ot. ,. . ..0... u. ..l...0. 0. .. ... - ‘0. 6.6 0000 0 - ...-.0 10.0....- .ioo‘.00.00vl£0$\‘06|1990.f008 0.00.0638”. . I . _ ...0.60.a00.0-_.1 . .l0.|06 .. -.Q. 0.00)... .390. -..0-0. 0-‘00 -Q. . 0:. 0. 060 s 0 . . .0 0. ~ 0.. O. .0. .00 0 . 1-0 .. 0000. 6. ”0. . .0... ..‘00606. Q0'10...I0 160.010.01; 0. .0.. 0.0..|. ’Q-v. 003.130.150 .Q...000-7.u . .006 - 0....|....o. 00 60...... . . 1000-00-0 6,. . 0000. .. .021: 6. ... 1.1.... 00.. . . 0.1100. .00 ... . ,- 0- 9. . . .66 - 6. .030_0....000.0.0-0< 00.7,6060 . . . . . 0 .00-0 0 .n0 0-1 ... 16-6. .. . . . . .60.. 0.. . 3.4.11. 16.....- . ...-..0 ...6. 0.... . 90.00. . . ... . ... . a... . . . 0 .600 ‘10-..016) . 00... 0030 0rl0. 0 3‘ .0606 V0 0 106.9992! . ... . .0 0. ..0... ......70- 0 . 209.... . . . . . I... .~ ..0 .0.... ...-6.0.. . . S 6.000.- s. 6 .0 00.. 1|. 0.0.5.6. .1004. 01130.1 . .. ......2. ...-......t... ._ ...... ..0. ....03. o... .6 . .01... .00... .... . ......083..\-6.6.l.0000..60.60.|.. 11.00.060.916...- .. 03...!!13t01‘ . ... ... : . ... ...3-P3131 .... . .51... 0.0.10.3. 3.00.1. . .0.-0.000001”- .110101 16.0. 6’00 )0 1‘6 . . . ..- ... . .... .... ....- ...... ... . . 03.1.3.6: .3091}: aka)... ...3-I031 . .u‘. ..0 ...3-.06 . . ... 2 : M6 .... . 0- .0.1.016. 000.- 01.. 0 0600. :61... . .... :10... 3..... 0.. . . . .. .0 0.0.0.919? 01199-010000 .. .1 . .6... .01.... . .. .0400... .. .0... 0 6 6-. 6. ‘0... .... ... “...? z- . . . . . . .. .... 097. . . .. . . . . .. . . ..9 ...?90220 002...; . a. . .61. 0.0.1.0 0 910...: . . -.. 01 I ..1 I 0 0:080 .. . 6 . . . .. . . . . . . . . . . . ,wd. . .. . 5.900.111.10..00Q- .0.!9.00028100r|.t$9 .. 3... l7 , L o. ....6. . . . . . ...: .... . . . . . _ . . . . ...q ..0. . . . . . .. . .1003.- 0 ....IA 3.3... 09.0...3.9..10000. .10; . . . . . . .....T . . .... .. . u. o... . .....2 -. -. . -. . .. 3 {0160.0 -610... \....£0.l (.0. .0.-9....31106993 ...-.0600 0 0 .9 .0. . . . 0 . 0.. 0 . . . . 5". .0....-l |.03 0.01.0.- 10.¥Tl.' ‘10.! 1530;000:0060! \.. 0.6-0 . 8 .. 0 - . ..0 .0.. .0. .. . . . . . . .... 60. . . . 1101 000. 0.0...0. 6.. 0 1.: 0 .00- 00... 1. ...0.. l ....0 . . . .. . . . . . . ...... . ..k .... .. . . . . . :P.’ 3.0... ..... l1....h£.l-«nu-00.00F9. 0.30.00.66.06. i . . . I 2.0.. .. ....Hasn 0.0.36.1... .010..‘0£..‘0| .0...|-6|.6|0¢..0u ‘0. .000... . . .0... ..000. .. ”g. 0 . . .00.?! A “0:6 60. #39,}! ..0". 300001301: .- ... . 1... 26.. 1-0... . 6. .7? . .0306). 20.1.1. 0.0.0.0...1-5. 09.! 0.3. 00 .01,- 0.90”” : 6 .. -.. . .. to... .. ... .... .0. ... .0... .. .51.:(0..0008.|06:}I0.0|.f 3..-191.00.103.00 00 . . l 6 300.00.006.2011-‘0‘ 3..-00.0.3.1... 0030-11-00- 0 0- - 0 a. . .0 . 90‘... .-.-Q... . . . . ...- ....1.Q0.0 .00000. ... 0n.- 0 D . f 0.10 Q6. 0. . 0.0 0‘ 0. I.- [0.20 0 000. I . c.3393. 0.. 0-0}. . ... . ....0’2 - .....S. .. . 76.3-6.0 09 a 0 00010.00. 0 0 0001.0..l. 0. ..Q Q .0.. . . . .. 00 ...-10.0000! . . .1 10.10 0. 0. . . . . . .. . .. 0 .009 . 00 . .. 0 90:36-12... .0 00- 0 0.0. . ...... . ..0 6.0!. 001 0.0.13. 0’ .0 : 00 ..3. . .thuo ... 9 ......I ..Q- ..0-0 .6..I!0.\.0 ...7. . Q. 000.1 v.0Q0.. 6.0. .0 .1900... ...0. . ... . . ...0. ... ..0 -|.... . .‘.0 .0 ..I 0 .00.. 6.0.. A- 0... 000 1‘0. 00. u... ....6. .0..00.‘00...00.. . . .00.6|... ....06006 10.1 0 o. -|. . . .0. . :73... ... 3. .. . . L. .0060... 0 00......» ...J I... ... .0. .30.: 000., .... ..0... _ .0 ..0 060 ..0: 9 0. .30..V0-01-. ..l. ... .0. . . 0 .. .... . _ . .. - 14.0....0-‘00. ..I .0. . .. .... .o .0. 0.0“... . . 00. .n. . 6 ... .60.... 00.0 ..I 00 0 00.... .60 0 .\.. .0 .6. Q..‘ ..00 .00... 5‘” ”5.0 .. 10.... ..0. 6.1 ... .6 .0. . 0-0 6...0. . . 0.0-. . O. . . run“. I. L... ..0. .....0. . .0 ..0 ans... 0. .60. 10.20. 0600-3... 0.0.9160. 0.26.! ..u.0»-J0.00...-00100.6. . . . _. 000 0- X - .16.. .0.-.0391, .21”, 16. 0.0. ...-360.001.00.00 0.0!? O ... .0 .0. . .6 .6 6 .0 v . . 06000 '. 0 0.0.00 . . X0 - :0 . ..0 02.0. 0. 6 . t 0 ...0- | .- . .6. . . ... . .. . .. . . . (-91.111630. . . 66 0.0.10. .0.1.11 .9 .006...-.’. 0 -\‘)00.’90)I§1\‘|6000“§0§0. ......- 6- 0. .0.... .Q. ’0... 00.0.0000 .. | .0 . . . . . . . . . . . . . .. . 11...... 106 .0900. .....91 6.06.000? C. 003.311.912.917: 6 .‘010'3'6’000 6:69.01991‘31 .... 50...? .0 ... 0. ..0. 6.. . . . . . . . . . . . . . . . .. .. . . .. . . . . . ... . : .10....6)! . 2”... 6.0.3.1.]... I. ...v .210. .fi70601620M0-6‘J06'!|.‘01 .9300. 01.090. -. ...0 . . .0 . .6. v. . u . . . . . . . . .. . . , . . . . . . . . . . . . . . ....33-1-906: 01.696.01.300.)- .6.H 16.0 0 .0....nl900-019.‘ .0. . .. 3.00 . 000 . .. ..0 91 . . 0 0 0 ... . ...-1.6 - 0 .7. 0 ..66 61 . ...... . 9 || 10.00 000.0,... “.0.-.1030”. 0" (0.0.9191-00! . . . .6 n0 “.0..." 6 M3 0 ...-0. 0.0.... .06. 9 J... . ...10}M.L.».“.. ...». -. 0-... - 0. Q.. 0.. .0. . .. 0. 0.. .000. 020.16.0-‘6‘ N "who. lrfi1‘uouvi‘o 900. 0. .0 - 0 ... 6 00’ 9 . . . ' ”0-.- .. .3001... 6. ...-.09.. ...“.X. ... 6.70 .0 .. . ..0. . .3“... . . 6.... ..-. . . 0...? .. 6100060 06.14 10312001519090.0001. 901,665."..rah‘ I ‘3... .2000 1‘ 1.0. ... '0. .0 0 ..0 a .0 . 3.7.. 0. . 1.65 - .. .. ...... 0 ,0. 05-11-0000. . 0161961000.:0000’ ".900112!00¢0i92930 609.6- 161..- . .0.. .6 . .7 ....... p 106;. - .. . 6a.. .08.. 19.0.. ...-0. 1.06-1006...:6111\.\2.-.306 totittglfii . .. .. .. - .. 9 . ...-1.106. . ...- 701 0. 1.,0.0.0.\0h.- 66.0 .493. 619.0060|0000 1.0006019001996-060110190‘. — .0"... .anvfldn . . . . .9 . "H0... 0. 0.00. ..0..- .. .- ..0.Q£.0....~u.6-.0 0... . 0. 0-1.0 A. 0... -0 00.060. ...... . ..1. 9 0. 0.. . 0 .-. 0 0. ..100‘0w06- .‘60flkfl‘hflfi6‘gflu 0 IO-il’. . 1 0.16006IONL4 ‘0” O 00 - . 1 . .. - . 62669 .06 0 .0‘.0- 0100. 60. . . .. . . . v . 0‘- 00... 11010 01 1 . 0 0' 00 ‘ .1.... .. ... l.0.t. . . $.31... . ....10. -. ..0. ..6 . ... . . .... .. .. . . . . 3..... . ...0. .01.... - ..0; io- .. . . . . . ... o . .. ...-.... «3.. .0. . ,_ .. . . .. . .. ...? . 1. . .. . . . ... v.4... .. ...... ... 7- . .1. .910... .J.....o|..v.00 .... . 9 6.060100173009006. .l..0-0.000663‘ 1. 36%. 0.1 4.. 00C. 7. .. J... . . ... .. . 0-1.. .. - .. ...: . .60 . .. . .0. . . . . . .... \. .. . ...606 ....0. .. .0. . .6! 5-6 I. .90 -.. l . 0 . -..|. 0.. 0 ..”0.uu0....d 16.3”” .0.“.. 06:. n-.n.:....-00. 01. ... .0 0a.... .0.“. ....00 .6.— -. ..0 6. 01'."- .l.6.0.0.0.:00. . ..00 . ”.02.... J0 . .1 . .. , ..l. . .0... .0 .. . .1.- 0 {-06.6 0a 0. - . .. 6 . . . 66.. .. . .00 ... .. 0.. .0. .060 .. . I. . ...0.!.. ... . . . .6 . . . ...0. w.... ... ...v ....nml.400. 0.. 31.13.. 5? .15.. . .X. .... .60 00...}. 06 ...3. 0.. .0. . .. ...61 3..-... .06 6... . . . - 0.3016 .. x 1.- ..3! . ... 6 . .66.. . . . . .. ...!- 61. 0;. n... . 6. .0 ... . .. 0. 0.. ...? . . . . 0 ..0. . .00 .0. ..L . - . o ..9. . .0 - ... 20.6.1... . . . 0-90. . ..- ..0o-0.0 .. 0 . . . 1.0.... ..u $6.0 .. ... . . .. . 0... 0:0: 00.. .6. 0. .. . .v 6 i H . . .60...6000 .0..00- k. u: 0.000.. . ..00. . 00. 0-. . . 0.. .. ...»..0 .0 ..00. . .0Q .0 o. . .. ... 0. 0 .v .. ...: . ... ... .... ... up . p .0 ..L..‘60. 0. .1‘01 . . . . . . . 0....00. .. .10.. ....0... 66 0.016....- .0 . 00 . . I110 . . . . . . 1.0.. . 5.61:0...- .0. ..0. .. 69. ......6... 06“. .II.J. . . .. . .. J 0 .0011... . O . 7.0.. .. 0.. Q". 96 1.66. .3662... 6. . .. (..0...- . 00.0.5 0. ..10 00- 9- . ..00H0....0000 06.. 0 0 .. . 1. 9.. 6 0. . 076-6 60.... 001...... Z .... 090.. . . 6.... ... V 0 .. o ...9: . .‘6 . .. ..3-0...! . . . . -0.- . .. 0 64.0... . . . ..‘l0000 . . 00.0 .v - . .1 .09 o 6.. 0 .0 .. a. 06- 0. .lo! I 00. . r: ’- ... .. 6. ... . 6. V .0 .0... . c - ... 0. 6. 0O. 000 -\ u. .0 .0 t 0- . .. .6 6000. . L. '60. . ... 6 6 I 7-0. 66.0.00 0 .. 0 0 01.6.1 , ... to. .6...0... .-. 0 ...-...0 ,0 o . .0 . ..0. _ 1 :0 ..HI “ufi _ L “0.0 9 .J Q.6v “€010 01.0 ..0-M. .0.-00w 0.. .X‘A - ..0-6L . L.f‘.‘~. *Wfioku‘.” ’0'.uu..000-..0‘.-u|.&- &... .0 PW“? 0|! . . - 0h:- 0 . . - :o . 0.0 3 . . . . 0|»..0...0.....6_mu 1‘ .u .0 1‘ him-16"....u. ...“. %.h.0.l0 .0. “I I ._ $1. om..0~.\.0001 . Qua-ymfit-F 0.". . ...:w.-Qu.. L02» . AZWMTM. u.“ 3. .00.w.0l|..014«\m0n.0'00.5..£. l. .... . .01.». .. ... . r 4". ... 11.6.. Mal»: . ...... .. .... ...-10.10.343.19 6. 1.. 3:... 1...... w...- .... . #0.. 1.5%.... 0.... . 0. t . .0 . at: 0.. 0;“. fit‘“? 5...). Hum-... 3.06.00. ..0.. 0...... 001.031. ...-0 do. .u... ... “...... .10... . .0.-03.002090. .. ...! 3.10.1119 ... b. . .00. .. 6 ”6L .. . . 1:0.....0J:.0 .... ... 0.. .....6 H... 0hr ..h. 0PM:. .0..." . .001“ 0. A 00 0-... iv0....| ..0 .00.. ..th watt... 1...}. .0.. ..0" 0... . .. ....L: r...r._. 06.. .w_ 1.0.0.0100... 0.20.0.1..- . . .66 I ...-00.. . .7. 0.-0 ....Q . .. '1'6 0.0. n .. .0 This is to certify that the dissertation entitled BOOSTING AND ONLINE LEARNING FOR CLASSIFICATION AND RANKING presented by HAMED VALIZADEGAN has been accepted towards fulfillment of the requirements for the Ph.D degree in Computer Science 9 4 / Major Professor’s Signature 09/27 /2o\ 0. Date MSU is an Affinnative Action/Equal Opportunity Employer LIBRARY A“ Michigan State University__ ——.——u——, PLACE IN RETURN BOX to remove this checkout from your record. TO AVOID FINES return on or before date due. MAY BE RECALLED with earlier due date if requested. DATE DUE DATE DUE DATE DUE 5/08 K:IProj/Aoo&Pres/CIRC/DateDue.indd BOOSTING AND ONLINE LEARNING FOR CLASSIFICATION AND RANKING By Hamed Valizadegan A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Computer Science 2010 ABSTRACT BOOSTING AND ONLINE LEARNING FOR CLASSIFICATION AND RANKING By Hamed Valizadegan This dissertation utilizes boosting and online learning techniques to address several real-world problems in ranking and classification. Boosting is an optimization tool that works in the function space (as opposed to parameter space) and aims to find a model in batch mode. Typically, boosting iteratively constructs weak hypotheses with respect to different distributions over a fixed set of training instances and adds them to a final hypothesis. Online learning is the problem of learning a model when the instances are provided over trials. In each trial, a new sample is presented to the learner, the learner predicts its class label and then receives some feedback (partial or complete). The learner updates its model by utilizing the feedback and then a new trial starts. We consider several learning problems, including the usage of side information in rank- ing and classification, learning to rank by optimizing a well-known information retrieval measure called NDCG, and online classification with partial feedback. Using side information to improve the performance of learning techniques has been one research focus of machine learning community for the last decade. In this dissertation, we utilize the abundance of unlabeled instances to improve the performance of multi-class classification, and exploit the existence of a base ranker to improve the performance of learning to rank, both using the boosting technique. Direct optimization of information retrieval evaluation measures such as NDCG and MAP has received increasing attention in the recent years. It is a difficult task because these measures evaluate the retrieval performance based on the ranking list of documents induced by the ranking function, and therefore they are non-continuous and non-differentiable. To overcome this difficulty, we propose to optimize the expected value of NDCG and utilize boosting technique as the optimization tool. Online classification with partial feedback is recently introduced and has applications in contextual advertisement and recommender systems. We propose a general framework for this problem based on exploration vs. exploitation tradeoff technique and introduce effective approaches to automatically tune the exploration vs. exploitation tradeoff param- eter. © Copyright by HAMED VALIZADEGAN 2010 To my loving parents, Simin Rahimi and Reza Valizadegan, for their unlimited and unconditional encouragement, support, and love. ACKNOWLEDGMENTS During my Ph.D, I have received support from a number of people without whom the completion of this thesis was not possible. First of all, I would like to express my deepest gratitude to my thesis advisor, Dr. Rong Jin, for his unique supervision and guidance. He motivated me to work on a diverse set of problems in machine learning and provided me with an excellent mathematical and optimization knowledge support. Under his supervision, I have learned different aspects of conducting high-quality research and become capable of publishing papers in prestigious research venues such as NIPS and WW. For a number of years, I have also worked closely with Dr. Pang-Ning Tan with whom I published a few papers in data mining. I would like to present my sincere appreciation for his valuable support during those years. I will never forget his kindness and help. I would also like to thank my committee members, Dr. Anil K. Jain, Dr. Joyce Chai, and Dr. Selin Aviyente for their valuable feedback and discussions during my compressive and thesis exams. I want also thank the Department of Computer Science and Engineering at Michigan State University that provides me with the financial support in terms of teaching assistant for a number of semesters. I would like to particularly thank Dr. Abdol-hossein Esfahnian, Dr. Eric Tomg and Linda Moore for their amazing attitude in helping graduate students in the department. The contextual advertisement group of Yahoo! kindly provided me with an exceptional work atmosphere during Summer and Fall 2008. I would like to thank everyone in their group, particularly Dr. Jianchang Mao, the head of contextual and display advertisement science and Ruofei Zhang, my direct mentor. It has been a great pleasure to collaborate with Dr. Hang Li, the research manager of Information Retrieval and Mining Group at Microsoft Research Asia, and Dr. Shijun Wang vi from National Institute of Health with whom I co-authored research papers in ranking and online learning, respectively. Finally, I should thank the members of LINKS and PREP labs for all the great supports they have provided me with during my Ph.D. Particularly, I would like to thank Wei Tong, Fenhjie Li, Yang Zhou, Pavan Mallapragada, and Matthew Gerber. vii TABLE OF CONTENTS LIST OF TABLES xi LIST OF FIGURES xii 1 Introduction 1 1.1 Classification ................................... 2 1.2 Learning to Rank ................................. 3 1.2.1 Training set ................................... 5 l .2.2 Evaluation ................................... 6 1.2.3 Learning .................................... 6 1.3 Batch Learning .................................. 7 1.3.1 Boosting .................................... 8 1.4 Online Learning .................................. 11 1.5 Contribution of This Dissertation ......................... 13 1.6 Benchmark Data Sets ............................... 15 1.6.1 Classification Data Sets ............................. 15 1.6.2 Ranking Data Sets ............................... 16 2 Semi-Supervised Multi-Class Boosting 18 2.1 Introduction .................................... 19 2.2 Related Work ................................... 22 2.3 Multi-Class Semi-supervised Learning ...................... 23 2.3.1 Problem Definition ............................... 23 2.3.2 Assemble Algorithm .............................. 23 2.3.3 Design of Objective Function ......................... 25 2.3.4 Multi-Class Boosting Algorithm ........................ 27 2.4 Experiments .................................... 3 1 2.4.1 Experimental Setup ............................... 32 2.4.2 Evaluation of Classification Performance ................... 33 2.4.3 Sensitivity to the Combination Parameter C .................. 36 2.4.4 Sensitivity to Base Classifier .......................... 36 3 Optimizing NDCG Measure by Boosting 40 3. 1 Introduction .................................... 41 3.2 Related Work ................................... 43 3.3 Optimizing NDCG Measure ........................... 44 3.3.1 Notation ..................................... 44 3.3.2 AdaRank Algorithm .............................. 45 viii 3.3.3 A Probabilistic Framework ........................... 46 3.3.4 Objective Function ............................... 48 3.3.5 Algorithm .................................... 50 3.4 Experiments .................................... 54 3.4.1 Experimental setup ............................... 55 3.4.2 Results ..................................... 56 4 Ranking Refinement by Boosting 58 4.1 Introduction .................................... 58 4.2 Related Work ................................... 61 4.3 Ranking Refinement ............................... 62 4.3.1 Problem Definition ............................... 62 4.3.2 Encoding Ranking Information ......................... 63 4.3.3 Objective Function ............................... 64 4.3.4 Boosting Algorithm for Ranking Refinement ................. 69 4.4 Experiments .................................... 74 4.4.1 Experimental Setup ............................... 74 4.4.2 Results for Relevance Feedback ........................ 77 4.4.3 Effect of Base Ranker ............................. 78 4.4.4 Effect of Size of Feedback Data ........................ 79 4.4.5 Results for Recommender System ....................... 79 4.4.6 Time Efficiency of Ranking Refinement .................... 80 5 Online Classification with Bandit Feedback 85 5. 1 Introduction .................................... 86 5.2 Related Work ................................... 87 5.3 A Potential-based Framework for Classification with Partial Feedback ..... 88 5.3.1 Problem Definition ............................... 88 5.3.2 Banditron .................................... 90 5.3.3 Potential-based Online Classification for Partial Feedback .......... 90 5.3.4 Exponential Gradient for Online Classification with Partial Feedback . . . . 95 5.4 Experiments .................................... 97 5.4.1 Experimental results .............................. 100 6 Robust Online Classification With Bandit Feedback 102 6. 1 Introduction .................................... 102 6.2 Related Work ................................... 105 6.3 Balancing between Exploration and Exploitation ................ 106 6.3.1 Preliminary ................................... 106 6.3.2 Finding Optimal 7 using [’y‘t aé gt] 3 rt and [9} = gt] 5 pt .......... 108 6.3.3 Finding Optimal 7 using [37¢ aé gt] 3 1 and {Q} = yt] _<_ pt .......... 110 6.3.4 Finding Optimal 7 using [3} 76 gt] 3 rt and {1;} = gt] 3 1 ........... 111 6.4 Experiments .................................... l 12 6.4.1 Experimental Settings ............................. 112 6.4.2 Experimental results .............................. 1 l3 ix 7 Conclusion and Future Work 116 7.1 Summary and Conclusions ............................ 116 7.1.1 Boosting .................................... 116 7.1.2 Online Learning ................................ 118 7.2 Future Work .................................... 119 7.2.1 Boosting ........... ' ......................... 119 7.2.2 Online learning ................................. 120 APPENDICES 122 A APPENDIX 123 A] Proof of Lemma 1, Chapter 2 ........................... 123 A2 Proof of Lemma 2, Chapter 2 ........................... 124 A3 Proof of Theorem 4, Chapter 2 .......................... 125 A4 Proof of Proposition 2, Chapter 3 ......................... 126 A5 Proof of Lemma 4, Chapter 3 ........................... 126 A6 Proof of Theorem 5, Chapter 3 .......................... 127 A.7 Proof of Theorem 6, Chapter 3 .......................... 127 A8 Proof of Theorem 7, Chapter 3 .......................... 128 A9 Proof of Theorem 8, Chapter 4 .......................... 130 A.10 Proof of Lemma 5, Chapter 4 ........................... 131 All Proof of Theorem 9, Chapter 4 .......................... 131 A.12 Proof of Theorem 10, Chapter 4 ......................... 132 A.13 Proof of Proposition 5, Chapter 5 ......................... 133 A.14 Proof of Theorem 11, Chapter 5 ......................... 133 A.15 Proof of Lemma 7, Chapter 5 ........................... 135 A.16 Proof of Theorem 14, Chapter 6 ......................... 135 A.17 Proof of Proposition 6, Chapter 6 ......................... 136 A.18 Proof of Proposition 7, Chapter 6 ......................... 136 A.19 Proof of Proposition 8, Chapter 6 ......................... 137 BIBLIOGRAPHY 138 LIST OF TABLES 1.1 Description of the classification data sets used in this dissertation ...... 16 1.2 Description of data sets in Letor 3.0 ...................... 17 xi 2.1 2.2 2.3 3.1 4.1 4.2 4.3 4.4 4.5 4.6 5.1 5.2 6.1 6.2 LIST OF FIGURES Performance comparision ........................... 35 Sensitivity to parameter 0 .......................... 37 Sensitivity to the base ranker ......................... 39 The experimental results in terms of NDCG for Letor 3.0 data sets ..... 57 Reduction of the objective function Lp using the OHSUMED Data Set . . . 71 NDCG of relevance feedback for different algorithms ............ 81 NDCG of MRR with different base rankers for relevance feedback ..... 82 NDCG of MR with different numbers of feedback ............. 83 The ranking result for recommender system ................. 84 Running time of MR for different numbers of movies ........... 84 Performance comparisons of different methods ............... 98 Performance comparisons of different methods with varied 7 ........ 99 The error rates of Banditron with different choice of 7 ............ 104 The error rates of different methods over trials ................ 114 xii Chapter 1 Introduction Learning is the task of constructing a prediction model using training data. A learning task is defined by an objective function that evaluates the performance of each model in the do- main. A variety of objective functions for learning are defined for different learning tasks. These learning tasks differ in I) their type of prediction, H) the type of feedback/labeling for training data, and III) the way training data are presented to them. Based on the type of prediction, the learning algorithms can be classified into three major groups: classification, regression, and learning to rank. A regression model aims to map an instance to a numerical value. A classification model (classifier) categorizes instances into predefined classes and a ranking model (ranker) orders a series of items based on a given request. Training instances can be presented to the learner in two different ways: batch mode and online mode. In batch mode, a set of training instances are provided to the learner and the learner trains a model off-line. The learned model is evaluated based on the prediction made for unseen test instances. We usually assume the training instances are i.i.d samples from an unknown distribution and the objective is to learn a statistical model that is able to make accurate prediction for unseen instances sampled from the same distribution of the training data. In online mode, the task of learning and making prediction are performed at the same time; i.e. the learner applies the current model to each received instance, and then receives the feedback for that instance and consequently updates the model based on the instance and the feedback. In online mode, we do not have to make the i.i.d assumption regarding the received instances and the data generator produces instances arbitrarily [l]. The feedback for the training instances can be either partial or full in online mode and the label for training instances can be either present or absent in batch mode. Each of these combination results in different learning tasks. When we discuss batch learning in more details in section 1.3, we cover a brief description of semi-supervised learning, in which part of training instances are unlabeled; we discuss online learning with partial feedback in Section 1.4 where the feedback only indicates if the predicted class is correct. In the following sections, we focus on classification, learning to rank, batch and online learning to draw the direction of materials in the future chapters of this thesis. 1.1 Classification Classification is the task of categorizing instances into predefined classes and has found countless number of applications. In the fully supervised mode, the learning algorithm re- ceives a set of labeled instances, each represented by a vector of features and a label that shows its class assignment. The objective of the learning algorithm is to learn a classifier that is able to make accurate prediction for unseen examples, generated by the same distri- bution for training instances. The ability of a learner in producing models that perform well for unseen instances is called generalization ability [2] in the machine learning literature. Many effective algorithms have been proposed for the task of supervised classification, such as Support Vector Machines (SVMS) [3], logistic regression [2], and boosting [4]. Classification is one of the oldest machine learning tasks. Nonetheless, it still finds applications that demands developing new techniques. One of the major challenges we address in this dissertation is to learn a classification model from partial feedbacks. As an example, consider the problem of contextual advertisement that chooses advertisements to display on a web page for a specific user [5]. The contextual advertisement algorithms are usually based on this assumption that users provide feedback by clicking on relevant advertisements [5]. However, if none of the displayed advertisements are relevant to the user’s information needs, they will not be clicked and consequently the algorithm does not know which advertisements are relevant for the user. We refer to this scenario as partial feedback as opposed to the case of full feedback where the correct output (i.e., the relevant advertisement) is provided for each instance. This task demands new online learning algo- rithms that are able to learn over the trials in the partial feedback setting. In particular, the online algorithms need to explore the exploration vs. exploitation trade-off techniques that are primarily developed for multi-armed bandit problem [6]. The performance of a classification algorithm is usually evaluated by the classification accuracy. For the evaluation of multi-class or multi-label learning, the classification ac- curacy may not be sufficient, particularly when the number of classes is large or classes are unbalanced. In those cases, the most commonly measures used for classification are precision, recall or a combination of these two, such as F1 measure and ROC curve. 1.2 Learning to Rank Ranking is the task of ordering a list of offerings for a given request. It receives a set of offerings and a request as input and outputs the list of offerings sorted according to their relevancy to the request. The performance of a ranking algorithm is evaluated based on how well it sorts the offering according to their relevancy to the request. Learning to rank is the task of learning a ranking function that can order the offerings for unseen requests. It receives a set of requests, each with a sorted list of offerings as the training set and produces a ranking function to sort offerings for new requests. Learning to rank is a relatively new area of study in machine learning that has received much attention in recent years because of its important role in a variety of applications including: 0 Document Retrieval: In document retrieval, the request is a textual query (a set of keywords) and the offerings are documents. Users provide a set of keywords to the system and the ranking system should retrieve the most relevant documents to those keywords. o Recommender Systems: In recommender systems, the request is a user and the offerings are the items to be recommended. For example, in movie recommenda- tion system, a ranking system aims to recommend the most interesting movies to a particular user based on the history of users and movies information. o Sentiment Analysis: In sentiment analysis, the request is a text and the offerings are the attitudes of the author regarding to a particular subject. 9 Computational Biology: In computational biology, a request is a protein and the offerings are the list of different 3d structures. The objective is to provide a sorted list of 3d structures for a given protein. 0 Online Advertisement Placement: In online advertisement placement, the request is a user visiting a web page and the offerings are the advertisements. Online adver- tisement systems should rank the relevancy of different advertisements to that user and display the most relevant advertisement on the web page in order to maximize the number of clicks on the advertisements. Throughout this thesis, we use the document retrieval terminology (e.g. query for request, document for offering) when talking about ranking although the material are applicable to other domains. Since learning to rank is a relatively new problem, we describe it in more details here. A learning to rank system usually consists of three components that distinguish it from classification and regression. 1.2.1 Training set The training set for learning to rank consists of a set of queries. For each query, a list of documents and their relevancy to the query are provided. The common practice in learning to rank is to assume the existence of a set of base rankers that can be considered the feature generators for query-document pairs. PageRank [7], vector space model [8], and statistical language models [9] such as BM25 are some example base rankers. These base tankers are basically unsupervised models that measure the relevancy of each document to a query. The value produced by each base ranker is considered a feature for a query-document pair and the learning to rank algorithm aims to combine these feature values to produce a ranking function. The label information in learning to rank is in form of relevancy judgments that can be of three different types: relevancy scores, pairwise relevancy information (partial ordering) and a complete ordering. A relevancy score is a numerical value (e.g. 1,2,..) that shows the level of relevancy of documents to a given query [10]. Relevancy scores are the most widely used relevancy information. The pairwise relevancy information is the relative relevancy between two documents that indicates which document among the two is more relevant. The pairwise relevancy can often be derived from the implicit feedbacks from users. For example, in search engines, when a user clicks on one of the ranked documents, it is safe to infer that the clicked document is more relevant than the documents that are ranked before the clicked one. This type of click-through feedback provides the relative relevancy for pairs of documents [11]. A less commonly used relevancy information is a complete relevancy ordering of documents to a given query [12] in which documents are ordered in the descending relevancy. Notice that the relevancy scores can be converted to a pairwise ordering and complete ordering but the Opposite is not true. 1.2.2 Evaluation The performance of a ranking system is evaluated based on how well it predicts the rele- vancy of documents to a query. Several evaluation measures are introduced in the literature. Area under the ROC Curve (AUC), Mean Average Precision (MAP), and Normalized Dis- counted Cumulative Gain (NDCG) are some of the most-widely used measures. AUC is based on the Wilcoxon test, a nonparametric statistical test to measure the distributional difference between two sets of numbers. AUC works only for two levels of relevancy judg- ments and measures how well a ranking function places the relevant documents on the top of the irrelevant documents. AUC treats documents similarly regardless of their position in the ordered list. However, the top retrieved documents are more important because users only look for the relevant documents at the top of the list (e.g. consider a search engine in which users only look at the first few pages of retrieved links). Based on this observation, MAP [13] and NDCG [14] are constructed to put more weight on the documents at the top of the list. Similar to AUC, MAP only works for binary relevancy judgment. On the other hand, NDCG is a general evaluation measure that can handle ranking problems with multiple levels of relevancy judgements. 1.2.3 Learning Three types of learning to rank algorithms can be found in the literature: Pointwise, pair- wise and listwise approaches. Pointwise approaches [15-17] can be applied when the relevancy scores of documents are available. In this case, the relevancy scores are consid- ered as absolute quantities and a classification or regression technique is applied by treating the relevancy scores as class labels or numerical values. The pairwise approaches are the only group of techniques that can handle the pairwise relevancy information. They ap- ply a classification or regression technique to learn the ordering information of pairs of documents [18-23]. The third group of algorithms, the listwise approaches, are the most effective learning to rank techniques that have been studied in the last few years. They are motivated by this observation that most evaluation metrics of information retrieval measure the ranking quality for individual queries, not documents. These approaches consider the ranking list of documents for every query as a training instance [13, 24—29] by optimizing a listwise loss function. We describe these techniques in more details in Chapter 3. 1.3 Batch Learning In batch learning, a set of training instances are provided that are generated by an unknown distribution. The goal is to train a model off-line that is capable of making accurate predic- tion for unseen instances. As mentioned before, dependent on the type of training instances and their labels, different learning tasks can be defined. For example, in classification, each instance is a vector of features and the label is the class assignment. And in the listwise approach to learning to rank, each instance consists of a query, the list of its documents, and the relevancy of documents to the query. Training instances can be either all labeled or partially labeled that results in two dif- ferent modes of learning: supervised and semi-supervised learning. All training instances are labeled in supervised learning and plenty of unlabeled instances are provided in case of semi-supervised learning to help the process of learning. The usage of unlabeled instances are based on some assumptions about the data generating process such as manifold and cluster assumption [30—35]. We return to these assumptions in Chapter 2. In most studies of batch learning, an objective function is designed to measure the performance of a given model (function) on instances. Different learning algorithms can be designed by defining different objective function for the same task. For example, in case of classification, the negative log-likelihood function is used in logistic regression, a hinge loss leads to support vector machines, and etc. In case of learning to rank, the pointwise approaches utilize a classification or regression model, i.e. they utilize a classification or regression loss function. Similarly, pairwise approaches are concluded from designing a classification or regression model on pair of documents and a listwise learning to rank algorithm results from utilizing a loss function in the level of query. Given an objective function (loss function) L(F) to measure the performance of a given model F, learning translates to the process of finding F that optimizes L(F). A common approach is to restrict the model to a member of a parametric family F (w) (e.g. a linear model). This constraint translates the objective function L(F) into an objective function of parameters 212, i.e. L(w), and consequentially the optimal model is found by optimizing the objective function with respect to w. In this case, L(w) is called a function in the param- eter space. A different approach is to directly optimize L over function F. This approach optimizes the objective function in the function space and is called boosting. Boosting is the optimization technique we utilize in this thesis for the batch mode algorithms we cover. 1.3.1 Boosting Boosting [4, 36] is a popular technique with a greedy nature designed to optimize a given objective function in the Space of functions. This is very important because it allows to boost the performance of any base function (weak learner) once the problem is written in the function space. Boosting can be considered as a gradient descent algorithm applied in the function space [37]; in each step i, it learns a new direction ft and a step size at to move as much as possible toward the optimum point, which results in a final solution of Fn = 2L1 aifi. Instead of applying a direct optimization approach such as gradient descent, bound Optimization strategies [38] may be used; this is because f,- and a,- are dependent on each other and it is difficult to decide the values for fz- and O, simultaneously. The bound optimization strategy is often applied to decouple the dependency between f,- and ai. We use this technique in different parts of this thesis. First introduced by Schapire [4], boosting was initially designed to convert a weak learner that performs just slightly better than random guessing into an accurate classifier. Here, by random guessing, we mean a classifier with less than 50% classification error. However, as we will Show throughout this dissertation, the meaning of random guessing can change from one problem to another. In this view, given a set of labeled training exam- ples (xi, 3],), i = 1..n, a boosting algorithm provides the weak learner with a set of weighted training examples at each round. The weak learner constructs a model by optimizing its loss over the weighted training examples. In the new iteration, the boosting algorithm pro- duces a new set of weighted examples by increasing the weights for the examples that are misclassified in the previous round. The iterations are repeated till the algorithm converges. One well-known boosting algorithm is AdaBoost [39], developed based on an expo- nential loss function for classification. Algorithm 1 shows AdaBoost algorithm. At the beginning of this algorithm, the booster chooses a uniform weighting over the examples (Step 3). Given the weights produced by the booster, the weak learner constructs a bi- nary classifier that minimizes the loss Ct at Step 5. The booster then produces a new set of weights for the examples in Step 8 by increasing the weights for the examples misclassified in the previous round of learning (Steps 6 and 7). These steps are repeated for a number of times. We have the following bound for the misclassification error of the final hypothesis generated by AdaBoost algorithms: 6 g 2TH;1 et(1— at) (1.1) where 6t is the classification error for the hypothesis generated in round t. The above result shows that, under the assumption of weak classifier, the classification error is guaranteed to be reduced as the iteration proceeds. Using the Minimax theorem, Freund et a1. [39] showed that there is a mixed strategy over the space of hypotheses H that produces zero classification error over the training set if (H, X) is 7-learnable. The progress of a boosting algorithm is measured by how much the classification error (or a given loss) decreases at each iteration (or over time) and For 7 > 0, a learning algorithm is 7-leamable if for any distribution Q over training examples X, the algorithm can return h E H with at most % — 7 classification error Algorithm 1 AdaBoost Algorithm 1: Input 1. A weak learner, and a set of training examples 2. A set of training examples (x1,y1),...,(:rm,ym) where 1:,- E X and y,- 6 {—1,1}. 2: Initialize F(:r,-) = 0,2' = 1, .., m 3: Initialize 01(2) = 1/m,z‘ = 1, ...,m 4: repeat 5: Find the classifier ft : X —> {—1,1} that minimizes 6t = 2&1 Dt(i)I(y,- 74 ft($i)) 6: Compute at = %ln(l—:t—€t) 7 Compute F(a:,-) = F(:c,') + aft(:r,-), i: 1, ..,m (2.) = Dt(i)exp(20tyift($i)) 8: Compute the new weighting Dt+1 t malization factor. 9: until reach the maximum number of iterations where Zt is the nor- defined in the following form: M(Pt1 Q0) S HtT=16(M(htr Qt)) (12) where 6 is an increasing function of the loss, M (Pt, Q0) is the suffered loss when the majority vote Pt is used over H and Q0 is the uniform distribution over X (i.e. M (Pt, Q0) is the computed loss of weighted majority vote over the original samples), and M (ht, Qt) is the computed loss at round t (i.e. the loss suffered when a single hypothesis ht is applied over the weighted samples set Qt). Beside classification and regression, boosting has been applied to a wide range of ap- plications including: o Semi-Supervised Learning: Boosting can be utilized to adapt a supervised learner to the problem of semi-supervised learning. For example, [40] used binary classifier as the weak learner and boosted it for the task of semi-supervised classification and [41] exploited a binary supervised learner as the weak learner and boosted it for semi-supervised clustering. 10 0 Learning to Rank: Boosting is used to learn a ranking function to order the rel- evancy of documents for a query. RankBoost [19] and AdaRank [42] are example applications of boosting to ranking. RankBoost uses pairwise binary classifier and boost it for ranking and AdaRank adapts AdaBoost to optimize information retrieval evaluation measures such as Normalized Discounted Cumulative Discount (NDCG) and Mean Average Precision (MAP). 1.4 Online Learning Online learning is the task of learning when the examples are provided sequentially (over the trials). In each trial, the learning algorithm receives a new example, classifies it and then acquires some sort of feedback. Using this feedback, the online learning algorithm updates the model in order to better classify the future examples. The feedback provided to the online algorithm can be either full or partial. In the full feedback setting, after classifying an instance, the algorithm receives its true class label. One well-known example of such online learning algorithm is the well-known Perceptron algorithm [43]. In the partial feedback or "Bandit" setting, the true label is not revealed and the feedback is limited to whether or not the algorithm classified the instance correctly. Since the difference between full and partial feedback in the above discussion only makes sense for the case of multi-class classification, the online classification with partial feedback is called multi-class bandit learning [5]. The objective of the learner is to generate a sequence of hypotheses that guarantees a small cumulative loss in the long run when compared to the best hypothesis in the space of hypothesis; i.e. T T 1 1 . 5: 2:1: M (Pt. Qt) S ? mgn t§=1:M(P.Qt) + 5(T) (1-3) 11 where 6 is a decreasing function of T and should approaches zero when T approaches infinity. The bandit feedback has several real-world applications such as online advertise- ment [5] and recommender systems [5], as described in the following 0 Online Advertisement: In online advertisement, we often assume that a sponsored ad is likely to be relevant to the user’s query if it is clicked by the user, and irrelevant otherwise. In the case when the sponsored ad does not receive a click, the online advertisement algorithm is unable to locate the advertisements that are relevant to the given query, leading to the partial user feedback. 0 Recommender Systems: A recommender system recommends some items (e.g. movies) to the user. The assumption is that if one of the recommended movies are selected by user, that movie was a correct recommendation. However, if none of the recommended movies are chosen by the user, the recommender system is not able to discover the right set of movies for that user. While the problem of online classification with full feedback is well-studied, online clas- sification with bandit feedback has received attention only recently [5]. Kakade et a1. [5] introduced Banditron as an extension to Perceptron [43] to handle the partial feedback set- ting. Online learning with bandit feedback can be regarded as the problem of multi-armed bandit [44] when some side information (e.g. the feature vector of instances) is available. Multi-armed bandit is the generalized version of one-armed bandit game (a traditional slot machine) in which several levers are provided and the player aims to choose a lever that maximizes the rewards in the long run. At each stage, the player only knows the reward for the lever he chooses; the rewards for the remaining levers are unknown to the player. In a more abstract level, multi-armed bandit problem refers to the problem of choosing an action from a list of actions to maximize rewards given that the feedback is (bandit) partial. The algorithms developed for this problem usually utilize the exploitation vs. exploitation 12 tradeoff strategy to handle the challenge arising from partial feedback [45—47]. 1.5 Contribution of This Dissertation We address several important ranking and classification problems in this dissertation. Uti- lizing side information in ranking and multi-class classification, direct optimization of in- formation retrieval measures such as NDCG, and online learning in the bandit setting are the subjects we cover, as summarized here: 0 Semi-supervised Classification: The focus of semi-supervised classification is on constructing better models by utilizing unlabeled instances when the number of la- beled instances is small. Several semi-supervised classification algorithms are devel- oped based on manifold [32—35] and cluster [30, 48, 49] assumptions. Most of these techniques work for binary problems and converting techniques such as one-versus- one and one-versus-the-rest are applied to use them for multi-class problems [50]. This converting procedures has several well-known problems including imbalanced classification and different output scales of different binary classifiers. We utilize both manifold and cluster assumptions in Chapter 2 and design an objective function that directly addresses multi-class semi-supervised problem. We solve this objec- tive function in the function space using boosting technique. Our empirical Study Shows the superior performance of this boosting algorithm compared to the existing boosting algorithms for multi-class problems. 0 Ranking by optimizing NDCG: The objective in this problem is to learn a ranking function by maximizing Normalized Discounted Cumulative Gain (NDCG), the most frequently used information retrieval evaluation measure for ranking problems with multi level relevance judgement [10]. This is a difficult problem because NDCG is a non-differentiable and non-continuous loss function. In order to overcome this difficulty, we introduce the expected value of NDCG and solve it in the function space 13 using the boosting technique. The detailed discussion of this boosting algorithm is provided in Chapter 3. Ranking Refinement: In some real world applications, there are two complementary sources of information for ranking, ranking information given by the existing ranking function (i.e., the base ranker) and that obtained from users feedback. One example of such applications is relevance feedback, where the two sources of information are the relevance scores obtained from a ranking function like BM25 [51] and the relevance judgments obtained by the users. The key challenge in combining the two sources of information arises from the fact that the ranking information presented by the base ranker tends to be imperfect and the ranking information obtained from users’ feedbacks tends to be noisy. We encode these sources of relevancy information in form of pairwise relevancy and design an objective function to combine them. We also design a boosting algorithm to solve the resulting objective function. The detailed discussion is provided in Chapter 4 where we perform extensive experiments to Show the superiority of our proposed framework to several baselines. Online Multi-class Learning with Partial Feedback: Unlike online learning with complete feedback that has been extensively studied [52], the problem of online multi-class learning with bandit feedback was introduced very recently [5]. Ban- ditron, the first introduced algorithm for multi-class learning with bandit feedback, is a direct generalization of Perceptron to the case of partial feedback that uses ex- ploration vs. exploitation tradeoff strategy to handle partial feedback [5]. Using potential function and exploration vs. exploitation tradeoff technique, we develop a general framework in Chapter 5, of which Banditron is a special case. The major problem with Banditron is that its performance could be sensitive to the parameter that trades off between exploration and exploitation [53]. We develop an effective approach in Chapter 6 to reduce this dependency. l4 1.6 Benchmark Data Sets Throughout this dissertation, we use two sets of data to study the performance of the pro- posed methods, one set for multi-class classification and one set for learning to rank, as described in the following subsections. We use 5 folds cross validation to run all the exper- iments except for online learning. 1.6.1 Classification Data Sets Multiple benchmark data sets from UCI data repository [54] and LIBSVM web page [55] are used in our study. Here is the list and a brief description of these data sets: MNIST. MNIST is comprised of grey scale images of size 28 x 28 for hand written digits. It contains 60000 training samples, each represented by 780 features. Protein. Protein has 17766 samples, represented by 357 features and three classes. Letter. Letter contains 15000 instances of 26 characters, each represented by 16 features. optdigits. This data set consist of normalized bitmaps for handwritten digits from 30 people. It contains 3823 instances, each represented by 64 features. pendigits. This is another collection of images for handwritten digits. It contains 7495 samples, each represented by 16 features. Nursery. Originally developed to rank applications for nursery school, it has 12960 records, each represented by 8 features belonging to one of 4 classes (we removed one class that only had two samples). Isolet. Isolet contains 7797 Spoken alphabet that belong to 26 classes, with each letter forming its own class. Every spoken alphabet is represented by 617 attributes. Notice that for some of these data sets, there were two separate sets, one for training and one for testing. We only used the training set in our experiments. The information related to these data sets are summarized in Table 1.1. 15 Table 1.1: Description of the classification data sets used in this dissertation Instances Features Classes Isolet 7797 617 26 MNIST 60000 784 10 Protein 17766 357 3 Optdigits 3823 64 26 Nursery 12960 8 3 Letter 15000 16 26 Pendigits 7495 16 26 1.6.2 Ranking Data Sets We use data sets from information retrieval and recommender systems to study the per- formance of ranking algorithm in our studies. For information retrieval, we use ver- sion 3.0 of LETOR package provided by Microsoft Research Asia [56]. LETOR Pack- age includes several benchmark data sets for ranking, along with the state-of-the-art algo- rithms for learning to rank and tools for evaluation. There are seven data sets provided in the LETOR package: OHSUMED, Top Distillation 2003 (TD2003), Top Distillation 2004 (TD2004), Homepage Finding 2003 (HP2003), Homepage Finding 2003 (HP2003), Named Page Finding 2003 (NP2003) and Named Page Finding 2004 (NP2004). There are 106 queries in the OSHUMED data sets, with each query equipped with around 1000 man- ually judged documents. The relevancy of each document in OHSUMED data set is scored in three levels: 0 (irrelevant), 1 (possibly) or 2 (definitely). The total number of query- document relevancy judgments provided in OHSUMED data set is 16140 and there are 45 features used to represent each document-query pair . For TD2003, TD2004, HP2003, HP2004, NP2003, and NP2004 there are 50, 75, 150, 75 150 and 75 queries, respec- tively, with about 1000 retrieved documents that are manually judged for each query. This amounts to a total number of 49058, 74170, 147606, 148657 and 73834 query-document pairs for TD2003, TD2004, HP2003, HP2004 and NP2003 respectively. For these data Unlike the classical supervised learning, in Ieaming to rank,the representation of documents depends on the given query. Hence, features are extracted for each document-query pair, not just for individual documents 16 Table 1.2: Description of data sets in Letor 3.0. Query document pair Queries Relevancy level Features OHSUMED 16140 106 3 45 TD2003 49058 50 binary 63 TD2004 74170 75 binary 63 HP2003 147606 150 binary 63 HP2004 74409 75 binary 63 NP2003 148657 150 binary 63 NP2004 73834 75 binary 63 sets, there are 63 features extracted for every query-document pair. A binary relevancy judgment is provided for every query-document pair. This information is summarized in Table 1.2. For every data sets in LETOR, five partitions are provided to conduct the five-fold cross validation, and each partition is further divided into the training set, testing set, and vali- dation set. The retrieval results for a number of state-of—the-art learning to rank algorithms are also provided in the LETOR package. We will describe these algorithms in details in Chapter 3. In order to evaluate the performance of the proposed ranking algorithms for Recom- mender System, we use the MovieLens dataset, available at [57], which is one of the most popular data sets for the evaluation of information filtering. It contains 100, 000 ratings ranging from 1 to 5, with l as the best rating and 5 as the worst rating for 1682 movies given by 943 users. Each movie is represented by 51 binary features: 19 features are de- rived from the genres of movies and the rest 32 features are derived from the keywords that are used to describe the content of movies. To extract the content features, we down- loaded the keywords of each movie from the online movie database IMBD and selected the keywords that are mostly used by the 1682 movies. 17 Chapter 2 Semi-Supervised Multi-Class Boosting Most semi-supervised Ieaming algorithms are designed for binary classification. They are extended to multi-class classification by approaches such as one-against-the-rest. The main shortcoming of these approaches is that they are unable to exploit the fact that each exam- ple is only assigned to one class in the case of multi-class Ieaming. Additional problems with extending semi-supervised binary classifiers to multi-class classification include im- balanced classification and different output scales of different binary classifiers. Given that there are well-known multi-class classification techniques such as decision tree and multi- layer perceptron, the research question is whether it is possible to use these techniques as weak learner and boost their performance for the task of semi-supervised Ieaming. The main challenge in designing such boosting algorithms is that the definition of the loss for unlabeled exampels is not clear. One approach is to generalize the notion of margin for labeled instances to unlabeled instances. This approach computes the margin for unlabeled examples by considering their assigned labels at the current iteration of the algorithm. However, Since the labels computed in the early iterations is likely to be inaccurae, this strategy produces undesireable results. Unlike the exising boosting algorithms for semi-supervised Ieaming which are only based on the classification confidence (margin) of the exampels (i.e. cluster assumption), 18 we utilize both the classification confidence and the similarity among examples (i.e. the manifold assumptions) to design a loss function for multi—class semi-supervised learning. We further develop a boosting algorithm for efficient computation. Empirical study with the multiple benchmark datasets shows that the proposed MCSSB algorithm performs better than the state-of-the-art boosting algorithms for semi-supervised learning. 2.1 Introduction Semi-supervised classification combines the hidden structural information in the unlabeled examples with the explicit classification information of labeled examples to improve the classification performance. Many semi-supervised Ieaming algorithms have been studied in the literature. Examples are density based methods [30, 31], graph-based algorithms [32— 35], and boosting techniques [40, 48, 49]. Most of these methods are based on either manifold assumption [32—35] or cluster assumption [30, 48, 49]. Under the manifold as- sumption, the data is assumed to reside on a low dimensional manifold within the original high dimensional space and the class assignment of unlabeled examples can be derived from a classification function that lives in this low dimensional manifold. Under the cluster assumption, the examples of the same class tends to be closer to each other than those of different classes. As a result of this assumption, the decision boundary is expected to pass through the low density regions. Thus, a given semi-supervised Ieaming is usually speci- fied by a combination of two terms, with one term related to the classification error on the training examples and the other term related to how well the model satisfies the assumption (either manifold or cluster assumption). While most of semi-supervised classification approaches were originally designed for two class problems, many real-world applications, such as speech recognition and object recognition, require multi-class categorization. To adopt a binary (semi-supervised) Ieam- ing algorithm to problems with more than two classes, a common practice is to divide a 19 multi-class learning problem into a number of independent binary classification problems using techniques such as one-versus—the-rest, one-versuS-one, and error-correcting output coding [58]. The main shortcoming with these approaches is that the resulting binary clas- sification problems are independent. As a result, these approaches are unable to exploit the fact that each example can only be assigned to one class. This issue was already pointed out in the study of multi-class boosting [59]. In addition, since every binary classifier is trained independently, their Outputs may be on different scales, making it difficult to iden- tify the most likely class assignment based on the classification scores [60]. Though cali- bration techniques [61] can be used to alleviate this problem in supervised classification, it is rarely used in semi-supervised Ieaming due to the small number of labeled training ex- amples. Moreover, techniques like one-versus-the-rest, where the examples of one class are considered against the examples of all the other classes, could lead to the imbalanced clas- sification problem. Although a number of techniques have been proposed for supervised learning in multi-class problems [59, 62, 63], none of them addressed semi-supervised multi-class learning problems, which is the focus of this chapter. Given that the supervised multi-class classification is a well-studied subject, an im- portant research question is whether it is possible to develop a general semi-supervised framework that is able to improve the accuracy of a given supervised multi-class Ieaming algorithm by effectively exploring the abundance of unlabeled data. The immediate answer to this question is boosting technique. The objective of semi-supervised classification is to learn a hypothesis that makes minimum number of misclassification on the labeled exam- ples and utilizes the unlabeled data for a better generalization. Given a loss function for the labeled and unlabeled examples, a boosting algorithms can be defined by reweighting each instance based on the current value of the loss. One straightforward approach to define the loss for unlabeled examples is to consider the classification confidence as the loss for unlabeled instances. The difficulty comes from the fact that the classification confidence related to the unlabeled examples are unknown. 20 One approach to address this problem is to use the class labels predicted by the current model as the pseudo-labels for the unlabeled examples and utilize them to obtain the clas- sification confidence (or margin). Assemble [48], as described in Section 2.3.2, is con- strucuted based on the idea of pseodu-labels. The problem with utilizing pseudo-labels to compute the loss for unlabeled examples is that the pseudo-labels assigned in the early steps of the algorithm is not precise and can lead to undesireable result of the boosting algorithm. Particularly, this approach does not directly utilize the underlying properies of data described as a manifold or cluster assumption. Moreover, since all the existing semi- supervised boosting algorithms are designed for binary classification, they will still suffer from the aforementioned problems when applied to multi-class problems. To avoid the above problems, we design a boosting algorithm in this chapter by con- sidering a multi-class loss function that utilizes both the manifold and cluster assumption; i.e. it consists of two terms, one releated to the consistency of the predicted labels and similarity between the examples, and one related to the consistency between the predicted labels and the true labels of labeled examples. To minimize this loss function, we develop a semi-supervised boosting framework, termed Multi-Class Semi-Supervised Boosting (MC- SSB), that is designed for multi-class semi-supervised learning problems. By directly solv- ing a multi-class problem, we avoid the problems that arise when converting a multi-class classification problem into a number of binary ones. Moreover, unlike the existing senti- supervised boosting methods that only assign pseudo-labels to the unlabeled examples with high classification confidence, the proposed framework decides the pseudo labels for un- labeled examples based on both the classification confidence and the similarities among examples. It therefore effectively explores both the manifold assumption and the cluster- ing assumption for semi-supervised learning. Empirical study with UCI datasets shows the proposed algorithm performs better than the state-of—the-art algorithms for semi-supervised learning. 21 2.2 Related Work Most semi-supervised Ieaming algorithms can be classified into three categories: density based methods [30, 31], graph-based algorithms [32—35], and boosting techniques [40, 48, 49]. As mentined in Section 2.1, these methods are based on either cluster or manfold assumption, dependent on how they utilize the unlabeled examples. Denisty-based meth- ods are usually based on finding a decision boundary that passes through sparse regions and have the maximum margin to both labeled and unlabeled examples [30, 31, 48, 49]. Cluster-based learners utilize a similarity measure between examples and construct a graph to propagate the labeling information to the unlabeled instances [32—35]. Semi-supervised learning algorithms can be also categorized into inductive and trans- ductive learner based on their functionality. A semi-supervised learner is called trans- ductive if it does not produce a classifier and cannnot operate on the unseen exampels. Otherwise, it is called inductive. The algorithm we developed in this chapter works in the inductive mode. Semi-supervised SVMS (S3VMS) or Transductive SVMS (T SVMS) are the semi- supervised extensions to Support Vector Machines (SVM). They are essentially density- based methods and assume that decision boundaries should lie in the sparse regions. Un- like their name, TSVMS can work in inductive mode. Although finding an exact S3VM is NP-complete [64], there are many approximate solutions for it [30, 31, 65-67]. Ex- cept for [67], these methods are designed for binary semi-supervised Ieaming. The main drawback with [67] is its high computational cost due to the semi-definite programming formulation. Graph-based methods are usually transdactive learner that aims to predict the class labels that are smooth on the graph of unlabeled examples. These algorithms differ in how they define the smoothness of class labels over a graph. Example graph-based senti- supervised learning approaches include Mincut [32], Harmonic function [33], local and global consistency [34], and manifold regularization [35]. Similar to density based meth- 22 ods, most graph—based methods are mainly designed for binary classification. Semi-supervised boosting methods such as SSMBoost [68] and Assemble [48] are di- rect extensions of Adaboost [39]. In [49], a local smoothness regularizer is introduced to improve the reliability of semi-supervised boosting. Unlike the existing approaches for semi-supervised boosting that solve 2-class problems, we focused on semi-supervised boosting for multi-class classification. 2.3 Multi-Class Semi-supervised Learning 2.3.1 Problem Definition Let D = (.731, .., xN) denote the collection of N examples. Assume that the first N1 exam- ples are labeled by y1,..., le. Each y,- = (yz-1,..., yzm) E {0, +1}m is a binary vector that indicates the assignment of 2:,- to m different classes, where. yf = +1 when 2:,- is assigned to the kth class, and yf = 0, otherwise. Since we are dealing with a multi-class problem, we have 2254 yf = 1, i.e., each example 2:,- is assigned to one and only one class. We denote by g, = (3211,. . . ,y‘im) e Rm the predicted class labels (or confidence) for exam- ple mi, and by I? = (Qir,...,3)1-[,)T the predicted class labels for all the examples. Let S = [Sm-M,-x N be the similarity matrix where Sm- = SJ},- 2 0 is the similarity between 2:,- and xj. For the convenience of discussion, we set 3,3,; = 0' for any x,- G D, a convention that is commonly used by many graph-based approaches. Our goal is to compute 37,- for the unlabeled examples with the assistance of Similarity matrix S and Y = (y;- , . . . , 311-51)? 2.3.2 Assemble Algorithm Assemble [48], a boosting algorithm for semi-supervsed classification as depicted in Al- gorithm 2, is construcuted based on the idea of pseodu-labels. At each boosting iteration, xT is the transpose of matrix(vector) 3:. 23 Algorithm 2 Assemble: Adaptive Semi-Supervised Ensemble Algorithm 1: Input: 0 D = (x1, .., xN): The set of examples; the first N, examples are labeled. 0 s: The number of sampled examples 2: Initialize F(i) = 0,i = 1, ..., |D| 3: Initialize w1(z') = l/Nl,’i = 1, ...,Nl and w1(i) = 0,1 = N1 + 1, ..., IDI 4: repeat Set y,- = F($i),i = N1 + 1, ..., IDI Find a multi—class classifier ht that minimizes 6t = 2:131 wt (2)] (y,- 75 ft(a:,-)) Compute at = $11K?) Compute F(a:,-) = F(a:,-) + (112(2),), 2': 1, .., |D| wt(i) exp(atz{(yi#ft (1%)» normalization factor and [(12) outputs 1 if a: is true, and 0 otherwise. 10: until reach the maximum number of iterations 999$??? Compute the new weighting wt+1(i) = where Zt is the the boosting algorithm creates a new classifier and redisributes the weights by emphasizing more on the less-confident instances. Beside Assemble, several other boosting algorithms have been proposed for senti- supervised Ieaming based on the idea of using pseudo-labels [49, 68]. They essentially operate like self-training where the class labels of unlabeled examples are updated itera- tively: a classifier trained by a small number of labeled examples is initially used to predict the pseudo-labels for unlabeled examples; a new classifier is then trained by both labeled and pseudo-labeled examples; the processes of training classifiers and predicting pseudo— labels are altered iteratively till stopping criterion is reached. The main drawback with this approach is that it relies solely on the pseudo-labels predicted by the classifiers learned so far when generating new classifiers. Given the possibility that pseudo-labels predicted in the first few steps of boosting could be inaccurate, the resulting new classifiers may also be unreliable. This problem was addressed in [49] by the introduction of a local smooth- ness regularizer. However, these approaches do not utilize the underlying properies of data described as a manifold or cluster assumption. In what follows, we design a boosting algo- rithm for the problem of multi-class semi-supervised classification based on manifold and cluster assumpption. 24 2.3.3 Design of Objective Function The goal of semi-supervised Ieaming is to combine labeled and unlabeled examples to improve the classification performance. Therefore, we design an objective function that consists of two terms: (a) Fu that measures the inconsistency between the predicted class labels 17 of unlabeled examples and the similarity matrix S, and (b) F, that measures the inconsistency between the predicted class labels I’ and true labels Y. Below we discuss these two terms in detail. Given two examples 3:,- and xj, we first define the similarity ngj based on their pre- dicted confidence score 3),- and 373-: m “l9 exp 3}]? m Z3]- : Z: mexp(yz )Ak, m ( JlAk’ = Z b15719? = bTb- (2.1) where bf = exp(37£°)/ (273:1 exp(3)z’-°’)) and b,- = (b}, . . . , bl”). Note that bf can be inter- preted as the probability of assigning cc,- to class k, and Zz’fj, the cosine similarity between b,- and bj, can be interpreted as the probability of assigning 3:,- and :cj to the same class. We emphasize it is important to use bf, instead of exp(§/f), for computing Z3]. because the normalization in bf allows us to enforce the requirement that each example is assigned to a Single class, a key feature of multi-class learning. Let Z“ = [2sz be the similarity matrix based on the predicted labels. To measure the inconsistency between this similarity and the similarity matrix S, we define Fu as the distance between the matrices Z“ and 3 using the Bregman matrix divergence [69], i.e., F. = 90(2“) — MS) — tr((Z’“ — S)TV R is a convex matrix function. By choosing 90(X) = 25 293:1 Xi,j(10g Xi,j — 1) [69], Pa is written as N S,- u = Z (Sijlog-Z——:’: +Zuj— —S,-,j) (2.3) i,j=1 By assuming that 229% Zl‘j z 2211 N]? and log a: z a: — 1, where Nk is the number of examples assigned to class 1:, we simplify the above expression as Fu~ ~ 2, j— _1 82-2 ,j/Zg‘ Since S 2 Jcould be viewed as a general Similarity measurement, we replace 32-2, j with Sid and simplify Fu as N 5,, N s,- , Fu z ”2:1 ~55 = ”:21 2k: —-—1———jbkbk (2.4) Remark I We did not use _ 0; w,- is a measure of the failure of the algorithm on example 50,-. Using the new weighting on the training examples, MCSSB learns a multi- class model that minimizes the loss on the weighted training examples, by adopting the sampling approach as described in the following: MCSSB samples 3 instances by replace- ment, with probability of each sample proportional to its weight. 3 sampled instances are passed to the weak learner to obtain a multi-class hypothesis. In our experiments, the num- ber of sampled examples at each iteration is set as s = max(20, N / 5). After creating a weak classifier at this round, MCSSB adds it to the current classifiers to reduce the value of the objective function. For the experments, we ran the algorithm with different numbers of iterations and find that both the objective function and the classification accuracy remains essentially the same after 50 iterations. We, therefore, set the number of iterations to be 50 to save the compu- tational cost. 30 Algorithm 3 MCSSB: Multi-Class Semi-Supervised Boosting Algorithm 1: Input: 0 D: The set of examples; the first N1 examples are labeled. 0 s: the number of sampled examples from (N — N1) unlabeled examples 0 T: the maximum number of iterations 2: Set F(z') = 0,2' = 1, .., |D| 3: repeat 4: Compute of and fill“ for every example as given in Equation 2.12. 5: Assign each unlabeled example 2:, to class k; = arg minjc(oz£c + 0511“) and weight k’I‘ k’l‘ 6: Sample .9 examples using a distribution that is proportional to w,- . Train a multi-class classifier h(:r) using the 3 samples examples 8: Predict hf for all examples using h(:1:), and compute or using Equation 4.14. Exit the loop if or g 0. 8: H(:1:) (— H(:z:) + ah(:r) 9: until reach the maximum number of iterations Theorem 4 Shows that the proposed boosting algorithm reduces the objective function F exponentially. The proof of this theorem is provided in Appendix A.3. Theorem 4. The objective function after T iterations, denoted by F T, is bounded as fol- lows: T (,/At +oAt— ,/Bt +013“)2 FT 3 Foexp —Z " [Ft—1 “ l (2.21) t=1 where Au, A), Bu and B1 are defined in Lemma 2. 2.4 Experiments In this section, we present our empirical study on the classification data sets that were described in Chapter 7. We refer to the proposed semi-supervised multi-class boosting algorithm as MCSSB. In this study, we aim to Show that (1) MCSSB can improve the per- formance of a given multi-class classifier with unlabeled examples, (2) MCSSB is more effective than the existing semi-supervised boosting algorithms, and (3) MCSSB is robust 31 to the model parameters and the number of labeled examples. It is important to note that it is not our intention to show that the proposed senri-supervised multi-class boosting algorithm always outperforms other semi-supervised Ieaming algorithms. Instead, our objective is to demonstrate that the proposed semi-supervised boosting algorithm is able to effectively improve the accuracy of different supervised multi-class learning algorithms using the un- labeled examples. Hence, the empirical study is focused on a comparison with the existing semi-supervised boosting algorithms, rather than a wide range of semi-supervised learning algorithms. 2.4.1 Experimental Setup For each classification data sets, described Section 1.6.1, we split the examples into 5 partitions, with one partition used for training and the others used for testing. In each ex- periment, we used a small percentage (between 2 to 10 percent) of training instances as labeled examples and the remainding instances as unlabeled examples. We applied the pro- posed algorithms and the baselines on the training examples to create a model and applied it on the test examples and computed the accuracy on the test examples. We repeated each experiment 10 times and reported the average. We compare the proposed semi-supervised boosting algorithm to ASSEMBLE, a state- of-the—art semi-supervised boosting algorithm. The main reason for this choice was be- cause Assemble utlizes boosting technique and can exploit an existing supervised learning technique. This makes the comparision fair and easy because it enables us to compare MC- SSB and Assemble with base classifieres that have different quality. Also notice that As- semble is a powerful semi—supervised Ieaming technique that was the best semi-supervised algorithm among 34 participants in NIPSS2001 workshop competition "Unlabeled Data for Supervised Learning" [48]. Unlike the general setup introduced in 1.6.1, we used the test set for for mnist data set because of the huge size of the training set in mnist and the memory problem. 32 A Gaussian kernel is used as the measure for similarity in the standard MCSSB algotihm with kernel width set to be 15% of the range of the distance between examples for all the experiments, as suggested in [70]. To verify the importance of using the Similarity measure in the semi-supervised boosting algorithm and direct formulation of multi-class problem, we use two other baselines: MCSSB-Uniform that uses similar similarity values for every pair of examples (i.e. Sij = 1, 2', j = 1, .., N) that can be considered MCSSB with a bad similarity measure, and MCSSB-Absolute that considers absolute similairy between an example and itself (i.e. Si,- = 1,2' = 1, .., N) and absolute dissimilarity between two different examples (i.e. Sij = 1, i, j = 1, .., N & i 75 j). MCSSB-Absolute can be considered MCSSB that only exploits the advantage of using a direct formulation of the multi-class problem. We use decision tree with only two level of nodes, as the base classifier for all the methods in the standard setting . The combination paremeter C is set to 104 in all experi- ments. To study the robustness of the proposed methods, we further investigate the effect of the depth Of decision tree and combination parameter C on the performance of different methods in Sections 2.4.4 and 2.4.3 respectively. 2.4.2 Evaluation of Classification Performance Figure 2.1 shows the result of different algorithms when the amount of labeled examples is changed from 2% to 10%. First, notice that MCSSB significantly improves the accuracy of decision tree for 5 out of 7 data sets. For data set ’Nursery’, MCSSB performs worse than the base classifier and for data set ’Letter’, the result of MCSSB is not much different than the base clasifier. However, for both these cases, MCSSB-Absolute performs quite good that indicates the direct formulation of multi-class problem is useful and the bad i.e. 0.15 x (dmax — dmin)’ where dmin and dmax are minimum and maximum distance between examples Notice we also used neural network as another base classifier to evaluate the performance of our algo- rithm. Refer to [50] for the results on several benchmark datasets 33 performance is due to the utilization of a bad similartiy matrix. Note that for several data sets, the improvement made by the MCSSB is dramatic. For instance, the classification accuracy of decision tree is improved from 33% to 48% for data set ’Pendigits’, and from 24% to 43% for data set ’Optdigits’ when there is 2% labeled examples; the classification accuracy of decision tree is improved from 13% to 17% for data set ’Isolet’, and from 46% to 49% for data set ’Protein’ when there is 8% labeled examples. Second, when compared to ASSEMBLE, we found that the proposed algorithm sig- nificantly outperforms ASSEMBLE for all the data sets. More interestingly, Assemble reduces the performance of the base classifier for most data sets that indicates the usage Of pseodu-labelss can produce misleading results. The key differences between MCSSB and ASSEMBLE is that MCSSB is not only specially designed for multi-class classification, it does not solely rely on the pseudo-labels obtained in the iterations of boosting algorithm. Thus, the success of MCSSB indicates the importance of designing semi-supervised learn- ing algorithms for multi-class problems. Third, to verify that the outstanding performance of MCSSB is related to the direct formulation of multi-class problem and the usage of similarity measure in the boosting algorithm, we examine the results of MCSSB-Uniform and MCSSB-Absolute. Because MCSSB-Uniform does not utilize an appropriate similarity measure, it performs very poorry that emphasizes our effective approach in utilizing the similarity measure in the boosting algorithm. On the other hand, MCSSB-Absolute is the second best method after MCSSB. Because MCSSB-Absolute does not utilize any similary measure among exam- ples, we believe that this superior performance is due to our approach in direct formulation of multi-class problem. It is interesting to note that the performance of MCSSB-Absolute on the ’Nursery’ and ’Letter’ data sets is better than other methods including MCSSB that indicated the sensitivity of the proposed method to the choice of similarity method. And finally, notice that as the number of labeled examples increases, the performance of different methods improves. However MCSSB keeps its superiority for most of the cases 34 MNIST NURSERY 50 —Decision Stump 100 . +Assemble 40 +MCSSB 5. +MCSSB_Uniform 5 g -e- MCSSB_Absolute g 0 < £— 2 20 1 A w . 1 . 2 0764 0.06 0.08 0.1 .02 0.04 0.06 0.03 0.1 Percentage of labeled examples Percentage of labeled examples LETTER PROTEIN Accuracy 0.06 0.08 0.1 .02 0.04 Percentage of labeled examples PENDIGITS 0.04 0.08 Percentage of labeled examples 0.06 0.1 OPTDIGITS 004 0.06 0.08 0.1 Percentage of labeled examples ISOLET 0.04 0.06 0.08 Percentage of labeled examples V 0.1 f A A V V A V V 06.02 0.04 0.06 0.08 0.1 Percentage of labeled examples Figure 2.1: The error rates Of different methods with different amount of labeled examples. 35 when compared to both the base classifier and the ASSEMBLE algorithm. We also observe that overall ASSEMBLE is unable to make improvement over the base classifier regardless of the number of labeled examples. These results indicate the challenge in developing boosting algorithms for semi-supervised multi-class Ieaming. Compared to ASSEMBLE that relies on the classification confidence to decide the pseudo labels for unlabeled ex- amples, MCSSB is more reliable since it exploits both the classification confidence and similarities among examples when determining the pseudo labels. 2.4.3 Sensitivity to the Combination Parameter C Figure 2.2 shows the performance of MCSSB when the combination parameter C changes from 1 to 1010. It is clear that for large values of C, MCSSB is very stable. Notice the improvement of MCSSB on the base classifier for dataset ’Protein’ is very marginal for some values of C. However if you look at Figure 2.1, you will notice that the result of MCSSB for larger amount of labeled data (as large as 4%) is significant for this data set and not sensitive to the small changes of parameters C. We conclude that MCSSB is very robust to the choice of parameter C. 2.4.4 Sensitivity to Base Classifier In this section, we focus on examining the sensitivity of MCSSB to the complexity of base classifiers. This will allow us to understand the behavior of the proposed semi-supervised boosting algorithm for both weak classifiers and strong classifiers. To this end, we use de- cision tree with varying number of levels as the base classifier. We used decision tree with only one node (decision stump) up to fully-grown decision trees and plot the performance result of different methods. Figure 2.3 shows the classification accuracy of Tree, ASSEM- BLE and MCSSB when we vary the number of levels in decision tree. Notice that in each case, the maximum number of levels in the plot for each data set is set to the fully grown tree for that data set. It is not surprising that overall the classification accuracy is improved 36 MNIST NURSERY 40 . 100 . 5" 30' //W—. 5" BMW (0 (U 5 6 0 O < ZWZG—EP-a—a—e—a—e—e—HI <2 60W 10 r 40 a 10° 105 1010 10° 105 1010 C C PROTEIN LETTER 45 W 15 ' 640- ' : 5:10— a a 8 8 2 35' : 2 5: —Decision Stump ‘3 a a E a a E E' a a 5' +Assemble 3O 0 +MCSSL 10° 105 10‘° 10° 105 101° C C PENDIGITS OPTDIGITS 60 . 5o . 5‘40- M ,>,~ a a < 2 < O . 100 185 1010 ISOLET 20 ' § < 0 . 10° 105 10’° C Figure 2.2: The error rates of MCSSB with different C( 2% of labeled). 37 with increasing number of levels in decision tree for most data sets. We also Observe that MCSSB is more effective than ASSEMBLE for decision trees with different complexity and regardless of quality of the base classifier, ASSEBLE is not able to improve the per- formance of the supervised classifier by utilizing unlabeled examples. Notice that for some data sets, e.g. ’Protein’ data set, the performance decreases as the depth of tree increases. This is because, unlike other data sets, ’Protein’ has only tinee classes and large tree can lead to overfitting. 38 MNIST ——Tree 2 4 Depth of the Tree PROTEIN NURSERY 80 - 70- 6 60 . E 1 8 l 0 50 < —Tree 40 +Assemble + MCSSB 3C0 1 2 3 Depth of the Tree LETTER 25 . - --Tree 20 +Assemble > 45, +MCSSB o a, . E 1 E 6 61 < 40 --Tree < +Assemble +MCSSB 35o 2 I: 6 0o i 2 6 4 5 Depth of the Tree Depth of the Tree PENDIGITS OPTDIGITS 60 . - 50 r -—Tree 50’ 4o» +Assemble +MCSSB 64° 63 . E E a 30- :3 8 8 2 . < 20 —Tree < 1 +Assemble 1 +MCSSB G 1 A r r G 0 1 2 3 4 5 0 1 3 Depth of the Tree Depth of the Tree ISOLET 20 - . —Tree 15 +Assemble f +MCSSB 5‘ E 10 3 8 < 5 . 1 2 Depth of the Tree 1:“igure 2.3: The error rates of MCSSB with decision tree with different depth as the weak learner. 2% of training examples are labeled in all the experiments. 39 Chapter 3 Optimizing NDCG Measure by Boosting Learning to rank is a relatively new field in machine Ieaming. It aims to learn a ranking function from training examples with relevancy judgements. The learning to rank algo- rithms are often evaluated using information retrieval measures, such as Normalized Dis- counted Cumulative Gain (NDCG) [14] and Mean Average Precision (MAP) [13]. Until recently, most learning to rank algorithms were not able to directly optimize a loss function related to the IR evaluation measures, such as NDCG and MAP. The main difficulty in di- rect optimization of these measures is that they are non-continuous and non-differentiable. In this chapter, we discuss how boosting can be applied to optimize Normalized Discounted Cumulative Gain (NDCG) which is the most commonly used multi-level evaluation mea- Sure for Ieaming to rank. We start with a detailed description of AdaRank [42], one of the first algorithms designed to directly maximize IR measures. We further develop a learning t0 rank algorithm, termed NDCG_Boost, for optimizing NDCG metric. Unlike AdaRank that weights all the documents related to each query equally when optimizing the NDCG measure, NDCG_Boost weights individual documents differently even if they are all re- lated to the same query, leading to more effective Optimization of the NDCG measure. In Order to deal with the non-smooth nature of the NDCG measure, in the NDCG_Boost al- gOIithm, we propose to optimize the expectation of NDCG over the distribution induced 40 by a ranking function. We then present a relaxation strategy that approximates the average of NDCG value, and an optimization strategy to make the computation efficient. Extensive experiments Show that the proposed algorithm outperforms state-of-the-art ranking algo- rithms on several benchmark data sets. 3.1 Introduction Learning to rank has attracted many machine Ieaming researchers in the last decade because of its growing importance in the areas like information retrieval (IR) and recommender systems. Three types of learning to rank algorithms can be found in the literature. 0 Pointwise approaches: AS the simplest form, these approaches [15, 16] treat rank- ing as a classification or regression problem that learns a ranking function in order to fit the relevance judgments for given retrieved documents [16, 17]. However classi- fication and regression may not be the best for the task of ranking. This is because (i) classification problems are usually associated with unordered class labels where there is an intrinsic order among the levels of relevance judgments provided by the user, and (ii) the target variables in regression problems are assumed to be numerical values while the relevance judgments are only ordinary variables. 0 Pairwise approaches: These approaches are motivated by the fact that the rele- vancy scores in ranking are relative to each other. This group considers the pairs of documents as independent variables and learns a classification (regression) model to correctly order the training pairs [18—23], namely document da is ranked above db if the relevance score of da is larger than db. One major problem with the pairwise ap- proaches is that they assume pairs of documents are independent random variables, which is often violated in real world applications. . 41 o Listwise approaches: The listwise approaches are motivated by this observation that most evaluation metrics of information retrieval measure the ranking quality for indi- vidual queries, not documents. These approaches treat the ranking list of documents for every query as a training instance [13, 24—29], either by direct optimization of an information retrieval evaluation measure [13, 25, 28, 29] or by optimizing a listwise loss function [24, 26, 27]. Empirical studies have shown that the listwise approaches are more effective than both pointwise and listwise approaches because they utilize the query-document group structure which is a unique and useful characteristic in ranking. The main difficulty in optimizing the listwise loss functions is that they are non- continuous and non-differentiable. This is because these loss functions measure the re- trieval performance based on the ranking list of documents induced by the ranking function, and therefore their dependence on ranking functions is implicit. Given that classification is a well-studied subject in machine Ieaming, the research question is whether it is pos- sible to design a boosting algorithm that utilizes a classification algorithm to optimize an information retrieval measure such as NDCG. The easiest way to design such a boosting algorithm is the approach taken by Xu et al. in the design of AdaRank [42]. In each trial of a boosting algorithm, AdaRank re-weights the queries based on their NDCG values (com- Pared to AdaBoost that re-weights the examples based on their confidence in prediction). As we see in more details in Section 3.3.2, AdaRank treats all the documents related to each query equally when trying to improve the NDCG metric, which could significantly liInits the choice of ranking functions for optimizing the NDCG metric. In this chapter, we introduce a better boosting algorithm for optimizing NDCG metric that weights documents differently even if they are associated with the same query. In each iteration, the boosting algorithm provides a weighting as well as binary class assignments for given documents; the weak learner constructs a binary classifier from the weighted documents that are labeled \ It is important to distinguish the binary class assignment from the relevance judgments for documents 42 by the boosting algorithm. 3.2 Related Work We focus on reviewing the listwise approaches that are closely related to the theme of this chapter. The listwise approaches can be classified into two categories. The first group of approaches directly optimizes the IR evaluation metrics. Most IR evaluation metrics, however, depend on the sorted order of documents, and are non-convex in the target rank- ing function. To avoid the computational difficulty, these approaches either approximate the metrics with some convex functions or deploy methods (e.g., genetic algorithm [71]) for non-convex optimization. In [25], the authors introduced LambdaRank that addresses the difficulty in optimizing IR metrics by defining a virtual gradient on each document af- ter the sorting. While [25] provided a simple test to determine if there exists an implicit cost function for the virtual gradient, theoretical justification for the relation between the implicit cost function and the IR evaluation metric is incomplete. This may partially ex- plain why LambdaRank performs very poor when compared to MCRank [16], a simple adjustment of classification for ranking (a pointwise approach). The authors of MCRank paper even claimed that a boosting model for regression produces better results than Lamb- daRank. Volkovs and Zemel [29] proposed optimizing the expectation of IR measures to Overcome the sorting problem, similar to the approach taken in this paper. However they use monte carlo sampling to address the intractable task of computing the expectation in the permutation space which could be a bad approximation for the queries with large num- ber of documents. AdaRank [42], as was described earlier in this chapter, uses boosting to Optimize NDCG, similar to our optimization strategy. However they deploy heuristics to elubed the IR evaluation metrics in computing the weights of queries and the importance of weak tankers; i.e. it uses NDCG value of each query in the current iteration as the Weight for that query in constructing the weak ranker (the documents of each query have 43 similar weight). This is unlike our approach that the contribution of each single document to the final NDCG score is considered. Moreover, unlike our method, the convergence of AdaRank is conditional and not guaranteed. Sun et al. [72] reduced the ranking, as mea- sured by NDCG, to pairwise classification and applied alternating optimization strategy to address the sorting problem by fixing the rank position in getting the derivative. SVM- MAP [13] relaxes the MAP metric by incorporating it into the constrains of SVM. Since SVM-MAP is designed to optimize MAP, it only considers the binary relevancy and cannot be applied to the data sets that have more than two levels of relevance judgements. The second group of listwise algorithms defines a listwise loss function as an indirect way to optimize the IR evaluation metrics. RankCosine [24] uses cosine similarity between the ranking list and the ground truth as a query level loss function. ListNet [26] adopts the KL divergence for loss function by defining a probabilistic distribution in the space of permutation for Ieaming to rank. FRank [22] uses a new loss function called fidelity loss on the probability framework introduced in ListNet. ListMLE [27] employs the likelihood loss as the surrogate for the IR evaluation metrics. The main problem with this group of approaches is that the connection between the listwise loss function and the targeted IR evaluation metric is unclear, and therefore optimizing the listwise loss function may not necessme result in the optimization of the IR metrics. 3.3 Optimizing NDCG Measure 3.3.1 Notation Assume that we have a collection of n queries for training, denoted by Q = {q1, . . . ,qn}. F0r each query qk, we have a collection of mk documents Dk = {dz-“,7: = 1, . . . ,mk}, Whose relevance to qk is given by a vector rk = (rf, . . . ,rfnk) E ka. We denote by F (d, q) the ranking function that takes a document-query pair (d, q) and outputs a real number score, and by jg“ the rank of document (if within the collection ’Dk for query qk. 44 The NDCG value for ranking function F (d, q) is then computed as following: mk 2i—1 won 12;;— (3.1) log( (1 + 32') where Z k is the normalization factor [14]. NDCG is usually truncated at a particular rank level (e.g. the first 10 retrieved documents) to emphasize the importance of the first re- trieved documents. 3.3.2 AdaRank Algorithm The easiest way to design a boosting algorithm for Optimizing a given IR evaluation mea- sure is what AdaRank algorithm [42] performs. AdaRank uses an exponential loss function similar to AdaBoost. However, unlike the loss function of AdaBoost which is constructed based on the classification margin, AdaRank utilizes information retrieval measures such as NDCG to construct the exponential loss. To optimize NDCG, for example, AdaRank uses the following exponential loss function: Zexp(-£(qr, F )) k=1 Where £(qk, F) is the NDCG value for query h when ranking the documents for query qk by function F. The steps of AdaRank are given in Algorithm 4. In each iteration, AdaRank fiI‘lds a weak tanker ft that maximizes quantity m at Step 4, i.e. NDCG weighted by p. Then, it computes the combination weight for ft and adds it to the current set of classifiers in Steps 5 and 6 respectively. The authors of AdaRank paper [42] suggest using the ranking features (e.g. BM25) as the weak ranker. However, a (multi-class) classifier can also be uSed as the weak tanker. To construct a classifier that maximizes qt, AdaRank distributes the weight 12,500) to all documents of query k equally, and constructs a classifier based on the documents that are sampled according to the weights. To redistributes the weights to 45 Algorithm 4 AdaRank Algorithm 1: Input: 0 Q = {q1, . . . ,q"}: The set of queries 0 Dk = {(df,rf),i = 1,...,mk}: The set of documents and their relevancy scores for query qk. 2: Initialize p1(qk) = 1/n,k = 1, ...,n 3: repeat Find ft by maximizing weighted NDCG; i.e. 77t = 2:le pt(qk)£(qk, F) Compute at = 211101233?) 4 5 6: Compute F(df) = 2L1 alfl(df), k = 1, ..,n, 2': 1, ..,mk - - _ exp(-£(Qk,F)) 7 Compute the new werghtrng pt+1(qk) — 22:1 exp(—£(Qk,F)) 8: until reach the maximum number of iterations instanced, AdaRank increases the weights of difficult queries (e.g. those that have small NDCG) and decreases the weights of easy queries (e.g. those that have large NDCG) at Step 7. As it is Obvious from the Steps of AdaRank algorithm, it gives the same weights to the documents of each query, leading to a suboptimal performance. However, since a pointwise weak learner (multi-class classifier) is often utilized in a boosting algorithm to maximize NDCG, it is advantageous to allow every document to contribute differently to the final NDCG value. Moreover, although NDCG works in query level, not all documents have Similar contribution in improving the NDCG value at each stage of the algorithm. These Observations motivated us to develop NDCG_Boost algorithm that considers the contri- bution of every single document in the iterations of the boosting algorithm to maximize NDCG. 3-3.3 A Probabilistic Framework C)ne of the main challenges faced by optimizing the NDCG metric defined in Equation (3- 1) is that the dependence of document ranks (i.e., jf) on the ranking function F(d, q) is not explicitly expressed, which makes it computationally challenging. To address this Problem, we consider the expectation of £( Q, F) over all the possible rankings induced by 46 the ranking function F(d, q), i e £(Q F) — <1L-1_> (3 2) ’ k=1 k log(1+j,-'°) F 2"”IC -1 :2 E P” 'F")log(1+vr'~(>) "1 22‘ all-4 MS I'M: fill-d where Smk stands for the group of permutations of m k documents, and 7rk is an instance of permutation (or ranking). Notation 7rk(z') stands for the rank position of the ith document by Wk - To this end, we first utilize the result in the following lemma to approximate the expectation of 1/ log(1 + 7rk(z')) by the expectation of 7r,“(i). Lemma 3. For any distribution Pr(7r|F , q), the inequality C(Q,F) _>_ 7:1(Q,F) holds where 2 i — k1 ’H(Q, F): i:— HZ]: 1}: 1(110g ’°(i))p ) (3.3) ProOfi The proof follows from the fact that (a) 1 / :1: is a convex function when :1: > 0 and therefore (1 / log(1 + 2)) 2 1/(log(1 + 2)); (b) log(1 + :c) is a concave function, and therefore (log(1 + x)) S log(1 + (27)). Combining these two factors together, we have the reSUIt stated in the lemma. [:1 Given H(Q, F) provides a lower bound for [3(Q, F), in order to maximize [2(6), F), we Could alternatively maximize 77(6), F), which is substantially simpler than £(Q, F). In the next step of simplification, we rewrite 7rk(i) as 74(2) = 1 + 2 104(2) > «kg» (3.4) 47 where I (x) outputs 1 when x is true and zero otherwise. Hence, (nk(i)) is written as mk mic Me» = 1+ Zuwka) > «km» -—— 1+ 2 12mm) > «W» (3.5) j=1 j=1 As a result, to optimize ’FMQ, F), we only need to define Pr(7rk(i) > wk(j)), i.e., the marginal distribution for document d? to be ranked before document dz? . In the next section, we will discuss how to define a probability model for Pr(7rk|F, qk), and derive pairwise ranking probability Pr(7rk(z') > 7rk(j)) from distribution Pr(7rk|F, qk). 3.3.4 Objective Function We model Pr(7rk|F, qk) as follows mk 1 131-(«km f) = k exp 2 Z (F(d§,qk) — F(d§,q")) Z(F’q ) i=1j-«k(j)>vrk(z') mic k - k k = Z(F, qk) exp (;(mk — 27r (Z)+1)F(dz-,q )) (3.6) where Z (F, qk ) is the partition function that ensures the sum of probability is one. Equa- tion (3.6) models each pair (df, (if) of the ranking list 7r’c by the factor exp(F(df, qk) — F(d§ , qk)) if dz? is ranked before d? (i.e., nk(d£°) < 7rk(d$-°)) and vice versa. This mod- cling choice is consistent with the idea of ranking the documents with largest scores first; intuitively, the more documents in a permutation are in the decreasing order of score, the bigger the probability of the permutation is. Using Equation (3.6) for Pr(1rk|F, qk), we have 'H(Q, F) expressed in terms of ranking function F. By maximizing 72(Q, F) over F . we Could find the optimal solution for ranking function F. AS indicated by Equation (3.5), we only need to compute the marginal distribution Pr(7“k(i) > nk(j)). To approximate Pr(1rk(i) > 7rk(j)), we divide the group of permu- tation Smk into two sets: 050,1.) = {Wklflkm > “ICU” and G’b“(i,j) = {Wklwkm < 48 7rk (j ) }. Notice that there is a one-to-one mapping between these two sets; namely for any ranking wk 6 05(22, j), we could create a corresponding ranking 7rk 6 G§(z', j) by switch- ing the rankings of document df and d? and vice versa. The following lemma allows us to bound the marginal distribution Pr(7rk(i) > 7rk(j)). The proof of this lemma is provided in Appendix A.5. Lemma 4. IfF(d’-“, qk) > F(dg-c, qk), we have 2 1 1+ exp [2(F — F(dga gm] Prokm > m» s (3.7) This lemma indicates that we could approximate Pr(7rk(i) > nk( j )) by a simple logis- tic model. The idea of using logistic model for Pr(7rk (2') > wk(j)) is not new in learning to rank [20, 22]; however it has been taken for granted and no justification has been pro- vided in using it for Ieaming to rank. Using the logistic model approximation introduced in Lemma 4, we now have (14%)) written as m k 1 #1 1+ exp [2(F(d§,qk> — F>] 1+ (3.8) 22 We» To simplify our notation, we define Fik = 2F (dz-c, qk), and rewrite the above expression as mic mic k' - r7rkz' nk' z 1 >—1+§P< ()> (3)) 1+j2=311+exp(Fik_FJk) Using the above approximation for (wk(i)), we have ”R in Equation (3.3) written as n 7.21: _ W2, F) m if, 71- 2 ——2—1—. (3.9) 49 where mk 10'7”) 3.10 AE;1+eXp(Fk—Ff) ( ) We define the following proposition to further simplify the objective function: Proposition 1. 1 > 1 __ Af log(2 + A?) _ 108(2) 2[log(2)]2 The proof is due to the Taylor expansion of convex function 1/log(2 + x), a: > —1 around a: = 0 noting that A? > 0 (the proof of convexity of 1/log(1 + x) is given in Lemma 3) and is provided in Appendix A.6. By plugging the result of this proposition to the objective function in Equation (3.9), the new objective is to minimize the following quantity: M(Q, F) 2:712—1162Q’k 2' —1)A (3.11) The Objective function in Equation (3.11) is explicitly related to F via term Af. In the next section, we aim to derive an algorithm that learns an effective ranking function by effiCiently minimizing M. It is also important to note that although M is no longer a Iigol‘ous lower bound for the original objective function 5, our empirical study shows that this approximation is very effective in identifying the appropriate ranking function from the training data. 3‘3-5 Algorithm To minimize M(Q, F) in Equation (3.11), we employ the boosting strategy [38] that iter- ativel y updates the solution for F. Let Fik denote the value obtained so far for document 50 (if. T 0 improve NDCG, following the idea of Adaboost, we restrict the new ranking value for document (if, denoted by if, is updated as to the following form: ~ I: k k 2 where a > O is the combination weight and fz-k = f (dz-fiqk) E {0,1} is a binary value. Note that in the above, we assume the ranking function F (d, q) is updated iteratively with an addition of binary classification function f (d, g), which leads to efficient computation as well as effective exploitation of the existing algorithms for data classification. . To construct a lower bound for M (Q, F), we first handle the expression [1+exp(Fz-k —ij)] —1, summarized by the following proposition. Proposition 2. 1 1 k k k ~ ~ 3 + . - ex a - — . _1 3.13 1 + exp(Fik — Fjl“) 1 + eXP(F2-k _ Ff) 72.][ p( (f; fz )) j ( ) where ex PIC—F’-It p( ' 9) (3.14) (1+ exp(Fz-k — F;‘))2 k _ 7232' — The proof of this proposition can be found in Appendix A.4. This proposition separates the term related to Pi"c from that related to off in Equation (3.11), and shows how the new Weak ranker (i.e., the binary classification function f (d, q)) will affect the current ranki ng function F (d, q). Using the above proposition, we can derive the upper bound for M (Theorem 5) as well as a closed form solution for a given the solution for F (Theorem 6). Theorem 5. Given the solution for binary classifier fz-d, the optimal a that minimizes the 51 objective function in Equation (3.1!) is rk m _ 1 Zk=1Z.-,jk=1—z—2i9£Cj191(f" fik) where 0:9]- 2723-10 79 2). Remark: Notice that in order to have this boosting algorithm continue the iterations, the weak learner needs to produce models better than random guessing in the following sense. Writing a in the following form a = llog(1—6) (3.16) where k 27:- 123112—3510: I(ff>fk) 5: ZZ=1E$"=1%ll-9,’f(1(f">>f,-’“)+I(ff {0,1} that maximizes the following quantity nmk Z Z lwticlfldbyf (3.18) 1:21 i=1 7: Predict f,- for all documents in {Dk,i = 1, . . . , n} 8: Compute the combination weight a as provided in Equation (3.15). 9: Update the ranking function as Ff +— Ff + eff. 10: until reach the maximum number of iterations Algorithm 5 summarizes the boosting algorithm in minimizing the objective function in Equation (3.11). In each iteration, it computes 62- for every pair of documents of query 1:. 19k can be considered a measure of how close the rank position of documents (1" and k k . . . dj are when they are sorted by function F. The algorithms computes wi , a weight for each document, which is the summary information of document dz? when its position and relevancy score compared to all other documents of the same query. wf can be positive or negative. A positive wf indicates that the ranking position of df induced by the current ranking function F is less than its true rank position, and a negative weight wz’? shows that ranking position of (if induced by the current F is greater than its true rank position. The magnitude of wf shows how much the corresponding document is misplaced in the ranking. In other words, it shows the importance of correct ranking position of document (If in terms of the value of NDCG. Using these information, the algorithm finds out the most difficult Notice that we use F (df) instead of F ((1? , qk ) to simplify the notation in the algorithm. 53 documents and the relevancy direction of their importance at the current iteration. Using these information, NDCG_Boost maximizes fit as given by Equation (3.18) which can be considered as some sort of classification accuracy. It uses sampling strategy in order to maximize m because most binary classifiers do not support the weighted training set; that is, it first samples the documents according to wa | and then constructs a binary classifier with the sampled documents. After learning the new binary model at Step 6, the algorithm evaluates its success in improving the value of NDCG in Step 7 and 8 and adds it to the current set of binary models (the mixed strategy over binary models) at Step 9. The following theorem shows that the proposed boosting algorithm reduces the objec- tive function M exponentially. Theorem 7. The objective fimction after T iterations, denoted by MT, is bounded as follows: NEW f)? Mt— 1 where (11 and a2 are defined as follows. 251' r]? n mk _ n ml; 2 =222 ’f,I(f,’-° 5.13.19) The proof is provided in Appendix A.8. 3.4 Experiments To study the performance of NDCG_Boost we use the latest version (version 3.0) of LETOR package provided by Microsoft Research Asia [56], which has been described in Chapter 1. Besides a number of benchmark data data, LETOR package also includes multiple state-of-the-art baselines and evaluation tools for research on Ieaming to rank. 54 3.4.1 Experimental setup A number of state-of-the-art Ieaming to rank algorithms are provided in the LETOR pack- age, including some of the most well-known leamin g to rank algorithms from each category (pointwise, pairwise and listwise). These baselines will be used to study the performance of NDCG_Boost. Here is the list of these baselines (the details can be found in the LETOR web page): Regression: This is a pointwise approach that applies a linear regression to a ranking problem. It is used as a reference point. RankSVM: RankSVM is a pairwise approach that applies Support Vector Machine [18] to the ranking problem. FRank: FRank is a pairwise approach. It uses a probability model similar to RankNet [20] for the relative rank position of two documents, with a novel loss function called Fidelity loss function [22]. TSai et al. [22] showed that FRank performs significantly better than RankNet. ListNet: ListNet is a listwise learning to rank algorithm [26]. It uses cross-entropy loss as its listwise loss function. AdaRank_NDCG: This is a listwise boosting algorithm that incorporates NDCG in com- puting the weights for both queries and the combination of weak ranking hypothe- ses [42]. SVM_MAP: SVM_MAP is a support vector machine with MAP measure as the target objective function. It is a listwise approach [13]. While the validation set is used in finding the best set of parameters in the baselines in LETOR, it is not used for NDCG_Boost in our experiments. For NDCG_Boost, we set the maximum number of iteration to 100 and use decision stump as the weak ranker. 55 3.4.2 Results Figure 3.1 provides the the average results of five folds for different Ieaming to rank al- gorithms in terms of NDCG @ each of the first 10 truncation level on the LETOR data sets. Notice that the performance of algorithms in comparison varies from one data set to another; however NDCG_Boost performs almost always the best. We would like to point out a few statistics; On OHSUMED data set, NDCG_Boost performs 0.50 at N DCG@3, a 4% increase in performance, compared to FRANK, the second best algorithm. On TD2003 data set, this value for NDCG_Boost is 0.375 that shows a 10% increase, compared with RankSVM (0.34), the second best method. On HP2004 data set, NDCG_Boost performs 0.80 at N DCG@3, compared to 0.75 of SVM_MAP, the second best method, which in- dicates a 6% increase. Moreover, among all the methods in comparison, NDCG_Boost appears to be the most stable method across all the data sets. For example, FRank, which performs well in OHSUMED and TD2004 data sets, yields a poor performance on TD2003, HP2003 and HP 2004. Similarly, AdaRank_NDCG achieves a decent performance on OHSUMED data set, but fails to deliver accurate ranking results on TD2003, HP2003 and NP2003. In fact, both AdaRank_NDCG and FRank perform even worse than the sim- ple Regression approach on TD2003, which further indicates their instability. As another example, ListNet and RankSVM, which perform well on TD2003 are not competitive to NDCG_boost on OHSUMED and TD2004 data sets. NDCG is commonly measured at the first few retrieved documents to emphasize their importance. 56 OHSUMED T 1 -+— Regression ----- FRank -e— ListNet RankSVM AdaRankNDCG —~— SVM_MAP + Noce_\eoosr TDZOO3 0'4 a it é a 10 0'2 e 1i 6 @n NP2004 i Al 6 e 10 z 4 e a 10 Figure 3.1: The experimental results in terms of NDCG for Letor 3.0 data sets 57 Chapter 4 Ranking Refinement by Boosting In this chapter, we consider the problem of improving the accuracy of an existing ranking function with a small set of labeled instances. We are particularly interested in learning a better ranking function using two complementary sources of information, ranking in- formation given by the existing ranking function (i.e., the base ranker) and that obtained from user feedback. We call this problem ranking refinement. Ranking refinement is very important in information retrieval where feedbacks are gradually collected. The key challenge in combining the two sources of information arises from the fact that the ranking information presented by the base ranker tends to be imperfect and the ranking information obtained from users’ feedbacks tends to be noisy. We develop an objective function based on the pairwise approach for this problem and utilize the boosting technique to optimize it. Our empirical study shows that the proposed boosting algorithm is effective for rank- ing refinement, and furthermore it significantly outperforms the baseline algorithms that incorporate the outputs from the base ranker as an additional feature. 4.1 Introduction Most research in learning to rank is conducted in the supervised fashion, in which a ranking function is learned from a given set of training instances. The drawback with the supervised 58 approach is that they tend to fail when the number of training instances is small. In several real-world applications, in addition to the labeled training instances, a base ranker is avail- able that can be used to rank the documents. Then, the research question is how to exploit the outputs from the base tanker when Ieaming a ranking function from a small number of labeled instances. We refer to this problem as Ranking Refinement to distinguish it from supervised learning to rank. Below we show two examples for the application of ranking refinement: Relevance feedback In information retrieval, documents are often ordered by a predefined relevance ranking function, such as BM25 [51] and Language Model for IR [73], that assesses the relevancy of documents to a given query. Relevance feedback tech- niques are proposed to improve the retrieval accuracy by allowing users to provide relevance judgments for the first a few retrieved documents. The research question here is how to improve the accuracy of relevance feedback by combining the rank- ing information from the user feedback as well as the ranking information from the predefined ranking function. We can cast the relevance feedback problem as a rank refinement problem by viewing the relevance ranking function as the base ranker and the documents that are judged by the user as training instances. Recommender system The goal of a recommender system is to rank the items according to the interest of an active user (i.e., the test user). Usually, a few rated items are provided to indicate the preference of the active user. However, on the other hand, we can rank the items for the active user based on the rating information of the other users using the collaborative filtering techniques [74]. The research question here is how to improve the ranking performance by leveraging the two types of information, i.e. the items rated by the active user and the ranking list generated by the collabora- tive filtering technique. We cast this problem into theframework of rank refinement by viewing the collaborative filtering algorithm as the base ranker and the rated items as training instances. 59 .,‘-_l" Furthermore, any online learning of ranking functions can be viewed as a ranking re- finement problem in that the ranking function is updated iteratively with new training ex- amples collected on the fly. A straightforward approach toward ranking refinement is to view the scores of the base ranker as an additional feature, and learn a ranking function from a limited number of training examples over the augmented features. As will be shown in the experiments, this is not the best approach for exploiting the information hidden in the base ranking function. We believe that the most valuable information behind the base ranker is not its scores but the ranked list of documents it produces. We therefore view the base ranker and the labeled instances as two complementary sources of information, each produces a different loss to evaluate the performance of the new ranking function. The key challenge in combining these two sources of information is that the ranked list generated by the base ranker is imperfect while the labeled instances tend to be noisy. There are two research questions in this problem to address: Balancing between two sources of relevancy information: The first question is how to balance between two sources of relevancy information; i.e. how to evaluate the effec- tiveness of a given a ranking function that orders the documents for each query. This question is directly related to the design of the loss function. The common approach in machine learning to balance between two sources of losses is to linearly combine them with a constant. Since the reliability of each source is unknown, finding a good balance parameter is critical in this case. We propose the multiplication of the losses related to two sources of information as an effective and parameter free approach to combine them and show that it satisfies the Parieto Optimality condition [75]. Learning: Given the multiplicative approach for balancing between two different sources (i.e., the base ranker and the training examples), the second research question is how Notice the application of cross-validation is not possible here since no reliable source of information, i.e. the correct ordering of documents, is available in this case 60 to learn a ranking function by effectively combining these sources. Our approach to answer this question is based on the boosting framework. Our empirical study with relevance feedback and recommender system show that the boosting algorithm with multiplicative loss function is effective for ranking refinement, and significantly outperforms the baseline algorithms that incorporate the outputs from the base ranker as an additional feature for the documents. 4.2 Related Work Most learning to rank algorithms are designed for the setting of supervised Ieaming, in which a ranking function is learned from labeled instances. However, the problem of semi- supervised ranking, the topic of this chapter, has not been addressed in the literature, to the best of our knowledge. The algorithm developed in this chapter belongs to pairwise approach to learning to rank and is closely related to relevance feedback. Therefore, we describe a short bibliography of these two. Three well-known pairwise approaches to Ieaming to rank are Ranking-SVM [11, 76], RankBoost [19], and RankNet [20]. Ranking-SVM minimizes the number of incorrectly ordered pairs within the maximum margin framework. Several variants [21, 77] are de- veloped to further enhance the performance of Ranking-SVM. RankBoost learns a ranking model based on the same consideration, but by means of Boosting. RankNet [20] is a neural network based approach that uses cross entropy as its loss function. The relevance feedback techniques [78] are developed to improve the accuracy of the existing retrieval algorithms. There are two types of relevance feedback. The first type, termed user relevance feedback, enhances the retrieval accuracy by collecting the user rel- evance judgments for the documents that are ranked on the top of the list. As pointed out in the introduction section, the user relevance feedback problem can be treated as a prob- For the list of different approaches to Ieaming to rank, refer to Chapter 3 61 lem of ranking refinement. As we showed in the empirical study, the proposed algorithm for ranking refinement significantly outperforms the standard relevance feedback algorithm (i.e., the Rocchio algorithm) over several datasets. The second type of relevance feedback, often termed pseudo relevance feedback, does not explicitly collect the user relevance judg- ments. Instead, it treats the top ranked documents as relevant to the given query, and the documents ranked at the bottom as irrelevant. These pseudo relevance judgments are used to improve the existing ranking function. It is well known in information retrieval that pseudo relevance feedback may result in degradation of retrieval performance given the high probability of errors in pseudo relevance judgments [78]. This is similar to the noise of training instances in ranking refinement. 4.3 Ranking Refinement 4.3.1 Problem Definition Let D = (x1, x2, . . . ,xn) denote the set of instances to be ordered, where each instance x, 6 Rd is a vector of d dimensions. Let G : Rd —+ IR denote the base ranking function (base ranker), and g,- = G (x,) denote the ranking score assigned to x,- by the base ranking function G. Instance x,- is ranked before xj if g,- > 93-. To make our problem general, we assume the label information collected from user feedback is presented as a set of ordered pairs, denoted by 0 = {(xik >- xjk)|k = 1,. . . ,m} where each pair x,- >- xj indicates that instance x,- is ranked before xj. The goal of ranking refinement is to learn a ranking function F : Rd —> R by exploiting both the labeled pairs in 0 and the ranking information given by G. This is because any labeled instances can be converted into ordered pairs while the converse is not true. 62 4.3.2 Encoding Ranking Information The first important question for ranking refinement is how to encode the ranking informa- tion provided by the base ranking function G. A straightforward approach is to use the ranking scores computed by G as an additional feature, and apply the existing algorithms, such as RankBoost [19] and Ranking-SVM [76], to learn a ranking function from the la- beled instances. The drawback of this approach is twofold: 0 First, this approach only utilizes the ranking scores of the labeled instances. The ranking information generated by the base ranker for the unlabeled instances is com- pletely ignored by this approach. However, base ranker is a rich source of infor— mation for the unlabeled instances that can be exploited for a better ranking. This is particularly important when the number of labeled instances collected from the users’s feedback is considerably small. 0 Second, we believe that the ranking orders generated by the base ranking function is substantially more reliable than the numerical values of the ranking scores. Sim- ilar observation is found in the study of meta search whose goal is to combine the retrieval results of multiple search engines to create a better ranking list [79]. Em- pirical studies [79] showed that the meta search algorithms based on the document ranks often outperform the algorithms that directly use the relevance scores. To address the above problems, we encode the order information generated by the base ranking function G with matrix W E [0, 1]"x". Each Wz’o’ in the matrix represents the probability of ranking x,- before xj and is defined as follows 8Xp()\9i) eXth) + exp(/\9j) Wz',j (4.1) In the above, Wm- is defined by a softmax function and the parameter A _>_ 0 represents the confidence of the base ranking function. To see the effect of A, we consider two extreme 63 C38681 0 A = 0. In this case, we have WM 2 0.5, which indicates that the ordering informa- tion generated by the base ranker is completely ignored. 0 A = 00. In this case, we have 1 92' > gj Wi,j = 0.5 92' = gj (4-2) 0 92' < gj Thus, W is almost a binary matrix, implying that we completely trust the ranking list generated by the base ranker. In our experiment, we set A to be inverse to the standard deviation of the ranking scores for the first 10 retrieved documents. Similarly, we encode the ordering information inside the set 0 with matrix T as follows: 1—n/2 (x,- >-x-) eo T1,,- = J (4.3) 77/ 2 otherwise where parameter 7] E [0, 1]. TM represents the probability of ranking ranking x,- before xj in the training data. The parameter 1] reflects the error rate of training data, and is particularly useful when the labeled instances are derived from implicit user feedback that is usually noisy. In our experiment, we set 77 = 1 / 2. 4.3.3 Objective Function The goal of ranking refinement is to learn a ranking function F : Rd —> IR from matrix W and T that produces a more accurate ranking list than the base ranking function G. In par- ticular, the optimal ranking function F should be consistent with the ranking information in W and T. To this end, we measure the ranking errors of F with respect to both W and F, i.e., errw = Z Wi,jI(Fj _>_ Fi) (4-4) errt = Z 7;,jI(Fj 25“,) (4.5) In the above, we introduce F,- = F ()9) and the indicator function [(33) that outputs 1 when the input boolean variable x is true and zero otherwise. There are two problems with directly using the ranking errors errw and errt as the objective function: 0 First, both error functions are non-smooth functions since the indicator function I (2:) is non-smooth. It is well-known that optimizing a non-smooth function is computa- tionally more challenging than optimizing a smooth one [80]. 0 Second, with two objectives at hand, the problem is essentially a multi-objective optinrization problem [75]. Thus, another important question is how to combine multiple objectives into one single objective. In what follows, we will address these two questions separately. Relaxation with Exponential Functions. To address the problem with non-smooth ob- jective functions, we follow the idea of boosting by replacing the indicator function I (a: 2 y) With an exponential function exp(:c — y). The resulting new objective functions are: n e/FTw = Z Wi,j exp(Fj — Fi) (4.6) i3j=1 n 6m = Z Ti,jeXP(Fj—Fi) (4.7) iij=1 65 Note that since exp(a: — y) 2 1(m 2 y), by minimizing the errors Efi‘w and 67?}, we are effective in reducing the original ranking errors errw and errt. Another advantage of using e’fiw and e’Frt comes from the theoretic result of AdaBoost [81], i.e., by rrrinirrrizing the exponential loss function, the resulting classifier will not only reduce the training errors but also maximize the classification margin. The enlarged classification margin is the key to guarantee a low generalization error for testing instances [81]. Remark: It is interesting to examine the effect of the smoothing parameter 77 on the ranking error e’Frt. By substituting the expression (4.3) for TiJ' in (4.7), we have e’Frt expressed as follows: e’Frt=(l—77) Z exp(Fj-Fi) (xi>-xj)60 +3 2 [exp(Fi — Fj) + exp(Fj - FD] z‘,j=1 +77 n z (1"?) 2 +5 2120”“ (szc jexp)€0 i=1 17 = (1— 77) Z €Xp(Fj - Ft) + mllFllg (4-3) (xi-Ht j)60 where “Fug. is a norm of vector F = (F1, . . . ,Fn) defined as follows: uan = FT(nl—ee)F where 1 is the identity matrix and e is a vector of all ones. In the second step, the approxi- mation follows the Taylor expansion of the exponential function. The second term in (4.8), i.e., nllFllg / 2( 1 — 7)), plays a similar role as used by Support Vector Machines (SVM) [3]. In this sense, the parameter 7] essentially regularizes the ranking error err/7t. 66 Combining TWO Objectives. The problem of optimizing multiple objectives is usually called multi-objective optimization problem [75]. In a multi-objective problem, there is usually no single solution that can satisfy each objective to its fullest. In this problem, we are looking for a solution at which no objective can be further reduced without increasing the value of other objective functions, a condition known as Pareto optimality. The easiest approach to combine several objective functions that results in a Pareto optimal solution is to linearly combine them [75]. In our case, there are two error functions, each related to a different source of relevancy information. A given ranking function can satisfy only one source of relevancy information for each pair of documents in case of a conflict between two sources; i.e. decreasing the error related to one source can increase the error related to another. The linear combination leads to the following optimization problem, i.e., 11 La = ”Ye/fine + 6771: = Z (”twig + Tm) €XP(Fj — Ft) (4.9) t',j=1 where parameter 7 is used to weight error e771”. We refer to the approach based on the above objective function as “Linear Ranking Refinement”, or ILRR, for short. The main drawback with the linear combination approach is how to decide the value for 7. In our experiments, we will show that different 7 could result in very different performance in information retrieval. Since there is no easy way to find the best tradeoff, we consider the combination of the two errors by their products, i.e., n n = Z Tin” eXP(Fj — Ft) I 2: W233“ eXP(Fj — F,) (4.10) We refer to the approach as “Multiplicative Ranking Refinement”, or MRR for short. Now, the question is whether the resulting solution is Pareto efficient [75]. More for- mally, a solution F = (F1, . . . F") is Pareto optimal for the objectives e’Frw and 67:71; if 67 there does not exist any other solution F’ = (Ff , . . . , F,’,) that is either 1. C/fi‘w(F’) < fiw(F) and 6’7“?”qu S fidF), 01' 2. 5mm) 3 6mm) and 63mm) < new). In other words, if F is Pareto efficient, it guarantees that no solution is able to further reduce the two objectives simultaneously than F. Regarding the Pareto efficiency when rrrinirnizing Lp in (4.10), we have the following theorem: Theorem 8. The optimal solution F = (F 1, . . . , Fn) found by minimizing the objective fimction Lp is Pareto efficient. The proof of this theorem can be found in Appendix A9. The main advantage of using Lp rather than La is that it does not need a weight parameter. This will be revealed in our empirical studies in that minimization of Lp usually significantly outperforms minimization of La even when the optimal combination weight 7 is used for La. In order to compare the properties of the two different approaches for combination, we examine their first order derivatives. Let 5 denote the parameters used by the ranking function F (x). Then, the first order derivatives of La and Lp with respect to E are given as follows: n VtLa = 2 (T2,) + Twig) exp) 2',J'=1 VtLp = L, Z(at,j+b.,j>exp(ng(xj>—vtth.-)> i,j=1 where am = n 2,) em 3 ') (4.11) 22,321 WM @1ij - F2") Tm €XP(Fj - Ft“) b. . 2’] 22j=1TiJeXP(Fj — Ft) (4.12) 68 Note that both derivative shares similar structures. The key difference between V5 La and Vng is that in Vng, am- and bz‘,j are used to weight the contribution from W and T for instance pair (xi,xj) when computing the derivative. This is in contrast to VgLa where the weights for instance pair (xi, x_,-) are 7W“ exp(Fj - F2) and TM exp(Fj - F2). The main advantage of using ai,j and bi,j is that they are normalized, i.e., 23H “M = 223:1 bz',j = 1, and therefore the contributions from W and T are naturally balanced when calculating the derivative. 4.3.4 Boosting Algorithm for Ranking Refinement In this section, we will consider algorithms for Ieaming the ranking function F (x) by respectively minimizing the objective function Lp. The objective function La is similar to the objective function used by Rank-Boost [19] except that a weight (Tid- + 7W“) is used for each instance pair. We thus can simply modify the Rank-Boost algorithm to learn the optimal ranking function F(x). Hence, in the sequel, we will focus on the boosting algorithm for minimizing Lp. To learn the optimal ranking function F (x) by minimizing LP, we follow the greedy approach of boosting algorithms. Since the training examples are the labeled instance pairs, a straightforward boosting approach is to iteratively update the weights of instance pairs and train a new ranking function for the given weighted pairs. This is the strategy employed in the RankBoost algorithm [19]. However, since the number of instance pairs is 0(n2), this approach could be computationally expensive when the number of instance 71 is large. To address the above problem, we present a new boosting algorithm that converts the weights of instance pairs into weights for individual instances. The key idea behind the new boosting algorithm is to derive an upper bound for the target objective that decouples functions for pairs of instances into functions for individual instances. It is this decoupling that makes it possible to infer weights for individual instances from weights for instance 69 Algorithm 6 Boosting algorithm for minirrrizing Lp 1: Input: Wijj and Tio' as two encrypted sources of information 2: repeat 3: Compute 7233' for each instance pair as 'yz-J- = am- + bz',j where am- and bz',j are defined in (4.11) and (4.12). Compute the weight for each instance as w,- = 231:1 7231' — 7]",- Assign each instance the class label y,- = sign(w,-). 6: Train a classifier f (x) : Rd —-) {0,1} that maximizes the following quantity 50.4? m = Z lw’ilf(x’i)yi (4-13) i=1 7: Predict fz- for all instances in D 27.1._1 72' '5(fi,1)5(f',0) : . . . h = 1 1 Z,]—— 1.? J 8 Compute combination werg ts a 2 0g 23:1 7M 5(fj11)5(f¢,0) f (x,) 6 (3:, y) outputs 1 if a: = y and zero otherwise. 9: Update the ranking function as F (x) <— F (x) + a f (x) 10: until reach the maximum number of iterations where f,- = pairs. In addition, the new boosting algorithm is able to derive an appropriate binary class label for each instance using the computed weights. Using both the weights and the class assignments of instances, we can train a binary classifier f : Rd —> {0, +1} and update the overall ranking function by F’ (x) = F (x) + a f (x) where a is the combination weight. Note that by converting a ranking problem into a series of binary classification problems, the new boosting algorithm avoids the high computational cost arising from the large num- ber of instance pairs. Algorithm 6 summarizes the overall procedures for the proposed boosting algorithm minimizing Lp. In each iteration, this algorithm computes '72-,3- for every pair of instances that measures the uncertainty of ranking instance x,- ahead of xj. Then, it adds up the uncertainties of comparing instance x.) to all other instances, which results the calculation of the weight for instance x.) as w,- = 231:1 72-J- - 7332'— The bigger the magnitude of 21),, the large the uncertainty of F is in ranking xi. So, the algorithm redistributes weights pro- portional to the uncertainty to at each iteration. It is important to note that to,- can be both positive and negative. In particular, to, > 0 indicates that the algorithm did not succeed 70 x10 _\ O ‘9 0? Objective Function Value ON Iteration Figure 4.1: Reduction of the objective function Lp using the OHSUMED Data Set in ranking x,- on the top of the ranked list; and to, S 0 indicates the opposite. Hence, the boosting algorithm derives the class label y, for x,- based on the sign of 10,-: a positive class for placing instances on the top of the ranked list, and a negative class for placing instances on the bottom of the list. To summarize the above steps, using the magnitude and sign of 211,-, the algorithm chooses a weighting and a labeling direction for instances. Given a set of weighted binary class examples, the weak learner trains a classifier that maximizes m in (4.13), which can be interpreted as a sort of classification accuracy. Since most binary classifiers are unable to take weights into consideration, the boosting algorithm divides the training procedure into two steps: in the first step, it samples 3 instances according to the distribution that is proportional to the weights |w,-|; It then trains a binary classifier f : Rd -—> {0, +1} using the sampled instances . After Ieaming the new binary classifier in Step 6, the algorithm evaluates its success in reducing the loss (the value of the objec- tive function) in Step 7 and 8 and adds it with a proportional weight to the current list of classifiers at Step 9. We manually set .9 = max(20,n/5) in our empirical study. A similar strategy is employed in the AdaBoost algorithm [19] and its effectiveness has been verified in empirical studies. 71 Before providing the justification for Algorithm 6, notice that in order to have this algorithm continue iterating, the weak learner need to do better than random guessing in the following sense. Writing a in the following form a = élog(1—€) (4.14) 6 where 22j=17i,j6(fja 1)6(fi:0) Eider 7i,j(6(fja1)5(fi10) + 5(fi,1)5(fj,0)) applies that by better than random guessing we mean 6 < 0.5. 5: In the remaining of this section, we will provide justification to the prOposed boosting iterations in Algorithm 6. The main result is summarized in Theorem 9. Theorem 9. Let f k(x) denote the binary classification function obtained in the kth itera- tion, and 72",] denote '7,- j learned tn that iteration. The objective function after T iterations, denoted by L5, is bounded as follows: T n n Lg" g 2 Tm- 2 Wm- exp (,/—— ([12 (4.15) i,j=1 z',j=l k=1 where m. = Z)nzijétf’ttx.)name-)0) iJ=1 n Vt = Z 412,6(fk(xt).0)6(f’°(xj>, 1) i,j=1 The above theorem essentially shows that by using the proposed algorithm, the objec- tive function Lp will be reduced exponentially. The key to proving Theorem 9 is to establish the relationship between the objective function Lp of two consecutive iterations. This is because by upper bounding the log-ratio 72 between Lp of two consecutive iterations, i.e., r, 2 log Lg, — 10g Lg-l, (4.16) we will have T T Lt T _ 0 P 0 LP _ LPH 7: g Lpexp Zr, (4.17) t=1 LP t=1 For the convenience of presentation, in the following, we only consider two consecutive iterations without specifying the index of iteration. Instead, we denote the quantities of the current iteration by symbol"to differentiate the quantities of the previous iteration. In order to establish an upper bound for the log ratio, we first introduce the following lemma Lemma 5. Assume F (x) = F (x) +01 f (x) where F (x) and F (x) are the ranking functions of two consecutive iterations, respectively. f : Rd —> {0, 1} is a binary classifier and oz is the combination weight. We have the following inequality hold for any F, f, and (I: ~ L n 108 —p S -2 + Z (ai,j + bi,j) eXp(a(fj - ft)) (4.18) L” 231:1 where am- and b233- are defined (4.11) and (4.12), respectively. The proof of Lemma A.10 can be found in Appendix A.10. Using Lerrrrna A.10, we present the proof of Theorem 9 in Appendix A.11. Finally, we can show the relationship between the objective function Lp and the quan- tity m (in (4.13)) that is used to guide the training of binary classifiers in iterations. This result is summarized in the following theorem: Theorem 10. Let ”We denote the value of the quantity in Equation (4.13) that is maximized by the binary classifier f k(x) learned in the tth iteration. Assume that lit 2 0 for each iteration. Then, the objective fimction after T iterations, denoted by L5, is bounded as 73 follows: 12 n T L; S 2 Tm 2 Wm 8X1) - Z 77: (4.19) z',j=1 i,j=1 t=1 The proof of the above theorem can be found in Appendix A. 12. Theorem 10 provides a theoretical justification for Algorithm 6. In particular, by maximizing m, Algorithm 6 effectively reduces the objective function Lp. This is further confirmed by our empirical study. Figure 4.1 shows an example of reduction in the objective function Lp. We clearly see that the objective function is reduced exponentially and receives the largest reduction during the first few iterations. 4.4 Experiments In this section, we evaluate the proposed algorithm for ranking refinement by two tasks, i.e., user relevance feedback and recommender system. The objectives of our experiments are: (1) to compare the proposed algorithm for ranking refinement to the existing ranking algorithms, (2) to examine the performance of the proposed algorithm for ranking refine- ment with different numbers of training instances, (3) to examine the effect of different base rankers on the performance of the proposed algorithm, and (4) to examine the time efficiency of the proposed algorithm for ranking refinement. We use the Letor data set for Relevance Feedback experiment and Movies data set for the Recommender System experiment. The description of these data sets can be found in Section 1.6.2. 4.4.1 Experimental Setup Algorithms. To examine the effectiveness of the proposed algorithm for ranking refine- ment, we compared the following ranking algorithms: Base Ranker: It is the base ranker used in the ranking refinement. 74 Rocchio: This algorithm extends the standard Rocchio algorithm [82] for user relevance feedback that creates a new query vector by linearly combining the original query vector and vectors of feedback documents. Given the initial query Q0, the relevant documents (R1, R2, ..., Rnl) and non-relevant documents (81, 5'2, ..., Snz), the new query according to Rocchio is computed as: Q: Q0+a251 —e2S—;L (4.20) Note, in our case, that each document is not represented by a vector of word fre- quency, but a vector of features that are computed based on its match to the query. Hence, we don’t have Q0, i.e., the representation vector for query itself. We therefore set Q0 to be a vector of all zeros. We used the inner product between the new query and documents as the scores to rank the documents. W‘e vary a and 5 from 1 to 10 and choose the best setting. SVM: This implements the Ranking-SVM algorithm using the SVM light package. Note that it is commonly believed that Rank-Boost performs equally well as Ranking SVM. The experimental results provided in the LETOR collection also confirm this. Hence, we only compare the proposal algorithm with Ranking-SVM, but not Rank- Boost. MRR: This is the Multiplicative Ranking Refinement algorithm that minimizes Lp in (4.10). LRR: This is the Linear Ranking Refinement algorithm that minimizes La in (4.9). Since the performance of LRR depends on the parameter 7, we run LRR with 100 different values from 0.1 to +10 and choose the best and worst performance. We referred them to as LRR-Worst and LRR-Best, respectively. 75 For a fair comparison, the output from the base ranker is used as an extra feature when using SVM (i.e., Ranking-SVM) and Rocchio. Notice that we do not compare the perfor- mance of the proposed method with different baselines provided in LETOR because the experiments in LETOR are obtained under a different setting. We will discuss the experi- mental setup used in this chapter in Section 4.4.1. Similar to Chapter 3, we used NDCG to evaluate the performance of different methods. NDCG is described in Section 3.3.1. Evaluation Protocol. For each LETOR data set, we choose the best ranking feature compared to other features and use it as the base tanker. The best ranker for datasets OHSUMED, TD2003, TD2004, HP2003, HP2004, NP2003, NP2004 are feature number 11, 46, 46, 46, 46, 6, and 6. We followed the common practice of user relevance feedback by collecting the relevance judgments for the first 20 retrieved documents; i.e. we sort all documents of one query based on the base ranker and simulate the user feedbacks by using the true relevancy of the first 20 documents. These user relevance judgments served as la- beled instances in ranking refinement. Notice that it is well known that relevance feedback depends on the quality of feedback documents. If the underlying base ranker does a poor job in identifying the relevant documents, it is very likely that most of the feedback docu- ments are irrelevant, leading to a poor performance of the proposed algorithm. We come back to this problem in Section 4.4.3. For the experiment with recommender system, the base ranker was created by apply- ing a collaborative filtering algorithm, more specifically, the Personality Diagnosis algo- rithm [74], to the user rating data. In particular, 20 users were randomly selected as the training users, and the remaining 923 users were used for testing. For each test user, 10 rated movies were randomly selected and were used by the collaborative filtering algo- rithm to identify the 20 training users who share the common interests with the test user. Note that we did not compare the proposed algorithm to other information filtering algo- rithms because the focus of this study is to examine the effectiveness and the generality of 76 the proposed approach for ranking refinement. 4.4.2 Results for Relevance Feedback Figure 4.2 show the ranking results of different algorithms in terms of NDCG for the first 25 ranked documents. First, by comparing the performance of the two variants of ranking refinement, we observed that the Multiplicative Ranking Refinement (MRR) algorithm is more effective than the Linear Ranking Refinement (LLR) algorithm. Indeed, MRR per- forms significantly better than the best case of LRR (i.e., LRR-best) for OHSUMED and TD2004 datasets. The key difference between MR and LR is that MRR minimizes the product of the two error functions while LRR minimizes the weighted sum. We believe it is the normalization scheme brought by MR (see equations in (4.11) and (4.12)) that makes it performing better than LRR. The performance of MR is more appreciated given it does not have a single parameter that needs to be adjusted manually. Second, comparing to the other three baseline algorithms, i.e., the base ranker, Roc- chio, Ranking-SVM, we observed that MRR significantly outperforms the base ranker and Rocchio algorithms in all the cases; it outperforms Ranking-SVM in the first three data sets and yields similar performance for the remaining four data sets. We also note that the improvement made by the ranking refinement is more significant for the first a few rank- ing positions than the other ranking positions, a very desirable property for web search in which users usually only pay attention to the first a few retrieved results. We thus conclude that Multiplicative Ranking Refinement is more effective than the baseline algorithms for user relevance feedback in information retrieval. Finally, notice that the ranking algorithms show different trend of NDCG on different data sets. Particularly, NDCG is decreasing for the first three data sets and increasing for the remaining data sets. The increasing or decreasing trend is directly dependent on the number of relevant documents and the quality of ranking. If a ranking algorithm performs a good job in retrieving the relevant documents on the top of the list, it is generally expected 77 to have a decreasing trend. This is because it is more likely to see irrelevant documents in the list as we retrieve more documents. For the last four data sets, there is only one relevant document for each query. Even a good ranker is not able to retrieve the only relevant document on the top of the list and that is why you see NDCG increases until it retrieves the relevant documents of all the queries and then remains constant. 4.4.3 Effect of Base Ranker W examine how the proposed algorithm response to different base tankers, in particular the base rankers with relatively poor retrieval performance. We tested MRR algorithm with three different base rankers that are selected automatically based on their ranking perfor- mance. These three base rankers are the worst, the best and a medium quality base ranker selected from the list of features for each data set. Figure 4.3 shows how MRR algorithm performs when the selected base rankers are used. In each sub-figure, different base rankers are distinguished with a number in the legend that shows the feature number they use. The result indicates that the quality of base rankers has a direct impact on the performance of the MRR algorithm. However, the proposed algorithm is able to significantly improve the performance for a base ranker that can retrieve some relevant documents. When the base ranker performs extremely poor (like in TD2003, HP2003, HP2004, NP2003, and NP2004), all the retrieved documents are are judged as irrelevant by user and no infor- mation is available from either sources. Therefore, no improvement can be made by the proposed algorithm for extremely poor base rankers. It is also interesting to observe that for data set OHSUMED, even with the worst base ranker, MMR algorithm is able to achieve similar performance to the baseline methods when they use the best base tanker. This re- sult further confirms the effectiveness of the proposed algorithm for ranking refinement. We thus conclude that the MR algorithm is resilient to the imperfectness of base rankers. 78 4.4.4 Effect of Size of Feedback Data To investigate the effect of the number of feedback documents on the performance, we ran the MR algorithm by varying the number of feedback documents from 5 to 20. Figure 4.4 shows the result using varied number of feedback documents. We clearly observed that the number of feedback documents have a direct effect on the performance of ranking refinement. However, even with a small amount of feedback, MRR is able to improve the retrieval performance considerably, particularly for the accuracy of the first few ranked documents. We thus conclude that the proposed algorithm for ranking refinement is robust to the size of feedback data. Also notice that for data set NP2003, there is no changes in the performance of MR with different relevance feedback. The reason is that the base ranker in this case is not able to retrieve any relevant documents for most queries. 4.4.5 Results for Recommender System We evaluated the generality of the proposed algorithm by applying it to recommender sys- tem (movie recommendation). Figure 4.5(a) show the results of different algorithms when applied on the MovieLens dataset. It is surprising to observe that the results of LRR, the lin- ear ranking refinement algorithm, even with the tuned parameter 7, is not comparable to the the performance of the base ranker. In contrast, the MRR algorithm is able to significantly improve the accuracy of the base ranker and outperforms the other baseline algorithms con- siderably. This result further indicates the importance of appropriately combining the two information sources, i.e., the ranking information behind the base sranker and the feedback information provided by users. Figure 4.5(b) shows the sensitivity of MRR to the size of feedback data by varying the number of movies rated by the test user from 5 to 25. Similar to the result for relevance feedback, we observed that the size of feedback data affects the performance of MRR considerably. However, even with 5 rated movies, the MR algorithm is able to make a noticeable improvement in the ranking accuracy compared to the base ranker. This result 79 further confirms the robustness of the proposed algorithm to the size of feedback data. 4.4.6 Time Efficiency of Ranking Refinement Figure 4.6 shows the efficiency of the MR algorithm in terms of the running time for different numbers of rated movies for each test user. We chose movies data set for the experiment because the number of rated movies varies significantly from users to users, making it easy for us to evaluate the computational efficiency of the proposed algorithm. We partitioned the test users into groups where each group of users has a different number of rated movies. The running time of MR for each group is calculated by averaging it across all the users in the group. As pointed in Section 4.3.4 and seen in Figure 4.6, the running time is linear in the number of instances. Note that the relatively long running time is due to the MATLAB implementation. 80 —r- Base Ranker + Rocchio —°— SVM Mo Best_LRR -~— Worst_LRR —-— MRR 0.8 0.2 0 10 15 0 5 o 51b75 2o 25 Top Documents HP2003 4 '- _gangaaailaaaamaaaaaam‘ AAAAAAAAAA A A A ............ vn" o 1'0 15 2o 25 Top Documents NP2003 1 Top Documents OHSUMED o 5 1‘0 15 21) Top Documents T02004 25 NDCG o 5 1o 15 20 Top Documents HP2004 o 5 1o 15 2o Top Documents o 5 1o 15 2o Top Documents Figure 4.2: NDCG of relevance feedback for different algorithms 81 25 OHSUMED —0— Base Ranker-46 —+— MRR-46 + Base Ranker-36 ---o-- MRR-36 —-— Base Ranker-16 0.2 L _._ _ o 5 1‘0 1‘5 20 2'5 MRR 16 Top Documents TDZOOB TDZOO4 0.4’ NDCG W W 00 5 10 15 20 25 00 5 10 15 20 25 Top Documents Top Documents HP2003 HP2004 W W 00 5 10 15 20 25 00 5 10 15 20 25 Top Documents Top Documents NP2003 NP2004 W W 00 5 10 15 20 25 0O 5 10 15 20 25 Top Documents Top Documents Figure 4.3: NDCG of MR with different base rankers for relevance feedback 82 —1— Base Ranker —o— MRR-5 —e— MRR-10 .-...9 MRR-15 -~— MRR-20 —-— MRR—25 TDZOO3 0.2 r o 5 1b 15 Top Documents HP2003 0.951 -_ 0.9 2 0.75- M 1 0.65 20 o 140 20 Top Documents NP2003 0.8 ' 0.7' 0.6 ’ NDCG 0.5’ 0'40 5 1o 15 Top Documents 250 3‘0 25 0.9“ 0.8* . 0.4 OHSUMED 5 1‘0 15 20 25 Top Documents TD2004 5 10 15 2o 25 Top Documents HP2004 0.2 5 1‘0 15 20 25 Top Documents NP2004 5 1‘0 15 20 25 Top Documents Figure 4.4: NDCG of MR with different numbers of feedback documents for relevance feedback 83 rr—y .- W luv-al.: Movie +3 8 R k . 1 . a e . an er MOV' “ -+-Base Ranker '9' ROCChIO 0 9. ' -0- MRR-5 0 9 +SVM -e— MRR-1O ' r «a» Best_LRR 0.85 .t _a” MRR-15 8 0.8' « ‘ +Worst_LRR (9 0.3 +MRR-20 o 8 +MRR-25 ZQT zom- 0.61 0.7 0.50 5 1o 15 2‘0 25 0'650 10 20 so Top Documents Top Documents (a) NDCG chart (b) Sensitivity to the number of rated movies Figure 4.5: The ranking result for rmommender system 3.5 I I I I I I I I Time (Seconds) ... 5 A 01 I L 0'5 1 1 J l L l I 1 o 50 100 150 200 250 300 350 400 450 Number of Movies Figure 4.6: Running time of MMR for different numbers of movies rated by test users 84 1! Chapter 5 Online Classification with Bandit Feedback In this chapter, we consider the problem of online classification with bandit feedback: in each trial of online learning, instead of providing the true class label for a given instance, the adversary will only reveal to the learner if the predicted class label is correct. Unlike online learning with full feedback, learner here does not receive the loss value for all the hypotheses in the hypothesis space after it chooses one, which demands a new approach for an effective Ieaming. We present a general framework for online multi-class learning with partial feedback based on the notion of potential [83]. The generality of the proposed framework is verified by the fact that Banditron [5] is indeed its special case with the squared L2 norm of the weight vector as the potential. Using the exponential potential, we propose an exponential gradient algorithm for online multi-class Ieaming with partial feedback that has the interesting property that its mistake bound is independent from the dimension of data, making it suitable for classifying high dimensional data. Our empirical study with the classification data sets show that the proposed algorithm for online learning with partial feedback is more reliable than Banditron. 85 5.1 Introduction Online learning with partial feedback assumes that, in each trial of online learning, the adversary only reveals to the learner if the predicted class label is correct and does not provide the true class label for a given instance. Online learning with partial and full feedback are equivalent when there are only two classes. Therefore, we assume it is clear that the classification problem is a multi-class one when we talk about online classification with bandit feedback. Online learning with partial feedback is closely related to the problem of multi-armed bandit which is the generalization of a traditional slot machine game, called one armed bandit [84]. In multi-armed bandit, there are n arms to pull with unknown rewards. A player aims to maximize its reward over the trials by Ieaming the best arm to pull. When the player starts, he/she does not know which arm is more profitable. It is only over the trials that he/she learns the best arm to pull. In each stage of this game, the player needs to decide if he/she is going to explore a new arm or exploit his/her knowledge by choosing the best arm, a technique called exploration vs. exploitation tradeoff. This strategy helps the player to constantly receive feedback for all arms. The problem of online classification with bandit feedback can be considered a multi- armed bandit problem, with the feature vector of example available as a sort of side in- formation; i.e., at each round, after observing an instance, the learner needs to decide a class label (an arm). Although online multi-class Ieaming with full feedback has been ex- tensively studied, the problem of online multi-class Ieaming with partial feedback is only studied recently [5, 85]. The challenge in online Ieaming with bandit feedback is the fact that after classifying a new instance, the learner only receives the loss value for the part of the hypothesis space that have the same prediction as current hypothesis. To explore different parts of the hypothesis space, the learner needs to sacrifice the chance of correctly classifying the current instance in the hope that it finds the best model that minimizes the long-term number of mistakes. We will give a detailed description of this strategy and its 86 characteristics in Chapter 6. In this chapter, we propose a general framework to address the challenge of partial feedback in the setup of online classification. This general framework adapts the potential- based gradient descent approaches for online Ieaming [83] to the scenario of partial feed- back. The generality of the proposed framework is verified by the fact that banditron is indeed a special case of our framework if the potential function is set to be the squared L2 norm of the weight vector. Besides the general framework, we further propose an expo- nential gradient algorithm for online multi-class Ieaming with partial feedback. Compared to the Banditron algorithm, the exponential gradient algorithm is advantageous in that its mistake bound is independent from the dimension of data, making it suitable for classifying high dimensional data. We verify the efficacy of the proposed algorithm for online learning with partial feedback by an extensive empirical study. 5.2 Related Work Although introduced very recently and there is only a few work directly related, the prob- lem of online multi-class Ieaming with bandit feedback can be traced back to online multi- class classification with full feedback and multi-armed bandit Ieaming. The former pro- vides the required tools to handle the problem of partial feedback and the later offers a starting point for the development of an online multi-class Ieaming with partial feedback. Both these areas have been extensively studied and we only provide a brief review Several additive and multiplicative online multi-class Ieaming algorithms have been introduced in the literature [52]. Perceptron [43] and Winnow [86] are two such algorithms. Kivinen and Warrnuth developed potential functions that can be used to analyze different online algorithms [87]. Grove et al. [88] showed that polynomial potential can be considered as a parameterized interpolation between additive and multiplicative algorithms. Multi-armed bandit problem refers to the problem of choosing an action from a list 87 of actions to maximize reward given that the feedback is (bandit) partial [44, 89, 90]. The algorithms developed for this problem usually utilize the exploitation vs. exploitation trade- off strategy to handle the challenge with partial feedback [46, 47]. Multi-class learning with bandit feedback can be considered as a multi-armed bandit problem with side information. Langford et al. in [85] extended the multi-armed setting to the case where some side information is provided. Their setting has a high level of abstrac- tion and its application to the multi-class bandit Ieaming is not straightforward. Banditron, which can be considered as a special case of our framework, is a direct generalization of Perceptron to the case of partial feedback and uses exploration vs. exploitation tradeoff strategy to handle partial feedback [5]. Potential function and exploration vs. exploitation tradeoff techniques are the main tools used to develop the framework in this paper. Notice that the problem of bandit with side information has been also addressed in rein- forcement learning under the name of Associative Bandit problems [91—94]; however those work assume that the side information are i.i.d samples from an unknown distribution. This is unlike our online approach that no assumption is made about the process that generates data. 5.3 A Potential-based Framework for Classification with Partial Feedback We first present the problem of online classification with partial feedback, followed by the presentation of potential based framework and exponential gradient algorithm. 5.3.1 Problem Definition We denote by K the number of classes, and by x1, x2, . . . ,xT the sequence of training examples received over trials, where x,- 6 Rd and T is the number of received training instances. In each trial, we denote by g, E {1, . . . , K} the predicted class label. Unlike 88 the classical setup of online learning where an oracle provides the true class label y,- E {1, . . . , K} to the learner, in the case of partial feedback, the oracle only tells the learner if the predicted class label is correct, i.e., [yt = 'y't]. This partial feedback makes it difficult to learn a multi-class classification model. In our study, we assume a linear classifier for each class, denoted by W = (W1, . . . , w K) E RdXK, although the extension to nonlinear classifiers using ker- nel trick is straightforward. Given a training example (x, y), we measure its loss by E (maxkfll ng — wa) where 13(2) = max(0, z + 1) is a hinge loss. We denote by W1, . . . , WT a sequence of linear classifiers generated by an online learning algorithm over the trials. Our objective is to bound the number of mistakes made by the online learn- ing algorithm. Since the proposed framework is a stochastic algorithm, we will focus on the expectation of the mistake bound. As will be shown later, the expectation of the mistake bound is often written in the form T a(U) + fig! (gagixzuk — xguyt) where U = (u1,. . . , u K) is the linear classifier, (W) : RdXK H R is a strictly convex function that measures the complexity of the linear classifiers, and a and 5 are weight constants for the complexity term and the classification errors. Note that the Banditron algorithm is a special case of the above framework where it measures the complexity of W by its Frobenius norm, i.e., (W) = %|W|%.. In this chapter, we design a general approach for online learning with partial feedback that is adapted to any complexity measure (W). Finally, for the convenience of presentation, we define K W =€(maxxth -xth > (5.1) t( ) [696% t k t 31: 89 5.3.2 Banditron Kakade et al. [5] developed Banditron for the problem of online classification with bandit feedback. Banditron, depicted in Algorithm 7, is basically Perceptron adapted to handle the case of bandit feedback by utilizing the exploration vs. exploitation tradeoff technique. Af- ter receiving a new instance xt, Banditron computes the primary class assignment if]; using the weight matrix W“1 at Step 5, just like Perceptron. Using the exploration vs. exploita- tion tradeoff parameter '7, the learner decides label 3}} at Step 6 and 7 which is either gift (exploitation) or another random class label (exploration). After receiving a feedback, the algorithm computes the update matrix xtrit which, on average, is equivalent to the update matrix in Perceptron for the full feedback setting. Kakade et al. provided the following mistake bound for Banditron in [5]. Bound for Banditron: Let K be the number of classes. After running over a sequence of examples x1, . . . ,xT, with ||xt||2 g 1 for all t, the expected number of mistakes made by Banditron, denoted by E[M], is bounded as follows 2mm} 2 [Morgan EM<€U T 3 ,‘/U T + 5.2 []_()+1+maX{7 Ilpr} 7 () where U is any arbitrary weight matrix (classifier) and L. 5.3.3 Potential-based Online Classification for Partial Feedback Our framework, depicted in Algorithm 8, generalizes the Banditron algorithm [5] by con- sidering any complexity measure (W) that is strictly convex. In this algorithm, we intro- duce 0 6 RdXK, the dual representation of the linear classifiers W. In each iteration, we first update at based on the partial feedback [3]; = fit], and compute the linear classifier Wt via the mapping V* (6), where * (6) is the Lagendre conjugate of (W). Similar to Ban- ditron and most online Ieaming with partial feedback [83], a stochastic approach is used 90 Algorithm 7 The Banditron Algorithm 1: Parameters: 0 Step size: '7 > 0 2: Set wg = 0,11: =1,...,Kand90 = V*(WO) 3: fort = 1,...,Tdo 4: Receive xi 6 Rd Compute 37;} = arg maxls kg K ”(I‘VE—1 Seter = (1 -’7)[k = 17t1+7/K,k =1,---,K Sample {it by distribution p = (p1, . . . , p K)- Predict it and receive feedback [yt = 5,] 099.2939 Compute (it = 1.771: -— 12% 1311):}11 where 1 k stands for the vector with all its elements 31 being zero except its kth element is 1. 10: Compute Wt = Wt‘l — xtcitT 11: end for to predict class assignment, in which parameter 7 > 0 is introduced to ensure sufficient exploration [44]. In the following, we show the mistake bound for the proposed algorithm. For the con- venience of discussion, we define vector rt 6 Rd as T, = 1% _ 1,, (5.5) Proposition 3. For €t(W) 2 1, we have ve,(W)=(A) — 5(3) — (A — B, V(B)) (5.7) The following classical result in convex analysis summarizes useful properties of Bregman 91 Algorithm 8 Online Learning Algorithm for Multi-class Bandit Problem 1: Parameters: o Smoothing parameter: 7 E (O, 0.5) 0 Step size: 17 > 0 0 Potential function: : RdXK v—> R and its Legendre conjugate * : RdXK I—+ 1R 2; Setw2=o,k= 1,...,Kand00=vq>*(W0) 3: fort=1,...,Tdo 4: Receive xt 6 Rd 5: Compute {it = arg maac)c;rwt_1 (5.3) 1_<_lcSK 6: Setpk = (1 —7)[k=37t] +7/K,k= 1,...,K 7: Randomly sample fit according to the distribution p = (p1, . . . ,pK). 8: Predict {it and receive feedback [yt = gt] 9: Compute (it = 137t — 1§t[yt = gt] (5.4) pA yt where 11: stands for the vector with all its elements being zero except its kth element is 1. 10: Compute 6t = 6t"1 — nxtdg— 11: Compute Wt = V(6t) where 6t = (6t,...,6§() 12: end for distance. Lemma 6. Let (W) be a strictly convex function with constant p with respect to norm H ‘ H i.e.,for any W and W’ we have (W — W', V(W) — WW» 2 mm — W’llz. We have the following inequality for any 9 and 0’ <6 — 6', v<1>*<0> — V * (6')) s fine — 6’11: where *(0) is the Legendre conjugate of (W) and H - H... is dual of norm || - H. Further- 92 more, we have the following equality for any W and W' Dq,(W, W’) = 19.1,. (0, 0’), where 9 = V(W) and 6’ = V(W’). Proposition 4. For any linear classifier U 6 Rd" K, We have the following inequality hold for two consecutive classifier Wt_l and Wt generated by Algorithm 8 17.1,. (U, WH) — 0.1,. (U, Wt) + 13.1,.(Wt—1, Wt) = —(U — Wt_1,nxt6;r) (5.8) Proof. Using the property of Bregman distance function (see for example Chapter 11.2 in [83]), we have 0.1,.(U, WH) — 0.1,.(U, Wt) + 13.1,.(Wt-1, Wt) = (U — Wt_1, V*(Wt) — V*(Wt_1)) = (U _ Wt_1,6t _ gt—l) = — The second step follows the property 9t = V*(Wt), and the last step uses the updating rule of Algorithm 8. C] Now, we can bound E[|6t l2} as follows, with the proof provided in Appendix A.13. Proposition 5. For any 3 > 0, we have K 3 2/3 E 6 2 <_.L A _ l _ [Itlsl—1_7+[yt7éyt]{1 7+K(1+[7]) } We use |W|p,3 to measure the norm of matrix W E RdXK with p 2 1 and s 2 1. It is 93 defined as W = max 11, Wv 5.9 I has lulpsws £1< > ( > where u e Rd, v E RK, and |u|q and Mt are L, and Lt norm of vector u and v, respec- tively. Evidently, the dual norm of I - Ip,s is l - I”, with p"1 + q“1 = land s_1+ t”1 = l. The theorem below shows the regret bound for Algorithm 8. The proof of this theorem is provided in Appendix A.14 ‘ Theorem 11. Assume that for the sequence of examples, (x1, 311),. . . , (XT, yT), we have, for all t, xt 6 Rd, ||x||p 3 land the number ofclasses is K. Let U = (ul, . . . ,uK) E RdXK be any matrix, and * : RdXK r—> 1R be a strictly convex fimction with constant p with respect to norm I - lp,s- The expectation of the number of mistakes made by by Algorithm 8, denoted by E[M], is bounded as follows T 1 1 177T EMS—D*U+— €U+——+T [] m <1>() REA) 2pn(1_7) 7 where _ 77 ’7 K3 2/3 “—1‘z{1‘7+k‘(”[?]) } Notice that the Banditron algorithm is a special case of the general framework with *(W) = %|W|%. and | - Ip,3 = | - I22 = I - |F~ The Banditron bound is specifically obtained through approximations 7/ (1 — 7) S 27 and 1+ Iii/’7 S 2k/7 in summarizing the terms in n. 94 5.3.4 Exponential Gradient for Online Classification with Partial Feedback In this section, we extend the exponent gradient algorithm to online multi-class Ieaming with partial feedback. A straightforward approach is to use the result in Theorem 11 by setting K (1 «5(9) = [Zane-,1.) (5.10) 77‘ ll l—l a: ll H ’9; 1% u M» Ma. Wi,k(1an',k — 1) (5.11) ' 1 Fr ll H s II where each wk is a probability distribution. Following the general framework presented in Algorithm 8, Algorithm 9 summarizes the exponential gradient algorithm for online multi- class Ieaming with partial feedback. Since *(W) is strictly convex with constant 1 with respect to | - | F, we have following mistake bound for the exponential gradient algorithm. Theorem 12. Assume that for the sequence of examples, (x1, yl), . . . , (xT, yT), we have, for all t, xt 6 Rd, ||x||2 g 1 and the number ofclasses is K. Let U = (111, . . . , uK) E RdXK where each uk is a distribution. The expectation of the number of mistakes made by by Algorithm 9 is bounded as follows T Kan 1 777T EMS +— E U+————+ T wherenzl—zzlp(l—7+%+§-). By minimizing the mistake bound in the above theorem, we choose step size n as fol- lows _ K(1-'7)1nK _\/ T7 (5.12) 95 Algorithm 9 Exponential Gradient Algorithm for Online Multi-class Learning with Partial Feedback 1: Parameters: o Smoothing parameter: 7 E (0, 0.5) 0 Step size: n > 0 2: Set (90 = llT/d 3: fort = 1,...,Tdo 4: Compute W153,c = exp(6§,k)/Z,tc where Zfc = 2L1 exp(6f,k). 5: Receive X): 6 Rd 6: Compute i1} = arg maxxgrwt—1 (5.13) lngK $6th = (1 -7)[k = 37t1+7/K.k = 1,---.K Randomly sample fit according to the distribution p = (p1, . . . ,pK). Predict 5t and receive feedback [yt = at] 10: Compute 999:1 [yt = 9t] (St = 1A — 1~ —— (5.14) y y t t pilt where 11: stands for the vector of all elements being zero except that its kth element is 1. 11: Compute at = Ot‘l — nxtdg 12: end for For the high dimensional data, we can improve the result in Theorem 12 by using the following lemma. The proof of this lemma is provided in A.15. Lemma 7. (W) and * (W) defined in (5.10) and (5.11) satisfies the following properties K (W — w’, v<1>*(W) — V*(W’)) 2 Z Iwk — wm k=1 K <6 — 0’. We) — V*(6’)> 5 Z l6.,k — 61,..IE. gr II b-l where 0*): = (61,,“ . . . ,HdJc). Using the above lemma, we have the following theorem that updates the result in The- orem 12 96 Theorem 13. Same as the setup of Theorem 12 except that |x1|oo S 1. The expectation of the number of mistakes made by by Algorithm 9 is bounded as follows T Kan+ 777T EM < __ [ 1 +-Zlft(U) +2p.1_,)+rT wherenzl—Qn—p(2—2’y—4fi). Proof The proof is the same as the proof of Theorem 11 except that we have K 722 E[D¢(6‘ 165.61 12723 Zlénuxtlio s; Euatm lc=1 A simple computation shows that E[|6t|1] = 2 — 27 — 47 / K . By combining these results, we have the theorem. [:1 The major difference between Theorem 12 and 13 is the constraint on x: L2 is used in Theorem 12 and Loo is used in Theorem 13. Therefore, Theorem 13 shows that the exponential gradient algorithm is essentially independent from dimensionality d, making it suitable for handling high dimensional data. 5.4 Experiments To study the performance of the proposed framework, we applied the exponential poten- tial algorithm introduced in 5.3.4 on the multi-class classification data sets introduced in Section 1.6.1. We compared the classification performance of the proposed exponential gradient algo- rithm, Exp, to the Banditron algorithm. Since the exponential gradient algorithm assumes all the combination weights to be non-negative, in order to make fair comparison between the proposed approach and the Banditron algorithm, we run two sets of experiments for Banditron, one which is the original Banditron and one that projects the learned weights 97 MNIST NURSERY 1 _ & Perceptron 0 6513‘ — Perceptron 0.8“; G Banditron_pos ' . 4}Banditron_pos o 3‘. "a" Banditron 0-5 ’ \‘1 «n- , Banditron i‘: --O-'Exp % 0.55" “€1.13 -o--Exp s a-§§gg .3333 o 0 0 one o o--o ,0 0 - - 4 . . L 0 2 , . 4 6 0 5000 10000 15000 “8|an rounds x 10“ Training rounds PROTEIN LETTER 0.8 1%. —Peroeptron “a... a w 4} Banditron_pos 09' ' ‘-"béc‘—§.;3 -f= - ~ g. .n. _ , , m3-.. _. _ __n 0-7 tan-Banditron 08 8 0 9" '9 g on. ..0- Exp .3 - —Perceptron 5 0 '5 0.7- G Banditron_pos LE ' ,5 06 ail-Banditron 0.5- G'fi‘ygri‘éwsé . ( "0" Exp 0.5 \— 0.4 . r . l . ‘ ‘ 0 0-5 , . 1 1-5 2 0.40 5000 10000 15000 Training rounds x 104 Training rounds PENDIGITS OPTDIGITS 1 —-Peroeptron 1[ --D'Banditron_pos 5,, 0 8 “i ‘G'Banditron 0-8' #3; _. _ 0: (83:3: cr- 0 Exp 0 g~§§? '8: ‘1 ~{: a» a .‘ .. _ - " “ fl ._ n- .- «... r» _ "‘E} __a E :0.,'B"g.: G T - “ E— a S 0'6\ 0'”-3. STE“ 3- El '5 0 6 Glue ‘ BlBT'fl-‘O-«g h " -0- .0 I: °"-o---o..., 2 0,4- —Perceptron m 0 0~ 0 LU 0 4 JCFBanditron_pos ' 0.2’ --D--Banditron M 0'20 2000 4000 6000 8000 00 1000 2000 3000 4000 Training rounds Training rounds ISOLET 1a .~a.=§.:.é.z-82-~32é8£ £8 ' '- p. 0.8 ‘ ”rial-0 .9 —Perceptron E 0 6 0 Banditron_pos g ~D~Banditron 0.4 0.2 0 2000 4000 6000 8000 Training rounds Figure 5.1: The figure shows the error rates of different methods over trials with the best setting of 7. 98 .1" '.-_ nun. .l.w.n_ _I MNIST NURSERY 1 . . . . . - —Perceptron —Perceptron 0_ ‘ --D--Banditron . «Gr-Banditron 2 h's Exp 0 Exp . S 0.6' K," _ 'gfi ,G‘B‘IQ a :-.,..fl_.g-_,g:~..-err8~--5“‘ .‘ ,.a—-cr_,,.o---o ' 0.2 Omo‘" 00 0:1 0:2 0:3 014 0.5 0 0:1 0:2 0:3 0:4 Gamma Gamma PROTEIN LETTER 0.8 . . , 1 - . —Perceptron 1%,;va iii-Banditron 0.9 "---..o"'E'--a-..g.._3__ .-.a---u---m 07’ --o--Exp ""'o-~-o--~o.._., ...-3---o-~~o~---<> g g 0.8: : a 2. 9 “ 05f ‘ " 0-7 —Perce tron g b, _ 11317.8 g . . -p LU 1.5.3. __ _, .o-8'f_;87'---'8—“'0 LIJ 0.6: fl- Banditron 0.5 9.....5.,,:g....8-~ ‘ o-Exp 0.5- . . r r . .4 l . . . 040 0.1 0.2 0.3 0.4 0.5 0 O 0.1 0.2 0.3 0.4 0.5 Gamma Gamma PENDIGITS OPTDIGITS 1 . g 4 1 . . , & ——Pereeptron é,‘ -— Perceptron .: .,\ Kit-Banditron 0.80 'x' ~0- Banditron o 0.829 K‘s ..0.. Exp 0 “a" ..0.. Exp V E ‘-. "n-5," ,P'ua" 160.6» "0,“a’n'» x” “OMB"..-O ‘- 0.6' 21' ‘3'":13' ..0...,o."‘<> ._ 00300 E "o. ...-0: -o---o“' g 04- LIJ '--.,o,..-0’ uJ ' 0.4- 0.2 0.2 1 r 1 1 G . r . . 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5 Gamma Gamma ISOLET 1 . . . . &. ---..- -- Perceptron Emma‘s.- ... ‘G'Banditl'on 0.8- 8 ~-o--Exp O E h 0.6- E w 0.4- 0'20 0:1 012 04.3 0:4 0.5 Gamma Figure 5.2: Figure shows the final error rates of different methods with varied 7. 99 into the positive orthants, which is equivalent to setting all the negative weights to be zero. It is easy to verify that the projection step does not change the theoretic properties of Ban- ditron, in particular the mistake bound (of course only with respect to linear classifiers U in the positive orthants. We call this projected Banditron, Banditron_Pos. For each tested algorithm and for each data set, we conduct 10 independent runs with different seeds for randomization. We evaluate the performance of online Ieaming by the accumulate error rate, which is computed as the ratio of the number of misclassified sam- ples and the number of samples received so far during the online learning process. Since these online algorithms rely on the parameter 7 to control the tradeoff between exploration and exploitation, we examine the classification results of all the algorithms in comparison by varying 7. The step size 17 of online learning often play an important role in the final performance. For the proposed algorithm, we set the step size according to Eq. 5.12. Because the exponential function may exceed the upper bound of a real number with double precision type in a 64-bit computer, we further multiple the step size with a small factor (typically 10—5) to avoid this issue. 5.4.1 Experimental results Figure 5.2 compares the average error rates of the online algorithms with varied 7 values, and Figure 5.1 shows the average error rates of the three online methods over the entire online process. For the proposed algorithm and both version of Banditron, we choose the optimal 7 that results the lowest classification error rate. First, by examining the classification performance with varied 7, we clearly see that the exponential gradient algorithm shows comparable performance compared with the original Banditron algorithm for online multi-class learning with limited feedback. In particular, we observe that the proposed algorithm performs significantly better than the Banditron algorithm for three data sets ’OptDigits’, ’Pendgitis’, and ’Nursery’. The result indicates that the proposed algorithm is overall more reliable. Notice that for all data sets except for 100 ’Nursery’ data set, we observe a significant gap between online Ieaming with full feedback and online learning with partial feedback, which is due to the limited feedback from the adversary. Second, we compare the learning rate of all three algorithms. We observe that the pro- posed algorithm overall exhibits a significantly better learning rate than the Banditron_Pos algorithm (i.e. Banditron with positive weights), for most data sets and most part of the online Ieaming process. This result indicates that the proposed online Ieaming algorithm with partial feedback is generally effective in reducing the error rate. Finally, notice that these algorithms are sensitive to the choice of parameter 7. In Chapter 6, we provide more details on the exploration vs. exploitation tradeoff parameter 7 and provide effective algorithm to automatically tune it. 101 ”L‘s. .... ...—m, Chapter 6 Robust Online Classification With Bandit Feedback As we have already seen in Chapter 5, exploration vs. exploitation tradeoff strategy is the main tool to develop online classification algorithms with bandit feedback. The major prob- lem with utilizing this strategy is the sensitivity of the resulting algorithm to the exploration vs. exploitation tradeoff parameter. In this chapter, we propose three learning strategies to automatically adjust the tradeoff parameter for Banidtron. Our extensive empirical study with multiple real-world data sets verifies the efficacy of the proposed approach in learning the exploration vs. exploitation tradeoff parameter. 6.1 Introduction Exploitation vs. exploration tradeoff strategy has been widely applied to develop online learning techniques when the feedback provided to learner is bandit, i.e. the learner only receives the cost of its action but not the cost of other possible actions. Exploration refers to the choice of an action not recommended as the best action by the current model (classifier). It allows the learner to explore the game and receive the feedback for different strategies and gain new knowledge from the adversary. Exploitation refers to choice of the best action 102 according to the current knowledge in order to maximize the gain. These two objectives are complementary, but opposite: exploration leads to maximization of the gain in the long run at the risk of losing short term reward; exploitation maximizes the short term gain at the price of losing the gain over the long run. A careful tradeoff between these two objectives is important to the success of any online learner utilizing the combined strategy. The challenge of online classification with bandit feedback is that after classifying an instance, the learner only receives the loss value for those hypotheses that have the same prediction as the current hypothesis. This means that the learner is not able to explore the whole hypothesis space if it only classifies according to the current hypothesis. As described in Chapter 5, Banditron [5] utilizes the exploration vs. exploitation tradeoff tech- niques to handle this challenge. This tradeoff is explicitly captured by a single parameter 7 6 (0, 0.5) in Banditron: with probability 1 - 7, the learner will predict the most likely class label based on the current classification model (exploitation), and with probability 7, the learner will randomly choose one of the remaining class labels for prediction (explo- ration). Figure 6.1 shows the performance of Banditron for different data sets by varying the value of 7. The best 7 values for data sets ’Protein’, ’Pendigits’, ’Isolet’, ’Nursery’, ’Opt- digits’, ’Letter’, and ’Mnist’ are respectively 0.1, 0.25, 0.2, 0.15, 0.25, 0.35, and 0.15. It is clear that the performance of Banditron strongly depends on the value of 7 and and it is therefore very helpful to develop strategies to automatically tune this parameter. Intuitively, at the beginning of the learning stage, due to the fact that classification model is trained by a limited number of examples, it is likely that the classification model will perform poorly. As a result, it may be more desirable to have a large value for 7. As the Ieaming procedure proceeds, the classification model is updated with sufficiently large number of examples, and therefore is likely to yield accurate classification performance. Therefore, it is desir- able to reduce the value of 7 and the amount of exploration with increasing number of Notice these plots are extracted from Figure 5.2. 103 MNIST NURSERY 0.3 . . - 0.6 . a 0.7 0.55. m E 0.6» ‘5 III 0.5 El 04 0.45: I . J r 0.4 r . A r 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 gamma gamma PROTEIN LETTER I 0.6 . - 0.96 . . .1 '1 0.58 I g ‘3 0.56- L E m 0.54- 0'50 0.1 0.2 0:3 0:4 0'840 0:1 0:2 0:3 0:4 gamma gamma PENDIGITS OPTDIGITS 1 ~ - 0.9 . . 0.9- §0.8- f3 3 § LEI 0.7 m 0.6- 0'50 0:1 0:2 0:3 0:4 0.5 0'50 0:1 0:2 0:3 04 0.5 gamma gamma ISOLET 0.95 . . Error rate L 0 0:1 0:2 03 0.4 gamma Figure 6.1: The error rates of Banditron with different choice of 7 for different data sets 104 training examples. Theoretically, as suggested by [5], the choice of 7 = 0(tTI/3) pro- duces the optimum result in the agnostic case. This is because by minimizing the mistake bound provided in Inequality 5.2 with regard to 7, we obtain 7 = O(t‘1/3). However, as we show in this chapter, the adaptive choice of parameter 7 should be not only dependent on time t but also dependent on the number of correctly/incorrectly classified instances to control the speed in which we reduce the amount of exploration. 6.2 Related Work To the best of our knowledge, this is the first study that aims to learn the exploration vs exploitation tradeoff parameter for online classification with bandit feedback. Since explo- ration vs. exploitation tradeoff parameter is widely utilized for the problem of multi-armed bandit, here we briefly describe the tuning techniques for this parameter in multi-armed bandit [6]. However, none of these methods utilizes the classification specific information, e.g. the number of mistakes. In the simplest form, called 7-first strategy [6], a pure exploration phase is followed by a pure exploitation phase [46, 95]. Evan-Dar et al. [46] showed that to obtain an a-optimal arm with probability 1 — 6, 0(EKZ log(éf—D rounds of exploration is needed. The problem with this approach is that it cannot produce arbitrary small regrets. A second approach, called 7-decreasing strategy [6], is similar to the approach proposed by Kakade et al. [5] for the problem of online classification with bandit feedback. In this approach, 7 is a decreasing function of time t. Several decreasing function have been proposed including 7t = 0G) [6], 7t = 06—0—ng) [96]. and 7t = Otis) [5]. Another approach, the Boltzmann exploration, chooses each arm with a probability proportional of their obtained reward [96]. A temperature parameter can be utilized to smoothly switch from pure exploration to a pure exploitation. Notice that except 7- decreasing strategy proposed in [5], no theoretical results are known for the other methods 105 Algorithm 10 The Banditron Algorithm 1: Set wg = O,k=1,...,Kand00 = V*(W0) 2: fort = 1, . ..,Tdo 3: Receive xt 6 Rd Compute Q} = arg maxls ks K xtwac—l Choose sampling probability ’Yt Setpk = (1 -7t)[k =17t1+7t/K.k = 1..-..K Sample 31;} by distribution p = (121, . . . , p K)~ Predict fit and receive feedback [yt = fit] Compute 6; = lilt — lgtifitiytl V where 1k stands for the vector with all its elements yt being zero except its kth element is 1. 10: Compute Wt = W“1 - xtdtT 1 1: end for 4: S: 6'. '7: 8 9 introduced here for the problem of online classification with bandit setting. 6.3 Balancing between Exploration and Exploitation 6.3.1 Preliminary Algorithm 10 shows the Banditron algorithm [5], which is exactly the same algorithm as given in Algorithm 7 however it uses an adaptive 7t to emphasize that the exploration vs. exploitation tradeoff parameter changes over time. Theorem 14 provides a new form for the mistake bound of Banditron. The proof of Theorem 14 is provided in Appendix A.16. Theorem 14. Let K be the number of classes. After running over a sequence of examples x1, . . . ,xT, with ”Xt ”2 _<_ 1 for all t, the expected number of mistakes made by Banditron, denoted by E[M], is bounded as follows T T T T A K A EiMlSWli‘i'E :(t(U)+E 2% +13 E:’7t[yt=ytl+ E :Tfllyti‘éyt] (6-1) t=1 t=1 t=l t—l where U is any arbitrary weight matrix (classifier) and {7t}?=1 are exploration vs. ex- ploitation tradeofir parameters of trials. 106 Remark: It is important to realize that by a proper re-scaling of the complexity of U and margin as suggested in [97], the bound provided in Theorem 14 can be rewritten as: T T T T ElM] S EMU) + E :72: + IUIF 23 ZTtigt = 11:] + 2%[1’11 75 yr] (6-2) t=1 t=1 t=1 t=1 which is the inequality used in [5] to obtain the bound of Bandtiron given in Equation 5.2. More specifically, the bound in Equation 5.2 is obtained by two relaxations in Inequal- ity 6.2: [fit = gt] 3 1 and W S x/E + x/b. Moreover, notice that €t(U) is the hinge loss with margin equal to 1 in both inequalities 6.1 and 6.2. Given the bound provided in Theorem 14, the optimal set of sampling probabilities {7t};F=1 will be evidently obtained by minimizing the mistake bound stated in Theorem 14, i.e. T K T L 2 Z _[gt 7g yt] + Z7t(1+[17t = ytI) t=1 7‘ t=1 However [31} ¢ yt] and [fit = yt] are not provided as the feedback in the bandit setting and we need to approximate them with an expectation in terms of 17. We consider the following approximations in Section 6.3.2 : ' [lit = yt] = Et [Effiglgtzyd] S Et [5? =§_3t=y 1] S 2Et [[17t = 01111;} = yd] = yt m ' [lit 3'5 yt] = 1 — [371: = tn] 5 1 - Et [Ii/‘1: = @101: yd] = Tt To understand the merit of the above approximations, we analyze the following two ap- proximations in Sections 6.3.3 and 6.3.4 respectively and use them as the competitors in the experiments. ' [and S 1and[@‘t=yt1 Sflt 0 [fit 2 yt] g 1 (which is the relaxation used in [5]) and [1h 9'5 9t] S ”Ft 107 6.3.2 Finding Optimal 7 using [3’], 7E gt] S 7', and [3, = 3),] 3 pt In order to bound the quantity L = 2:1 753} + 221:1 7t(u + 1), we consider a general family of 7 that is defined based on a concave function. We introduce the concept of good support fimction. Definition 15. A function 02(2) defined in the domain of 2 2 0 is called a good support function if it satisfies the following conditions: (a) 02(2) is concave for z 2 0 and 02(0) 2 0, (b) 02(2) is monotonically increasing, i.e., 02’(2) > 0, for 2 2 0, (c) 02(2) is Lipschitz continuous with Lipschitz constant L, i.e., 02’ (2) S L, for 2 2 0, and (d) there exists a constant p _>_ 1 such that for any t 2 0 and 2 Z 0, we have w’(2) S ptw' (2 + t). Proposition 6. (1) 02(2) = (a + 2))‘, with A E (0, 1] and a > 0, is a good support fimction, with Lipschitz constant L = Aux-1, and p = 6(1—A)/ 0, and (2) 02(2) = ln(a + 2) with a > 0 is a good support function, with L = l/a, and p = el/a. The proof of this Proposition is provided in Appendix A.17. In order to bound quantity L, we introduce two good support functions 021(2) and 022 ( 2) , with 021(2) 2 05(2) for any 2 2 0. We define WE (213 1 + 6:) 7t : I t—l 2W1 (22:1 Ti) (6.3) It is straightforward to verify that ’Yt E (0, 1 / 2]. In addition, since 021(2) and 022(2) are two concave functions, 021(2) and 02'2 (2) are non-increasing functions of 2, leading to a decreas- ing function of M and t and increasing function of 'rt. The following proposition shows a key property for the construction of 7t in 6.3. The proof is provided in Appendix A.18. 108 Proposition 7. Given the construction of 711 in (6. 3), we have the following inequalities: T 022 (T) T 2 “’2 (Zirzl Ht) 2’72: S 102 2 T , 27%: S 102 I T t=1 2011 (thl Tt) t=1 20)] (Zt=1 Tt) T K 021 (2:31:10) 2: _Mt S 2P1K 2 T t=1 7t “2 (21:1 1 + Mt) where p1 and p2 are the constants defined in Definition 15 respectively for 021 and 022. Theorem 16. Let 021(2) and 002(2) be two good support fitnctions with 021(2) Z 02% (2) for any 2 Z 0. By running Algorithm 10 with ’Yt set as in Eq. (6.3), we have the following bound for the expected number of misclassified examples T p2w2 (T) 2P1 K021 (3T) 022(2T) EIMI S #21001) + 202’1(T) + IUIF (\/ 02’2(3T) +P2 202((2T) where p1 and p2 are the constants of two good support fimctions. Proof. The proof is straightforward by using the result in Remark 1, Proposition 7, in- equality Va + b S ([6 + x/b, and considering the fact that for a good support function 02: 02 (233;, 7,) _<_ am 5 02(2T) and 02' (23;, p.) 3 02’(2T). III Now, using the above theorem and Proposition 6, we have: Corollary 17. Suppose 7t is in Eq. (6.3) with 021(2) 2 (1 + 2))‘1 and022(2) = (1 + 2))‘2 where A1, A2 6 (0, 1] and A1 = A2 + 1 / 3. By running Algorithm 10, we have the following bound for the expected number of misclassified examples 1 3K E[M]<:lt()U +:—2—( (1+T)3+|U|p ”2:2 (1+3T)3+\/p——§_T(1+2T)3 where p2 = (Bl—AZ. This bound is of 0(T2/3) and similar to the bound of the original Banditron. 109 Proof. It is a simple plug-in of the two support functions in Theorem 16. CI 6.3.3 Finding Optimal 7 using [39} 7é yt] _<_ 1 and [’y} = gt] 3 at In this section, we use the the upper bound approximation L = 2&1 % + 23:1 ’Yt(1+ltt)- Given a good support function 02(2), we define ”it as t—l 1 I 7, = 51:02 (21+ 11,-) (6.4) It is straightforward to see that ’71: is valid since ’7t 6 [0, 1 / 2]. In addition, since 02(2) is a concave function, 02’ (2) is a non-increasing function of 2, leading to a decreasing value for 7,3 as more and more training examples have been classified correctly. The following proposition shows a key property for the construction of 7t in (6.4), with the proof provided in Appendix A.19. Proposition 8. Given the construction of ’Yt in (6.4), we have the following inequalities l/\ T _p— 5 2KLT an< _ 21101:” : Z713 2Lw(T)’ 1,27: 0)’(thr=11+flt) Using the above proposition, we have the following theorem for the mistake bound of dynamic 7 introduced in 6.4. Theorem 18. Let 02(2) be a good support function. By running Algorithm 10 with ’Yt set as in Eq. (6.4), we have the following bound for the expected number of mistakes made by the algorithm EIM] < {i +p02(T) + IUI 02(2T) + 2KLT _ t—l F p 2L 02’ (3T) Proof. Similar to Theorem 16. E] 110 The following corollary directly follows from the result of Proposition 6 and Theo- rem 18. Corollary 19. By running Algorithm 10 with 7), as in Eq. (6.4) and 02(2) = (1 + 2)“) , where )1 E (0, 1], we have the following bound for the expected number of classification mistakes T 1-I\ 1-I\ A l-A 1 E[MISE €t(U)+E——(1+T)’\+|U|p t-3——(1+2T)?+\/2K(1+T) 2 T2 t=1 2’\ V20 When A = 2/ 3, we have E[M] = 0(T2/3) which is the same convergence rate as Banditron. 6.3.4 Finding Optimal 7 using [fit 7é 3),] S T, and [3’], = yt] _<_ 1 Similar to the approach presented in the previous sections, we set ”It as I t n = , “221 (6.5) 2021 (22-21 Ti) where 021(2) and 022(2) are two good support functions and 021(2) _>_ 025(2). It is easy to verify that 7t 6 (0, 1/ 2] due to the properties of a good support function. The proposition below allows us to bound 2:le ’Yt and 2&1 K / ”Yt- Proposition 9. Given the construction of 7t in (6.5), we have the following inequalities T K ., (21:1 6) T 626) — < 2K & < Z; ’Yt Tt — p 02$(T) g,” — 2011 (El; Tt) Proof. Similar to Proposition 7. III Theorem 20. Let 021(2) and 022 (2) be two good support fimctions. By running Algorithm 10 with 7t set as in Eq. ( 6.5), we have the following bound for the number of misclassified 111 examples M = ELIE: 75 ytl T 022(T) w2(T) M 51M] 5 EMU) + 20', (23:1 n) + 'U'F (V 202; (T) + \/2K”1 025(1)) Proof. The proof directly follows Theorem 14 and Proposition 9. CI The following corollary directly follows from the result of Proposition 6 and Theo- rem 20. Corollary 21. By running Algorithm 10 with ’11: in Eq. (6.5) and 021 = (1 + 2)’\1 and 022 = (1 -I- 2))‘2 with A1, 2\2 E (0, 1], we have the following bound for the expected number of misclassified examples T 1 A _ E[M] _ t§=1:8t(U) + —2/\1(1+T) 1 ,\ +1—,\ 2k l-Al A +1-,\ + IUIF (l2—/\1(1+T)_2T'l+(l :2 (marl—24 with A1 = A2 + :1; we have E [M] = 0(T2/3) which is of the same rate as Banditron. 6.4 Experiments In this section, we conduct experiments on the classification data sets, introduced in 1.6.1, to validate the proposed strategies for balancing the tradeoff between exploration and ex- ploitation. 6.4.1 Experimental Settings We refer to the algorithms developed in Sections 6.3.2, 6.3.3, and 6.3.4 as banditron_ag3, banditron_agl and banditron_ag2. To evaluate the classification performance of the 112 three proposed Ieaming strategies for exploitation vs. exploration tradeoff parameter 7, we compare them with three different version of Banditron, namely, Banditron_worst, Banditron_Best, and Banditron_ag0. Banditron_worst and Banditron_Best are Ban- ditron algorithm when 7 is set to the worst and best value for a given data. Banditron_ag0 is the Banditron with the adaptive ’Yt = %t-1/ 3 as suggested in [5] for the general ag- nostic case. We repeat each experiment 50 times by generating random sequences of in- stances and report the average accumulate error rates, which are computed as the ratio of the number of misclassified samples to the number of samples received so far. For all three proposed methods in all the experiments, we use similar good support functions 02(2) = 021(2) = 022(2) = (1 + 2)’\ with A = 0.1 for a fair comparison. Also notice that the result is pretty stable for most of these data sets with different values of A. 6.4.2 Experimental results To study the behavior of different Ieaming algorithms over trials, we show the average error rates of all the methods over the entire online process in Figure 6.2. First notice that there is big gap between Banditron_worst and Banditron_Best in all data sets that emphasizes that the Banditron algorithm can perform very poorly if 7 is not set appropriately. We observe that overall the proposed algorithms exhibit similar or better learning rates as the Banditron algorithm with the optimal 7. In particular, banditron_ag2 and banditron_ag3 yields the best performance among the algorithms in comparison. In almost all the data sets, banditron_ag2 and banditron_ag3 perform significantly better than banditron_agO which suggests that 7t 2 %t—1/3 is not a good adaptive choice. As a few examples, notice that the final error rate of banditron_agO is 45% versus 38% error rate of banditron_ag2, banditron_ag3 and banditron_best for MNIST data set. For Pendigits data set, the final error rate of banditron_ag2 and banditron_ag3 is 56% which is significantly low compared to 60% error rate of banditron_best and 62% error rate of banditron_agO. The latter example also suggests that our adaptive strategy is better than the Banditron with a single best 7. 113 Error rate Error rate Error rate ---B-- Banditron_Best - o-~ Banditron_worst - 0 - Banditron_ago ‘ Banditron_ag1 - + - Banditron_agZ - 4+ — Banditron_Ag3 NURSERY o 0000000 :éat}; \ 0.45» :8 83233 g, 8 0'40 5000 10000 15000 Training rounds LETTER 1r 0953, ..., \\ °~$ '8'”- «jj, - 1’ ~ ‘ ‘ - ~0‘e6- "mm: 0.9' ‘: \ '0- - a‘flzfi. e-$ 0-0 "‘3‘: ._ 0.85- 1.13:4; 0'80 5000 10000 15000 Training rounds OPTDIGITS I ‘28: ‘1‘). - 0 7 “WK: ~ ‘-" . (1.: #1: : :#‘°~ 06 28:33; ~70~° 8213532 0.5 0 1000 2000 3000 4000 Training rounds MNIST ‘I o ~ _ , E ‘ "if” . _ h- » ‘3- ’: 1 ‘ 5, . Lg0.5 ~u~n~~2~~22 '2. 2 0 - . . 0 12- . g 6 raining roun s 4 x 10 PROTEIN 0.8- £2 9 go 6 . Mg....0...0...o...0..,0...o .0 ‘1' -‘ . I '1 51-2. LU ahfiflifi$_#_$ 0.4 - . 0 T . . 1 d 2 raining roun s 4 x 10 PENDIGITS i{'0-“0- -o-~--o 2.0.. .-,, o. 0-8'\:‘:.. 0 0111050 0) ~' ‘ *5 it, §0.7’ 1k at" :9 . m “in...“ ‘9? 0.6 ‘15» 51— g 8 ‘ “-1 0'50 2000 40800 6000 8000 Training rounds ISOLET 1. 0.95 ' o E x" if; 2.3%“. 12- 0.9- 9:0. ~; «5-,.. ‘ ‘0- - LU fl‘mfi. .6 ONO 0.85» 3;. . _ 0'80 2000 4000 6000 8000 Training rounds Figure 6.2: The error rates of different methods over trials. Each point on a curve is the average results of 50 randomly generated sequences of data. 114 Although better than banditron_worst, the performance of banditron_agl is not com- parable to that of the other methods. This can be explained by the inherited difference between banditron_agl and the other two proposed approaches. Unlike banditron_ag2 and banditron_ag3 where two good support functions are introduced to determine 7t, the ”It defined in banditron_agl is determined by a single good support function. As a result, we have a better control of the value for 7 over time in banditron_ag2 and banditron_ag3 by a tradeoff between two functions: one which is the decreasing function of time and the other which is the increasing function of the number of misclassified examples. 115 Chapter 7 Conclusion and Future Work In this chapter, we summarized the main contributions of this thesis and draw some direc- tions for future work. 7 .1 Summary and Conclusions We developed several online and batch learning algorithms in this thesis. The batch Ieam- ing algorithms that we covered have the common property that they all utilize boosting for optimizing an objective function in a function space. Utilizing boosting is particularly ben- eficial because it allows any existing supervised learning algorithms be applied for a new learning task. For the online Ieaming, our focus has been on the classification with bandit feedback. In the following subsections, we briefly review our main contributions in two separate sections, one for boosting and one for online Ieaming with bandit feedback. 7.1.1 Boosting We developed boosting algorithms for several classification and ranking problems, as sum- marized below. 0 Semi-supervised classification: Unlike existing semi-supervised learning algo- 116 rithms that focus on binary classification problems, we addressed the problem of multi-class semi-supervised learning directly. We proposed a new framework, termed multi-class semi-supervised boosting (MCSSB), that is able to improve the classifi- cation accuracy of any given base multi-class classifier. MCSSB utilizes both the cluster and manifold assumptions in the design of objective function and exploits boosting techniques to optimize the objective function. We showed that our proposed framework is able to improve the performance of a given classifier much better than Assemble, a well-known semi-supervised boosting algorithm, on several real world data sets. We also showed that MCSSB is very robust to the choice of base classifiers, the number of labeled examples, and the value of parameter C. Learning to rank by maximizing NDCG: Listwise approach is a relatively new approach to Ieaming to rank that aims to optimize listwise loss functions; i.e. loss functions that measure the performance of a ranking model in the query-level. The difficulty in optimizing such losses lies in the inherited sort function used for comput- ing them. We address this challenge by a probabilistic framework for the problem of maximizing NDCG that optimizes the expectation of NDCG over all the possible per- mutations of documents. We present a relaxation strategy to effectively approximate the expectation of NDCG, and a bound optimization strategy for efficient optimiza- tion. Our experiments on benchmark data sets shows that our method is superior to the state-of-the-art learning to rank algorithms in terms of performance and stability. Ranking Refinement: We considered the problem of ranking refinement whose goal is to improve a given ranking function by a small number of labeled instances. The key challenge in combining the ranking information from the base ranker and the labeled instances arises from the fact that the information in the base ranker tends to be inaccurate and the information from the training data tends to be noisy. We presented a multiplicative objective function to combine these sources of information 117 and proposed a boosting algorithm for learning. Empirical studies with relevance feedback and recommender system show promising performance of the proposed algorithm. 7.1.2 Online Learning 0 General framework: We presented a general framework for online multi-class learning with partial feedback using the potential-based gradient descent approach of which Banditron is a special case. In addition, we proposed an exponential gra- dient algorithm for online multi-class Ieaming with partial feedback. Compared to the Banditron algorithm, the exponential gradient algorithm is advantageous in that its mistake bound is independent from the dimension of data, making it suitable for classifying high dimensional data. We verified the efficacy of the proposed algo- rithm by empirical studies with several real-world data sets. Our experiments show the exponential gradient approach for online learning with partial feedback is more effective than Banditron in terms of the Ieaming rate, which makes it more suitable for the scenario when the number of training examples is relatively small. 0 Automatic tuning of trade-off parameter : We studied the problem of optimizing the exploration—exploitation tradeoff in the context of online classification with bandit feedback. We proposed three different strategies to automatically tune the tradeoff parameter used by the Banditron algorithm. We showed through extensive experi- mental study that the proposed approaches are effective in adjusting the exploration- exploitation tradeoff. In particular, we found that two of the proposed algorithms achieve similar or better performance compared to Banditron with the best value for 7. 118 7 .2 Future Work In this section, we summarize future research directions that are directly related to the theme of this thesis, in two separate subsections, one for boosting and one for online Ieam- ing. 7 .2.1 Boosting There has recently been increasing interests in understanding the relation between game theory and machine learning and furthermore examining how each field contributes to the other [98, 99]. Particularly, boosting can be considered a fictitious zero-sum game [39] between two agents: a data generator as a row player that chooses a mixed strategy over the space of training examples and a learner as a column player that chooses strategies over the hypothesis space. The followings are some interesting game theory questions for boosting: e Representability of a given hypothesis for an specific task: Using Minimax the- orem, Freund et al. [39] showed that there is a mixed strategy over the space of hypotheses H that produces zero classification error over the training set if for any mixed strategy over the training examples, there is one hypothesis in H able to per- form better than random guessing. Similar results may be extended to other tasks that also utilize boosting. For example, we utilized the space of binary classifiers to learn a ranking algorithm that maximizes NDCG in Chapter 3. It is interesting to study the ability and limitation of binary hypotheses in maximizing NDCG; i.e. to analyze the maximum value of NDCG obtained by a mixed strategy over the binary hypotheses given. 9 New methods to find mixed strategies: Boosting (and other ensemble methods) can be considered methods to find the mixed strategy over the hypothesis space. However, the designer of these methods did not have the notion of equilibrium in 119 mind while developing them. Designing new algorithms that directly consider the data generator as the row player and the learner and the column player and aim to find a equilibrium solution is potentially advantageous and interesting. One possible option is to learn a finite set of weak models sequentially (similar to boosting) and then playing a game to find the best weighted majority votes (mixed strategy). 0 Batch learning with partial feedback: In this problem, the feedback (i.e. labeling) is similar to online learning with partial feedback except that training instances are provided in batch mode. For instance, consider the multi-class learning problem where each instance is given a class label and a flag that indicates whether or not the given class label is correct. Similar to online Ieaming with partial feedback, contextual advertisement and recommender systems are some example applications of this problem. For these problems, training examples can be collected and utilized for learning in batch mode similar to the click-through ranking feedback that is being used in learning to rank. Designing a boosting algorithm that utilizes a supervised classifier for this problem is one direction of research work. From the game theory point of view, this problem can be considered a game between two players with partially known payoff matrix. 7 .2.2 Online Ieaming Online learning with bandit feedback is a new research area for which there are several open research questions, as summarized below: 0 Tighter bounds: Kakade et. al [5] proved that there exists algorithms for online classification with bandit feedback with bounds of order 0(T1/2), however the algo- rithms that are introduced so far are of order 0(T2/3). Developing algorithms that have better regret bounds than existing ones is one of the future research directions. 0 Online Ieaming to rank and multi-label classification with partial feedback: Contextual advertising and recommender systems are originally ranking problems 120 that were simplified as multi-class problems when dealing with online partial feed- back. An intermediate setting between online ranking and online classification with bandit setting is online multi-label classification in which more than one class (adver- tisement) are relevant. Developing algorithms for online Ieaming to rank and online multi-label classification with partial feedback is another research direction that will be explored in the future. 121 APPENDICES 122 Appendix A APPENDIX A.l Proof of Lemma 1, Chapter 2 Proof Bound in Equation (2.8) can be derived as follows: 1 1 (22::b,b’exp(abf’)>3 >3b, —--, 32w _ bk’ b. k- k’ =1 2 2 The inequality used by the above derivation follows the convexity of exponential function, i.e., I I , h’.c —h’?+2 —h’.c +h"?+2 1 exp(a(hl-c — hk)) < exp fia—z———J——-— + 0 x Z J + 60— % .7 _ 6 6 3 I I hf —h’?+2 1 415° +h’?+2 S ——6—J——— exp(6a) + 5 exp(6a) + 6 3 Using the definition of dbfj, we have the result in Equation 2.9. A.2 Proof of Lemma 2, Chapter 2 Proof. Following the result in (A.1), we have 1 m bklbkz k3 k k k k 37 Z —-'7y—’Jexp(a(hil +11]? —hz.3 —hj3)) 751.7 k1,k2,k3=1 3:] 1 exp_(___2cr)- kk kkexp(201)—1 k+ k 5 75+— 2th]. (Zh'bi'b thbj 22gb]. Zak?” +3“ 124 The inequality in (A.l) follows the convexity of exponential function, i.e., hkl 'i k1 ’62 ’63 k3 _ 6" 0x 2 J " 3 k2 k3 k3 +11]. hi h]. +2 + 1 + 603 E3+2 —h hfl + W hb°3 — h 1 exp(6a) + 3 exp(6a) + < j z _ 6 Bound in Equation 2.9 can be derived as follows m _ k k’_ k k’_ k ..H 2 yiexp(Hj Hj+a(hJ hj)) 7'7] kl,k=1 1 + exp(60:) + exp(—6a) exp(6a) - 1 E: hk m k’ bf yzlc z + 6 i _ _ — 32' ' k’=1 |/\ y . J k 1,] k=1 bf, bi The inequality used by the above derivation follows the convexity of exponential function, i.e., k’ k I 41%“ +hf+2 6 +0X 6 +60§ IA 2 exp 60 exp ) (A2) F . "ul "m /\ |/\ exp The above inequality follows from exp(:2:) 2 1 + x. We rewrite FT as T t=1 By substituting Ft / Ft"1 with the bound in Equation A.6, we have the result in the theorem. 125 A.4 Proof of Proposition 2, Chapter 3 1 1 -k ~k = k k k k 1+exp(Fz. —Fj) 1+exp(Fi _Fj +a(fi "'fj )) k k = ( + 1 J eXP(a(fik - ff» 1 + expwzk _ F119) 1 + exp(Fz.k — FJ’F) 1 + exp(Fik — FJ’F) exp(Fz-k — Ff) exp(Fz-k — FJ’F) 1 1 — + 1 + exp(Fik — F3153) ( 1 + exp(Fik — F?) 1 + exp(Fik — F319) |/\ exp - F(dibqk))] ) IrkEGIfliJ) 2 Z (Prekinqk) (1 + exp [Z(waeq’“) - F(dibqkflm IrkGGgfiJ) (1 + exp [2(F(df,qk) — F(df, qk))]) Pr (aka) > #0)) We used the definition of Pr(7rk IF, qk) in Equation (3.6) to find G§(i, j) as the dual of G50, j) in the first step of the proof. The inequality in the proof is because wk(z‘) — Irk( j ) _>_ 1 and the last step is because Pr(7r’c IF, qk) is the only term dependent on 7r. 126 A.6 Proof of Theorem 5, Chapter 3 In order to obtain the result of :1}: Theorem 5, we first plug Equation (3.13) in Equation (3.11). This leads to minimizing 22:1 2173!“: 1 2%,?ij [exp(a(f;-° — fz-k))] , the term related to a . Since fz-k takes binary values 0 and 1, we have the following: Getting the partial derivative of this term respect to a and having it equal to zero results the theorem. A.7 Proof of Theorem 6, Chapter 3 First, we provide the following proposition to handle exp(a( f f — fz-k)). Proposition 10. Ifas, y E [0, 1], we have exp(3;1) - 1(x _ y) + exp(3a) + e:p(—30r) + 1 (A3) exp(a($ - 31)) 5 Proof Due to the convexity of exp function, we have: z—y+1 1—x+y 1 exp(oz(a: — y)) = exp(3a 3 + O x —3—— + 3 x —3a) — 1 1 — l 3 32—3; exp(3oz) + ——::£ + 5 exp(——3cr) Using the result in the above proposition, we can bound the last term in Equation (3.13) as follows: 91“,,a[exp( (ff—If>—1]<_0§j(———exp(3a) (If— If)+ exp<3a)+e:p(’3a)’2) (AA) 127 Using the result in Equation (AA) and (3.13), we have M(Q, 17‘) in Equation (3.] l) bounded as M(Q,F)sM(Q,F)+v+el——3——p(3a) ZZZ“ 26,-‘1, (If—ft) mk 7.119 k 1'. __ - exp(3a)— k 22 —2.7 k —M(Q,F) +7(a)+ —— 2 E3132 Tam k=12=1 3:1 A.8 Proof of Theorem 7, Chapter 3 Proof By plugging Equation 13 into Equation 11, we have rk ijzb) —M(Q,F) s 21:42:}: 1 Z—gfibfij [exp — 1] Since fik takes binary values 0 and 1, we have the following n mk {bk-10k k k 2:1 :1 Z [0,,jexpII,-k>+exp(- e1) I§>+I) 128 " "_I'.:|"I\?V I "sl So, n "U; 27'2- MIf)+exp<—a>1fk)+1(fkf,k) = l + exp(— 0):“: f— jI(fk< ( 2 TM exp(Fj -F'i +0‘(fj -fi))) ( Z 0233' 6XP(0(fj -fz')) ( Z: bi,j exp(an —fi)) i,j=1 2',j=1 where ai , j and bi , j are defined in (4.11) and (4.12). Thus, we have an upper bound of the log ratio as follows i n 11 long log ( Z aiaj exp(a(fj — fi») +10g ( Z bi,j exp(a(fj - ft») iJ=1 iJ=1 n —2 + 2: (am- + bi,j)exp(a(fj — fi» z',j=1 |/\ The second inequality follows the concaveness of the logarithm function, i.e., logx _<_ x — 1 for any a: > 0. C] A.11 Proof of Theorem 9, Chapter 4 Proof Using the upper bound expressed in Lemma 5, we have ~ L n mfg-+2 5 Z 7,,jexp) z',j=1 = ( Z 1i,j6(fj,1)6(f,-,O)) exp(a) + ( Z 7,,j6(fj,0)6(f,~,1)) exp(—a) z',j=1 2',j=1 Using the definition of a in (8), we have I: ’n. n logfg 5 -2+2 (2 7i,j5(fj,1)5(fi,0)) (Z 7i,j5(fja0)6(fia1)) z',j=1 i,j=1 = —2 + Law 131 {L- v.1 In the above, we use the definitions of p and V in Theorem 8 to simplify the expression. Since 2 = 22j=1 72.1.7. 2 p + V, we have ~ L log?)2 5 —2+2,/;u/ S —p—V+2,/;w= — (fi— WV P We thus have Li 2 log Lt—l S rt : " (W'- M) P Substituting the above expression for rt into (4.17), and further using the fact 11. L0 = Z T,,,-+W,-,,-, 231:1 we obtain the result in Theorem 9. E] A.12 Proof of Theorem 10, Chapter 4 Proof We rewrite the quantity 17 as follows: 72 n n 'n '7 = ZfiinWi|=Zfi Eng-71¢ = Z 7i,j(fi“fj)=”‘” i=1 i=1 '=1 gj=1 Since u—v=(¢fi—W)(¢fi+fi)2(¢fi—m2, we have n 2 (fl — W)? Substituting this result into the expression of Theorem 9, we have Theorem 10. C] 132 A.13 Proof of Proposition 5, Chapter 5 Proof yt yt pm 8 2- Etnatlfii = Et [ [ =12 1,. _1~_U_ 1_ [fit = fit] p371: = {3’1}: ytlEt [ 1A __ 1~ [yt = yt] + [gt 7’; yt ]Et yt yt ~ yt 2 3% #w _ A_ 7(K-1)/K _ [yt‘ydm [Milli—m”err/3} fir—7+[iit #yt]{1_7+% (1+ [g]s)2/s} |/\ A.14 Proof of Theorem 11, Chapter 5 We take the expectation of both sides of the equality in (5.8) with respect to fit, denoted by Et [-,] and have E, [04,...(U, Wt_1) — 0,1,... (U, Wt) + D¢*(Wt—1,Wt)] (Wt—1 — U, nxt'rtT ) We define Mt = [fit 76 yt]. Since 37, 79 yt implies V€t(Wt—1) = xtTtT, using the convexity of the loss function, we have (€t(Wt_1)—€t(U))Mt S (Wt—l—UaWt(Wt—1)) = (Wt—1 — U, xtrtT) 133 We thus have 1 _ _ 6E1; [D¢*(U,Wt 1)—D¢*(U,Wt)+Dq,*(Wt 1,Wt)] * is a strictly convex function with constant p with respect to H - ||p,3, according to Lemma 6, we have D (A B) < 1 A B 2 q) , _ 2p” II,,, where p"1 + q—1 = 1 and 5‘1 + t"1 = 1. Hence, E[Dq,(6t—1,0t)] l/\ 2 77 T 2 ZEllxt‘st lq,t] 2 2 77 2 2 77 2 EEH‘Stlslxtlp] S 2—pEH5tls] |/\ where the second inequality is due to Holder’s inequality. Using the result in Proposition 5 and the fact 27le €t(Wt-1)Mt 2 £3le Mt, and that E[M] g E[M] + 77‘ we have the result in the theorem. 134 A.15 Proof of Lemma 7, Chapter 5 (W — W’, V*(W) — V*(W')) || IM (5:1 [)2 Kd(w WW“; =22 ’ Maul The second equality is due to mean value theorem and uses the Taylor expansion of log function where W = AW + (1 — A)W’ with A 6 [0,1]. Since (2?: 1 Wak— 22?: __ r , 21“]ch Z M231: ‘Wz'Jc' and 2:121 W“: =1, we have K (W — W', V*(W)— VcI>*( W’)) >kz| |w,c — chfi Using the property of Bregman distance in Lemma 6 and the fact the dual norm of L1 is Loo, we have the result for (0). A.16 Proof of Theorem 14, Chapter 6 Proof Considering that Banditron uses a second-order potential function, we have the following bound when 37 is used as the predictor: 1 1 2 MW) — Z W) s min + §|Wt—1 — th%~ s Ivlfp + 5E [WW] |/\ |U|F+E{Z’Yt[yt =ytl+ Z —lyt #11111} t- l 135 where we used Theorem 11.1 of [83] and Lemma 6 in the first inequality, let ”2 5 1 in the second inequality, and Lemma 5 of [5] in the third inequality. Using E[M] S 2:le £t(W) concludes the theorem when we add E [2221 7,] to get the bound for 37 [5]. [:1 A.17 Proof of Proposition 6, Chapter 6 Proof. We only show the result for w(z) = (a+ z)>‘. A similar derivation can be applied to w(z) = ln(a+ 2:). We have L = max———)‘——= A—l Aa 220 (a + z)1’)‘ To derive p, we have (a + z + t)1_)‘ ( + )1_’\ S(1+t/a)1—)‘Set(1—’\)/a Hence p = e(l_’\)/a. E] A.18 Proof of Proposition 7, Chapter 6 Proof By defining At = 2:21 Ti, we have Kwi(At_1)(At — At..1) < 2K 2;"; w'1(At—1)(At - At—l) t—l — wé (22:1 1 + #2) wé (2:1 1 + M) T T 2 K 2 2 —Tt :- t=1 where the inequality is because w'2 is a non-increasing function. We also have: T T Z w'1(At-1)(At — At—l) S pl 2: c0'1(At)(2‘1t - At—l) t=1 t=1 T S p1 2601040 - w2(At—1) S p1W1(AT) (A7) t=1 where the first step is due to the definition of good support functions, the second step is due to the concavity of “’1 and the last step is due to the telescope property and the fact that w1(0) 2 O. Combining the above 136 results produces the first inequality in the proposition. The proof of the second inequality in the proposition is similar and follows. By defining Bt = 2::1 pi, we have T T I T I w (Bi—1 +t-1) w (Bt—l) 2:7tl‘t = E, 2 ___—2 (Bt-Bt—l) t=1 B—B_ S H 2w’1(At-1) (t ‘1) 22404-) 2&1 w§(Bt—1)(Bt — Bt—1)