... . 0.6. “111““. h; ..
: _ 0 . .... . . . .. .....2. 0.0.0... 00.. .0... .03. ‘5. +7... 014:...5-4fﬂ 1.05.9250 cﬁi 26.00%:
.. .1. ...- . 0 0 3.6.6.. .0 ... . .. - 8. .00... ... .. : .... . . .0. 0. . . .. 0 . ._ -. ... ... ..- .... .... .0 J. ... . .—...6.. ..0. .mef...i..Mu.'t.000m. can»... . 0
6 .0. 01. . 0‘0. 6.01:: a. a . 0 0 0 .0. ... Q 0. .0.-I: . .00 . 00. .. . H. 0 . ..- .00 .0000. '0 u. 0 .0 .. 0 ..0. ...00. .0. 0 )0 9.0 ...-0. . . ...: . 0. 211.1.- ..6 . ...? .... .. .0 0.. . 060 0.. . 1. 00 0 0 . -
... 00...... ... mung? 3.11.03”. M0 ..0. 1.304:- 00...... ... .0. £0... ”36......30. .. ...» ...." .06 ..0. .0....0. .u... ...-(M. hum-..- .. an." .0.............ur.s... ..0... ...-....6-L L... ..0 . o ... 0.2.3:...1 . J 2.2.18.2“... . ..0... . .. .0.-.1101"... . . ....... . .... . 0. 6 .- .... .....9 : ﬁt.‘:«.h..§.1..t...3 ......0.. 0......
0.; .. . .....1u...0 02% “1.0.0:... .0» 6.0651,. 3... ..1.6 306 ......0‘. ......6 13...... . .... -0... 0 0 . 0... .. 3.1 o. . ...: .... ...: .. ... 1...... .l...’ .... 1.0.2.696! ..0 . .....2 ... 0.. .. ... .. (2210”. 005.15.12.31: 0.9.. . ..0.- 0341..
. . $00020. .. . . u. 00...? 9.0.0.“? it” . 600 - . .J. .001...-0.-O0.0‘.0 . . ...: 0.0.. 0. 0... .903... ...-.20 ..9. .. 6 . 2 ...... ... . . .... . . .36». ... ..0 . .0... . 0.1!. ... .00 00 0- 00.)0000»l’-. 0-..0. 1....,..'.r0.00‘03.0
13.009??? 0.3 . I'm-.00 .50....“ w... N.0u7.l0u...l I. $.00 000003.016n9 .4 .33.???” 3.3.0.600. 00.0.0 ....0 . . .60 006 .. 60.05”»...060 . 0 :30... 0 00 00.0 . . .6. .- .. 0... 300.006.... ... 000 . .0 .06.. . ; . . .... ...... .P 0.0 60. . . 0. L 0.0.... 0-. . . 0. 0-. ...-00 ..0 0. 0 0. ...0‘360100 0‘0. 0
. . 1.1:... 3......222121 0030.09.930h...0|..0.10.6.:.0.600..:.$-.0 0 . p.10. 6! ... . . 00 . ...013 80.6.1... .. 01.03.01 . 6o ..0. . . . .6: ...: ...... s . .6: -6 .. 6 .. . .. . . c. 20... 1 00.2.2. . ... . .1.0..|..0- 01653th5...«”0\.§..00.
.n. 0 0ng0 _ 0a.. . .01 .l' 0.1.0. 9.0.0. O“! 0. . O0. 0 . 0‘. .0001. .0000 0.0060 ...-0 60... .000 .1... . .0... ..r0- 16-. ..0. .0 00.. Q 900 .00.. . . . 6 00 . . . .. . 1.000... .6603. . . .. 0. 6- .10 ... .0 0.. .30 s .0.-1.. 0 .0. ..0 . 00. 1. .‘0060 060. ..’.006 A .0)
P . 0... a. ... 0 .10 1.4.0.0001 ..0. 0.... J V (.0... - - . a .0.. ...".100...0... 01... . . L $0.11.... ..0 0. . 6:0 . .0.-.19. . . . 1344-10.. 1. ... .o.0.. . ...... I ...... 1... . 0.....0. 6. . . . ..0. . 0000.: _ .. . . ....u “0.8.0.3515. ...). nut“ .03..
. lOu‘JoS...n0Umk£ . 00 ..l 169 T300316. . . . ...3 .010: . . ,. ... .00 .26.“..3 ... 0. .. .».. . . ..0 130 000....0. 000.. ..0. ..0: ...: .... . .. .. ..0. .9 . 1...: 601. 0.00““ .0 .u-07 .1}
- 0. 9 0. 00 .6 00.. . 0 . . . . ... . ... 6 Q- . . . 9 100.0 n .a . . .00 .....0 . . -. . . .0 ‘2 . 0....
. prawn. L 90v“- -1010 a u .. n. ”“1003! }u“.h’v"'.’. 9'0 .0.-HO.CW06600“W0-t1.h.-3 0 < .. 5. Joann-l "Nnuho... 000100. u ..u0h...’ . . ..‘0 .n . 0.004 Q, 0 0.. .uuwﬂotu. - 6 0H . .nshv. ('9 0 n 0. . n » 0 .. v 0. Ldt0. .. 01.1...“ Q.» “mm: .0 .é—‘Uoﬁn .0“
. .9 ..0 . . . 0 '0: ..3» o 00 09“ 00 . .0 0-,...00-1. ‘0 00 0.0 . .0. ...00 ... V 3 0 . 00. .1. 0 0000 ..0 . . .. . .. .0 6' .. ﬁg; .. 9 l . .0 .00. .f. 0 .0 0 .00. ... . .. a 0 0. 00.0 \3. .. ’ 0. 0. .6 .30. 0 .0. 0
00. 0 ..0 00”“ 10%» Q. 0. . ? um... .h 0‘. 0...?(9. 9.01.9.0 {91.043". . ..I 60 0 0.3.650."- 000000” m.”- 0£éwth09h0006r060 6115013. 1.30. “000000.. .16 "HR-40.00... ..0. . . . 30.0.0 .m.....~ 0 :00. 1.01.A0:... 0, .3 0 .00.? .0 .. . ”.... ......1- . .00. .. 0 "01.. ..u ”.0. 000.810.00. . .1. O .00. 9.30.?
J 0..” ..0... .90 0 .. 0 0. 30 .0 .Q0c. . 0.00000 0060...... 00.000 000666... 0. 0 I . 6 . 7 0. - . . . 0. . .13.. .. .0 n - .0. 0.0. .. ...0.. -. . 1... .1 . t o .0.. ..0 06‘ L0. 0 . .
6.10%.” 0. a...” 6... .01.. . W100“. 6a.... 003.00... =1 0‘. «0000.. 0:21:63 .1..00| 0.010"- 0 0 $3904. "u. 009v.0.0..0wn.... . I. 6W00.0u. ”-000 ”6.0”... ..0-"00 ..0” ”.0.-00.0 r... :9. 0.40.00. «row-67.. .0350. 0 0 . ..0“ x .6 . . . r. . a . 1...”. ..0. 0-.60... . o 3..... 000v.1.50»u0.... 00”! 00.1.6.0 .6 ‘QEXQQS .
.vruuﬂr: '03. a. 0..3.0 an 9-0 {0,0159 .W’ 0- ...-00 0 ‘3 60600.1 . 00 0.00,.- .0... 600000.011}. ’3 .0.... 0. ..0 0 ...0. .. .... . . .0 -.- .0. o .. ... ...0 r ... . ..a n 00.0... 0... 0... 9:00-03 50-"...6| .00 0‘: 001.0. ...-0.0
.Q. .6: 0... Al|0Q60 0.0. .0‘..‘0 . . ... 91.0066 . ~00". n0. 6.6 30.9.0.4 | 0. . .0. . ..0 . ..0. .00 .u . . ... v .10 . 00...... .00.? ..0. .- . ...... 0 . 0 ..60 0“. 09 31.1.0.0..- 0130"”
ﬁJQ-i... 0..“ ....0000 ..u. 9.. 4.0. L 0‘”! , . 0. . 0. 30000000.- ...: .- 80.0 E a... . . I. ... . ...! ...L.........180-. .0 0. .. 0... .0. 56.000.03.001. ‘06 . . --
.JW...” 00 0.. .10.... .0. .0011. . 0..9... _. 6.. ..1. .1-1.0.0...|.00-...0.. . . . .. ...-6.3.1. .. . . 60 19.619.30.00 69000- .10- I
. 40 N0 600. .01100 9'..Q.,0.0..0.0 ...0 . . . 1-0.. .. ...-0.00.600 0 0.. ..061001. 0. . 00... .l 0608.. . 3.“.JIQI .0...j-Q
_ “I. .1. ..0. 0 ..0... :0; 300.0 _ 0:: 0:100”; $0. 0-. 0 0 90.9.5." ..0 .011... 00. 00 ..0» .6. .0 .0 ..0} .000“! 9.. 0*“
. .100 16’ 6. l . n . . I o . .
. 0 3.0.1.0.... 4 ...oI-nulﬂuuqut... ...-3003090.! 0 .060un..ﬂ .. .... . . . Lamontnaounu.“ gran“... 0510.313“ .
. r. “...—)0}! . 0V.0|0.004...6.0. 91:20» .-...t...... v .. 6 - o 09: .. L 31:91 0.... ... . 00.00 000.510 l.“1u00.n01.o...09”n4~.nt. .0
0- I. 00‘ . 6.0-0 .0 .. . . .06 I. 0 000 , ... . .0 0 .0.
0. 5.0 600.01.01.09 0.0.01. ..6. .\0 .11... ...?!“H...00..00..0 .0. .116331000 .0 .0.‘
... 0.. 0.6.6 .1. . . a. 0.. .0.; 006.0 .. .0.. ..0. ..- 9.3.1. 6 ~1.......»Go .h....6
. 9.1.00. 1.! .0 00.. .0 Q o .0. . ..0. - .0....01 .. ... .0 ..h’ .0116“ .0....
_. ...6 «p.14 .u .aur..rr?vﬁ 0.25.10.10.31”... 04".". .0. 7.0.0. .. 1...! 3.10.0155. . J- 01. “MN-0 .41 ...“.1
. 1.9.0. 00 6.. r0000 . . . 00. . 0600 . 201‘ 0|. ..0 0. 10.0.0; 0 0. 0
.0 . '.“H’ I? '6‘!’.\ angst. 0.!”0 91.9-0HLU : 0"...O0QH 0‘0 ...”.60'0. 050%.... ...“...‘0030.’ 0 00 ..M.‘060 I-looay’nlf 6. I JWQNTQ
..0 0 .. .- .6 . .001 - . - 00000.. 0-0 . I ...0
.4 ....NJ. ”Viv!“ - 31. ... ..0-00 W309»... $0$10600PJ 0.0.1 2 . -. ... ...... .1 1.6-.an ..0u6.u . . . . ”0.0 . . ..ql .u ... “Burl... 1.5 Nﬁ 3%. . 68d... (“0”..an
. , ..1 00 . I ..3 6.. u. :30...) ...... 1.0.. F1 .10...- .... .901... ...00 106.02.. -3005. . . . .. . . . . . . ..... . 0.6.... . .. :25 .0 ...“. ..6060 0.00
. J .36....r32011. 1.0- .411': 01037.. 9.0.1.016 . 8. 00.. 1 ...6. 0...}...— -. :0.- 0 0 .0009. . . . . . . : . , . . . . 5 . 16...... ......616 .. ..0. .0. 1.3.... 0:00.06— 0... .... "Rh; 3.093300%
. ......” 506:...21PWf-‘l £051.06. -. . 11.0w! 0 60... .... 3.330.300.3033. . .01....3. . 4.. . ... .6... 3.33.0 .18“. 16...: . . . ... ...: .. I. . ...0 0.00..1. 1%....06 0.... ..gvcﬂmmh‘. #340.
.- 000.10?- ....JJ 14- 0.0.0640... .Iq. 0m 0.6.6... "6.. 0. :09 ”6. A700... ..0. .3970. . .3..- .l . ... I. ... . 0 . ..0 ... ..I. “...! J . .- 1r... . 00:.“ .0.-...u0 0. i .....0. .34
. .0. 0 0. .0 a is. . . ...,ri Q V000319 .007 0. ‘00. o . 0 . .0000 .0) . . 00. 0 . . :6 00.090 .0 .0 .0: 00.0010. “all 0. :1 Li
... 0 .took’0 .21.. .0.-... ..0! b-TJJS. ... .n. 000.0900. .0 0. .-6.... 00,00 ... .06 0.0. .00 . . . . 1.8.0.0-...26.0£Hvbiworuto600101vu~9 . ..hvo. 010.. . h. 0 0.0.". .....u 060. . 0 0 a. 0.0.3 ... 0. 0.! . .610 .6 6.00. 3. ..0 ..0-... '0... 6.1..
. . 30 .- . ..1. . ..0. .00... ...0 00190900360. 0313.31.65 .l.. 0000.09.60.00.“ .....l. 0- 0. 603.0 ...... .- 6-000 .0 0 ..0-.109 06.. . . .. . . . .. . . . . . . .. . . ..0.0... 0 ...-.0 6:00.702. .1. .0. ..0 0 8.0.10.0)
. 0.0.0.3..000. .0. a. 0669.00.06.00 0 0“: 30.3.0-3“ 0...}; 0 06.300.02.00. 300000.006 9:0 . .. . . . . . . . . . . . .. .. .0 . .6 .0117. 0.!2 . .0.. . . .0 .0.. 6. 6- .. 0.20. 16.0.
0'. 0.00.1813 30.01-01.11 0... 0.0. 109.14. 1......36 6300.31.31. .0.-.60....1... -.0 . . . _ . .0. . 01169.... ...... 06)}?! ...... . .10 .5006.
.. n .00.)...0..0.... 0%...21000L2 0.- 01 0.. . .6. .080 .. 07.00.. . 91351.109‘.. 0- 7“ ... . 06.: {.0.-.1321... .....z... . .....- -.. .. 30.. . . 0 0.. 1-. 0..
an... 009$. .... 1.00.00.63.00 0- 0. “-0.3; 360.014.0003"... .10. 01.01.001.009... v: ...006.|...I1000... ...-..0: 30.0.0... ..3-...; . .-|. . ..6..|........ ...3...) .. 0 .1 .
..0. '0 .0 0.00.03. P0 (0060“ 0 90.00 . 1 do .000, 000 3.0.. 0.0; - P6090 ..0» 0". 0000 .0, .0103: . . . . . . .0’ ... .60“
I 8.0. 0.... ...0-r'l10 30.333 ..0-.00. . .616”... .... .1000300: .23.. ..000 «0. .31 0|.92.....0.0121..6)a’0| ... .. or 2.1... o. .6 .32: .u ...r’..?. . 9.26....) .6: . .8... . . . . , _ .0.... . . 40..
f .. 0.00.1000. 0.0.... '66.” 060 5.0.0.0.... . .....0. .- 0”00 .l 1.0 «00 ...-00.0. 0 06-30-10. 01... 081.001 ’19. 6 0. 0- 06 0 . ..0. .. . ya. .0000 6 56104.0... .. . 0 o . .0. .. ..00 . 00. . . . . .. . .. .. ... c . 60.0... f
.0.-..0 .0.- . .413“). ...-6.81.... $05.6.”- id"! 00.0...06... 0. 09.5.... _ .... . 1090.15.01.91. -0.. ....E- .1. .... 0...... .... .. 2...... 9:92.33: 0...)... z . . . . . 2.6.... :5. .
I 1.0. “000. 1.0 0910.0. .09” 0 $00. MVP”. 0. .0.. ..00 .. .-.. . ..3- 0. 066 6.. 0.0 0 0.- . .200. 0603.60.60.35», ‘. 0. . 0.6 . . . ~30» - . .... . . . .90... . . . . . . . . .. . ..
...-.0. .. ..A . o 1 ..0. . 0. 6 0 .9- ... 0- . , . 00.. - J... I . .. .. . 0- 3.. . ..
g 10 00“ gr. 0 03.num 3..-0%“. 9‘ ”“00 00? 1 ‘0“‘050 COLXNW‘. .. .99900d90‘nh00‘03‘90u 0.00””. «run-“0.00116iﬂorl . 0 0.000.- 66.’.. n ...}...H‘u-r“. .0'6QW“ .00.“- “0000 .0 1&2“. .6006 110.50.10.00- Q...‘ .29“. Q. O.H..-...wnu. . 0 60.600
.0 . .0. . 9.00"“ 0. . 9 ...-000. 6’ .89 6 0. 0.." .0060. ..0-00 . . . . 1 .6. 0 0. :- v 0- . is 0000., .. .000 -. 0. .. . 0 . ..0. . .l . 0. A . ....o. - 19110.00}.
4001.. 0 ..0 ”009.60 90.0“" - ~010|0000000000090uﬂ C. 0 ‘00... 0.1 0 90 01 ...‘lu-u ”00.10“". Vanni”. .0000..0.0.¢ku0 .00“ ....»09 o «.0200 0' .0. 0-“ 0000. ..30...—n5.0.6 ...0 . .. ..0-1...... .0 n? 60.0.. .3. .00 n . .. ..0... .0 o. ,- a. .3000.
3.5.06.3»... 1.891100». . 3.3.10.0 .0.... .... 91.0.0. .6 :6... 6 66 .... 0.6 . .. .0.-0.. . ...: .0 31.13}... . . . . .... 0... .20.? ..0 . .3 Z .0810... 460..9"1”1’10k6{w.1100,-0‘0.
...-1 50,300.. ...|...000 ...0 10.. 3.. 001.0 .0 b.3026...- 30 . ..0- .10....03 6 .0 .0. o 0 0 0‘6 130.... . .0. .0 . 0. . 6 3A.. .0.. ...
0. I00 .3006. 01» .1 ‘0. .09 00.00. 00.1.... 66. .0..6 0.0001... ...-0.. ..l..:.- ...-.0060. 6.. ...0. 00. .... ...00I0. . 6000...»? 0... . r86..100..:..0..l 0 ... . , . . .
1000060003.. «0100» 6. 0.0.... .6. «.000 00. .. .6. 03.0 ...... 0.... )0 00 916.00.. . - 00’ . 0.00 0.2.02 20......0. ..00 .0600- ...-00.. . . . . . 6.5- .0 :9). - 0' .
.0 £00.00" .0 .30.! 06001... 100000 016000.00 30.6. 066 '00- £7. . b . 0-1 [Qt-00.0. 0.- . . . . . .
0 -

6.01.00 .06 0.1.. 1.“ 000- 9 0000.000.’ 0‘0“. .0 9,063.. 0. 00. ..
...3-06.600 ...-0.1.1. .0.... 06.0-
.‘v .230 .- 90,1 0 0.. ‘09. o. 00 .00 .0..

 

010000.Q0T-. ... 08.’00~. .10.. ...-“’0
o .966..-.0.| .00 .90 6 p . 0. .0. 9

..0..- ..6... 00.1.. .. .9105. 0.. ...l0.&...la«.|~

 

       

     

 

                

  

   

  

 
      

          

 
    
 

 

 

     

 

        
          
        

   

     

   

   

 
         

                                      

 

     

             
    

  

 
       
   

      
   
 

      
    

       

 

   

 

        

 

       

. . .0030. .0 .0 1. :0,- ’0’.
o - . . . ..013601. 0.0-00. - ‘10. p. 6.076. 0 .9. 09.56.... 102.0321“
. . $3001.10. .0.. 00-8. . . . . . - .. ... ..n . - ... 00.112.131.13 1...!- ......... 0....
12.63-63.00! 00060 ..3-000009100. .. 6.)...‘0500 0 . . . . .. 0.0.16... ..0.-. . .0160}. ..0! I .... :- 3.1.2219 01.6.0010. .00 900.06....11-90 00 ... l0
. u. 1.06010060020: .3. .1400. 0.0.. 0.00.30.19.01... ... ...-0 00 00!»??- . . . . . . . . ...6 l. . .0 ...3 00.1.06. .0. 0....0... ..0-0.. ...-$041.00. 0 . .0.-06‘... .
...... .0.... 00.0 600.0 0 66.70. 0000.0. .000 .L 16.0.... {30000.\~.0I.6.000.. ’0... 0 . . .0. (...-..0. 6) ..0-.0. 0 6:...00-6.’ 0... 0 0 ... . ..0. . 7 1.0.1.0....
. . .. 6....‘60. 0 6..) I ..0 .41.: 0.6 . . .0. 1.9.0. 3.0. Z 09.6.0.0."n «......6-.006 .30.. . . . .16., 00.. 06 .0.. .0603... 0-1.. 003910... .01 9.0600 030' ...] 03.7.0.0 01.9363 n u .
w . ...“ M50698”. .100! m0...a.».61.01..6.§0.. .. 0. 9100.100“? . 0.90.... ,0 .0. ._ . .. . . . .. .. 0. 0.....00. . 60.0. ..0. 0. 1.0.0... ..00..... 00.1000.- 0...‘ ...1. 09.0.5 M0..‘0uvlnm3.t.f0~ 03$ 0. .0
{..0 a 00.... .2006. ..n.“ . ..1... 610‘: 9.1.0.216. 09.1009. . 10.90.. - ..3-.... 0. .0. . ...-.6 . . . . . . . . . .. . . (..1...06.000.. ... .20.... 2...!1‘f. 6- . 0011.03.05 ...0 03.0 9.151.063.1061: I: K
v 1. r0: -9.). 000.... .5} 0.00.0 ..0 .110... .... ...3. .30.?!0500053. . .0....30 ... .. . t , . . . .. . . . . . . . . .000 . a6. . .1000]... 0.. . .0 . 0.9.8.5160. 613000.. 5?. C .033 '0
_,- .6: .0.“..46006100-0. .0....0....J 0 6v... ..1-u0.-.~..000..... 0.0.3..“ .l..0.0.00. .000. 30100.0...)17199 . . . ..0-3... .1. . . : . . .. ..0. .0.. .00... 06...? . 0:00.19 v...-00.-....;|| y. . a; O.
1190...... o. 0 300.10.;0u 06 v .. . 0 1%‘96 0.. 1000.00.00 01 0.0.0 1-00 00000... . .190...- . 6. ...-.... {.00. 0|. .0 0|. . 0.. .60.. 0 0- 00 0 0.1.6.060. . . . - . .. . . .0 . .3 0.00 00 ..0 00- 0-51 6 0.0 0.000 0.00.. 010.030.31.0013 9'
. 16......- $3.71... .......ﬁ.110. .00. Wm.”6060..0L .101239,-u00|0.......0_0.1..:. 360.0... 0. . ...... 6 600.103... 302.010.220. . .396 . 9... . .. . .. 2.0.. . .0. ..0-21......- . . .. .0...- .u! 0 0.. 1. . ......1...’ .... ... ..0... .0. . . ..0: . 0. .0. 1... ... ..b. .00.. .1061. 00.. 00.01.. ...: I 123': 06.0.0
_ . .0 6.0.0.; .0 .80.. 00......0ﬁ .. .71 .0.. .010. ..- .:L. .1 30:010.) .06 . ... ...... 0. .0... .... 2.1.3.7322... .. .-.0 .. o . 6.. .. . ....090031005 0 0.0- 011... . ...1 o . . .3: . .0- . ...3... . . _ .. . ...... .- .. . .- . .... .0.-6...! ...... .6 .00. 3.05 ... 009..
. . 0. 003.007.030.1303rqé .l. 00:020 0&0 “90.1.70300’0310000‘r'. .30....00003039110100 . 06... 6’..60.y000.- 00169-000000. .21.}; 0.0.1 . 0. .... . ...\.-.000. 0 3000. r: ..006 .46000000‘... . . .. .. ... .0-0. .0.. u . .00. ..0. 01600.... 0110.10... D 0.1.0 00- 0. 10.00, ”nu-010.00,: 0.0
. 00”. 00 .6010.o1..0 0 00041. 0000.06.00 .. .0 I. 0. 60009000 - |00 0 00.00..- 0 . 050-6. 0. 30.0.0 ..0' - .I. 60. 06 . .0? 00‘ 0.9. . .00. 0. 10-00-0091 00. 6|. 0.. 0.. 0.. . 0 :01 66. .... - ..0 00 .00 6. o. 0 . u‘ 01 . 300.... t 0. I. .00 00.0. ..I- . ...).10. 1,100.305’ .00 0 0
“N. 61.01.31.110... 11. .-.0.. {M030 1... 3.1.3.3....” .1... 0.1.0.0- ...031} . 13 ..0. 02:01.0. 0.0.0.... ......0... 2... ..0-1.1.9.... 0.20 1.35.8.1 ... ...3. 0.. 31.9.1130... 6...: 3.... . ..0: .00....» ..0- - 0303.0 . . . . .. . . . .0. 52.131.00.2100.‘ 06. 610.0...1660 ..0. «0%
. L0. .036.- .09 . 2... 0,. . ... 0.. - ... 0. {30.0. 1.60.. .09. 1. 9.0. 0". .101 .0. ...1-0...0..:0¢ .0-... 30.0.3 ...-.0... 0 300.10.. 0... ...3 .. 0610.626 .0.. 60‘ .3611 . 3.6.. .00. ... . .. .... P . . . . . . . . . .0. . . 00.3.0110.” Jug-«[460... .11 .... .- 0.0.3.0310
.0. ... ...:H‘ 0 0.0.. ...!- - .5136..06400 000.... 0 05 o .0192... 00’ 0 .....0.....0 . .. “V ...-1.1.6.6000 ....L 0.0....» 6.. .0 0.. 60.00. 0.. .. . 90......0. ......1. 1. . .10 0.01. 0 6......1-0 -0.0..0 . ..3. .... 0.0.30 0 ..0 ..0.» c 10.00. 0.0". 0 ..‘o. .16.!‘310 90.000.00.100.
I19... 0. ..0 .60 0.. . .0. .35.. 10000.6. . "1.... 0...} .000 ..0-0 .16." 0 ..0-0002.330 £060 .61. .. 021.3350, 0 0600 ..0. .... I. . !30.|-.»I.0| 0. . ... 66.10.400.100... ...-60.0 0.6 :11...- - .. . .. 6.. . 0.00. 0000 .. 0 09.. ‘00-: 600 90!.0: ..0-«000000.. .00 0 190.010“
.0. 9.....000. - s 0‘ 000 90.3.5309. .. ... . v k. . 061.65.": .. 0 01.00.00.109 .009 0 ..0. . ...-..0 .. ..0 . 0.... .. 6. 0. : 0. . -. 6-. 0. .. 0 ..0... . .0.- ...... ..0 0.. . . ..0. 09...... ... .. ..0 . 1169.....11850. 00'... -.0.
0 .9 00. 000. 00.‘00 ... Hf“ 036‘.“ 0040000- |0 0'. #0.; ”#106 “.6... .0.”- .."JI 0.00 1-00. 0.0-“. '0 0.0001 0.. 0600".- 0§:00 L. 0.1,... .0 (00 N0 ..0”. Q0“. ”6-4., . n .. . 0-9Qm0l0 0 0 .000.‘-.or.. .. .0 . .9090”: .... . . . 0 ...00900.10 ..‘0 0 ....0 .. \- .0KLQ 0 ~06 ..9”...Q0..Q.. 000.60 9.0..
..0. 0? 010,140,010... 000.«...0...-000..01=.’..01 .050... ... .0 0.3.... H 00 ..00...‘.000.o.0. 00000.10... ...-:7 .Q6 .‘h-on'...’ .. 6(1. 3.0.0010 0-00 0. 1.0.0.- .0... 000. 0.9 .. .0 3:10.. .00 .. ..0: 0. 0.311.200 ...-...). 9.31.20 .00..» . . ..0 00.00....-.0...0.... .0. .00.? .I... . 0. .20.. ....I..~..0.
. .00. . 0. .0.: -001 .00 .00 0...” 00000. .000. ..u. 0.00 .....0“. . .1. .56. .600. .0. 0 .0 . - ., 0.0 -. 10600: 6.. ..601.r 66.000... . ....0. 3 .10- . .1 0.000. ..0-1.. .3. 60 . .00 ...).u’o .... 2. h 6.. . .
I. .60“...0£. I 0 0.0.0 0‘.’-|-.1.0.0 w 006 7 , 1-00.. 00 0 .0 0 .00.. . .1 .00.. ..0-0. .10.... 6.. 0 10‘ 0... . o. . I. .60 0000 0 0.00.2000 0.6.. 9.000, 00. . 0 . 9. ..6 . 0. .6 9.000.. .1700 0.000 .. 00000. . . .0 . 0. ..0. 0' ._ 006 ... 0.0.6 ..7 l .
. i... .00 000 .- .NI. ..0 .05 ..0. ....0. . any-0:. . .00..r0..|6.......0luu0.0.13.09 .000 L36 ... «0. . . . .0- . 00.007...“ .300. 60“.. .91....0 ....n 0.. $.60. . on 6.0.0.??? 0 . .0 u .. .... 0 .00 0 .0 0.0.1.... . 0.13.50... ... .- :30.-
210080. .0.-060. n. 00:“..0 ... .... 0. . . ...: .. .wuuv... .. 0.]: 200.0. .0 ”0-4” 0.60.13.60.1-9 I... . . .. ...! .0 1:3! .. . ... ...6 n ..0 0.0.1.0..0...-..0..0-.. . . .... 0 ...-I. .. 35600.13... .6-1........\- 0.
00000 0.0““: ... . 0.6. 30 . - .. .0.: 3..... ...-.0360. . .000 .2‘ .9 .. .0 03.. ... . 0 0- 000001. 0 ...00. 0. . . 3‘70 6001.000 0.... ..0-0.. ..0 .00 00 500.30.. 0-. ..3. n . o ...0 . .631... . 00. .. :40 .010 0 001 0.
16.1.76 or ...0 .0 ..0 6.... 70‘. ... .....0... 1.06 5| . «daft-0.60... .t. 000060........J... 60.000. . 0 .. ..0-0.6.5.1300on ... .0. .0... 3...- .... 6.0.1.90 . 30.9.5.2...20. 0.. 0 0.0.0.6. 3.020. ...3, 3.9.0.2192. ..
- ....0 ... 6.... . 1.1. 0 .03“ l: 2 9.0.6.0. . . 0636;196:010. .0”. 3..-...: . L. 1 6.. .1... ....S. 0...! 0 . .... 0.9.23.1. . _ .0... ... ... 06.0.0.1! . .. .- . ...
5.0.5 .. 3631... ...-0.0.0:»... 1.0 .. - ..0..I.-..0 .81....£.-.. . 0.. 0. . 3... xs......7...l¢... ..0- ....-. .00 ... ... 2...... .. 3.... .. ...v......0.0...6. 0. 0.0 .2211. . Ghana;
. ..0 ..I .0... L ..0 5-0.. 0... .60. 0. 1. 0.203.... .....6. . ..3-0 ... 9.2 “g. .0. ...... 3.3.01.3! 0|. .. .. .. 30.2.1... 7.66.0.1 . . . ... .. ..6. .6.- .. 6 . . 6.20 . .6- .
. 0H 0 .-:0.rl 4000- 40. 00.. 0.0. :0 .0 060.33.. .Q. ."9 ..0 .01.... .0..“90000. .”0.0.‘.0. $00001... 000‘. ..na 6... . . 20.6 00.0. ....V 6.100002... ..0.0.. .... .0 ... ...01‘0301’09 ”000:.060000... .ﬁriv- 010i};
0..- .600. .3 .... 60.1.6.6. .. 0- ..0 . I03...6.0 0. 63. 0 > .... 12005:... .. 3.....0 0 . - 0 .0. . . ... . 2 .-.. .0....0..- ... .0. 0 00 . ......1. . ..0? . L. 6 .0 70.000.300.630 03.0.01... . ..0 .11 10.3.1900 .0015
. .06... 0J0...0.¢0. 2090.010... 0.. 1....»‘00. 0o «000......(000. .0000... .0 0...: 00.. .. 6.0.0.3639. 6. .....30r’...06.003..0.. "000. ..0. 0.001.000. .10.“ 1.0.6...) ..3...0...000..0-.90.-.1.11< 0. .000- . . . ...}. ... 0-9"... .. . .110... .... .0- 1.01....5 I: 0061.00.00.360ﬂha 0.. .113”, 0.003.
L030) 120“...- .. 1.0.0.. 0M3... 6 . 0. O. 6.060.135. 4.0. 7.00. 3.00000.- . . .... . 0.1.7.0 0.1.00.- .0 . . 0 0-6 . .0. 0.00. 9. . 0.60.0.2: . 46. .. ... 3... 3.1.0.. 60. . 0... ... 1|.»0000na000..0| . .0": ““0350oudtowlga.’ 0....I‘nga
- . .0 J. 3000. . 6 O3, .30.! 0.6.. .0911... 900000. 0. V “00 .0. 0.0. 0 0‘0. 50.0 ...0 0 0103.3 0 . 0. .0.-00.: . .... 0 0. 60.0061... 0-0 . 0 00 0 0060 ..0. Q ..0 «1 ‘0 .00 .
. . - hr .1136...” ... 6.. .- ““30, 06.0”.“1.w0..’l60- I . .. .1 0 .111”. ..u. Q... . 01...“; .91. 10‘ L06.- 00.. ..0... .00.»!06». an!” .0. In: «0.0 - 00410. . ..001000000. .1. .0 .306 . .901....._... 00 0....”0750501096-000033 g? 10010110. “-0210.
. ..‘30..0 7000.21. .. ... .5. 0- 10:09.10. .... 06.0002L 9 v. 0024-, 01.0..1..-...06-. .350. .0 ..0... ..0001: 0. .. ... ....- .. .10.» | 0000-00-31.... 00:0 .. . . . . -0....906 ...3: 0 00 . .0.. 17.. 00.06. ... 6.1.6.90 . “0.00009 ...-00. s... {5.0.9.00 ha0‘0.r.1011§g
06.00. .1.- .0- .0- 4. -. "00000 1. .0... 13.00.260.01. If: .. l- . .000. l. -.. 6.1000. 5000.... 69-630 .0 . .0 . ...-000 .00 ...-.100. ....0. 001100. 0. 2.01031 106000.010... . . 0. . .0. 0.. 0. .0...0. . q 2...... . 0.0.00.0..00 .0. l . . ,. 000.200.00.02 0.40 l. 0.00
9. . 6 0.9.10. 2.3.0........0..... ..) 0.; . 2 .00.... .... .0.... . . 206.0... . 4.. 3.....09..V|-1|.1 ....0.60-. .... .0. ... ...... . ..0 12.6.68 3.0. 0 . $0.. ...0 ... 0.... 00.- .. ...! 10.1.0... 0. .
. .0.. Q ...!016JQ1I 0:0- ...\ .0. .. 00. 00.00.. 30.6.00 1 - 9.00.66. v.00; 00300902001600.1300 .6 .100. .. . . , . . ... . . I 0 0 - 0.0-0.0 . .
... . ..0 6 L .00 ‘0. 50. -. . 0 16.0.0 .000- 60.. o ..0-0. 0 ...00 .. . .0. -.Q0 ..0. .0....) 0 001‘ 0.00 0 .‘6- 0060.... . . . . . . . . 100... .0-00. ...-0 .01.»... . - ...0 0 . .
r... .. . ...-t... .0..r.r-..l 9.50 ..... ..0 0. ......I... . . .032... 03. . .. . . . .. . ... . ......0 6.1.0.5.... . Vii-1.3.0.1. 0.. .
. .0 v 025. ..3... .|.U:.t.0.0...0‘10500........0. .000. .00. 1).. 0.01.0600.‘ .0....1... .00.]..0u0f . . . . . . .00.. . 1 o .0 26. .00 9. . . . . 1... . ..0-1.
.- ... . ... 0L0 . ... .0.-.009. ..F.$0 0.36... . . .10.... ... 6.. .0 . . 0.6. .. - ...6 ...-0C4... 00.\0. .
0 20.0.- 0 Jon-.0 ...-..0.— 1.1.3.0..0.00).,6011’0‘.6...0£ 0.00.0

 

H3 .. .00 Q... inn“... ...3-
..I. . ..u - . 0 . .
.wn‘ . 1L... I. 30“.: . ...-n0.

 

. .00 0.0. 3.10 .60.. 0‘ “10.16.60...

     

 

   

             

           

       

60 0.0 .000». 0.30. 0 2.0.1.1....

0. 0 00. 0 WI].0.0003..D0| .....0.0c..v.| ’00...
.0 .0- . .

. ...!0 .0 .

. 0. ...-.1. .60 .000 0D .0000. 0.70.0000 .1.0 .0600. 0.00 v .Q. .0 - 0
. .06.... .00. 6.6.0.: 0 .0” .0... 1 t .. . I. 6.. .. . ... .. . ...... 6 0... 9 ...
026?... 0.0.» ..0unh.-..1002?1. 61.5.2.0; .JO) 003... . ..| . 1 I ..- .... . .20 .. ..6- .30
v . 0-... 1L 00.0.0...6 0. .d .J'. . . 60000 .36 0.: . 06.3.3235- ...0 6.0.0.0..‘030. .
0. \l. 0 .60. 0‘ 08.7.3... 0.1 0 10. 6.... .0. 3.100013%. . . ... 0.....60.60..00K.6 . .022... . .1236- 0.. 06 . ..0600. 10.0.6. ...
1.....0. 36.00.10.9051 6.-01.o?.60......11.-...!. . .. ...... :0 .1.... 2.0.. 30.6.0330.- .. ...6 00. -.. ...0-
0 0 ...ﬁ- .00.... 100. ...3}. .0. 0.. 1.0. ..0 0 :0 .. 3.1.. ....04..k66.-..-... 1.3:... . . .61....- . . 6- .. 60 91.3336. 0.7060 .....- 0020-...
.. ... .0... .. ,- -:- 1.00.3.0. 0 .2 00.19 .... ... ...... .0130? .10? ..3 $0.. 0.006...- .000-7. . 9Q . . .0r 3|... 0.. 06.060. .0 46000 . . .0. .
.0 0.100.000. .hn.’ ...6: . ... .0. 1.00.311. 21.60.... . 300.. .00! 2.10.0- ..6.6(.0|O.... 0- ... 1 . 1.0. ..0... ‘0I .
(0 .-0\ 0611000 .00.: 0.0 0 . 0- 00.01 .. 60 0.00 0; ..0. . ..0 .06... .0 .0 6 u I 66.100.107.019. I. ..0. ...‘0. ’0-0
...3 0 0....Jo....o-.\60. 10 64 .10... .0’ o 00.6. 31.660 06. .64.. 0. .1 0. .0. - .
, ... .. . . H . 20.0.52... ... .61 ... 0- .461... . a
. «20.0.0.- 0. . ... 0~0-0... .6. .799... 00.0...0 . 4 .. .0060. 4.. 0. v .. ...1 .01.! ..0. ..0. o. -..3...- 0...... . .
....0. 0010.00 ..00n0; 60.1.0 3. ..0. 2|. ...L.. ..00. .0. Tu.

..0. 3..-.01.... .. t0. . 0. 00.0.00
00..- 0' 0.... - ... 0. 9.10... .036 0.....- ... .. 9.0-... ...-.... o. .. 6...
6'0... 0 .. .. M0.62.0.6... .00. 0. 6.. 0....0- 0.1.09.

3.1100....m.01002.......0-60..0..6 ....

   

     

 

  

0.001... .‘.l.0 Q-..' .0-

       

0 9.600066 . .0 .... ..0 ..

     

         

  
 

 
 
   

 
 

 

    
 

 

    

 

 

       

 

       
 

 

    

 

          
      

 
          

     
       

 

     

 

                       

  

... .. 0 0.02.... - 1.60.0 . f . .6 .60. .... . 0. .0.-.1096 . . . 20.3.0001. :00...1..10...0-..60000 0 ..0000-0. 1‘. 0 (F6.
0 .3v.......v.1. .125... .. 2... ........|.-. .2 0 6.. :- 0 .. 0.0.!- 329-05.... . . . . .. .0.-9... 110.... 1.0.6.... 0.06: .. f- . 091.200.130.- .90.
. ~. 0.. 000.. I0 . 90 609.0000 ‘0 0" . 01 0 0. ...0. - . . 0000000 .Q ..Q 0 .. . :00. . . . .... .9. ..IQ6000. I ‘0... 1
. 0- .0000 . ...0- 0.6.0.231... o. n .....0l0 .0.-.1...!0 . . ....v... . 6. 0.....- .03... .l. . . . .....1. “.0 0 ”110.302000...0 .0006}. ’1... .
. -. . 00.. - 04 a .09.. 0 . 0.060} . 0-00. 0.0.... ... lv‘llduﬁuso I: ...Q .10 6. 1000-0. .'?06 —.0 000‘ 0;...
. . .66 0 0. .0 ...0'0 . l 11 1.01660 . . . . . ._ .. . . . . , 00.6. ..N .. .0.| . 00. 00 0-3.»..6009. .3100319 00.! 8.0.9 .00 ..3 5:60.139“? ‘8
...... 301.000 ..0 .06.-.. . . . . . . . . . . . . . _ . . . . . . 6.. .. ..0. 0. 0 .v ...: . .. 00. 0 1. . ...3-0.9 .0.-.0... . .0.. ..0-6.0.0... 60-00.:‘000-1ioﬁnuu 619000-1113. £91
. .2...0. ...-10.0 .60... .. . . . . . . .. .. . . . . . . . . . . . .0....8. .. ...0. 69“.}; . .03.... ...: . .10....00... ... ... ... 0. ...‘PW-A..600.:.7 VI ..1 6 6 n-r .
. - .. 01. 1 19-. . . . .00 ... . . -. . . . . .. . . . . . . . . . . .. . . 01 . .00 006’ 1.00.0.9.- .. 00 0‘ .0010... .0 . 0 1.00.! ‘0...- 0 .0 .10-01. 6.0.."r II. .000! 1000...! ..3-‘0‘ ‘
. 0-. 0 .. ..0- v. 6.3.00. 51-»...- ...-0...).-. ..0 . 0! 0.. 00.6. .60. . 2.! ...-0 .013 .. .12.... 3.2.20.6... .. ‘61; 0.53. 1.133.306. ...000 .. . _ .1 .|0.0.| ...600 0. . .0. ...1. ...0.'....0 . 6.15 .6. .2063... \9001003510.00%\|2.’190 thv ‘0‘... Q03
.0 3.0.0. . .1. 1-... 0.. .304 . .... - .91. 6 ..0. 066. . .... .0.. .0 0-00 . 0. 600.3 .060 1. . .. . 10.0.. ..0 .0 .... .... 0.. . 21-3-17 3... .0. . .0 .0 .0 3.0.“. 3.10. .0.. 0:. 5:0 . ’6 1. 0. L 0.1.. v ... 1.. . .00: 60130310300000". 90.100.301.00. ,.I
32.2.1906! .. . ... ....J 01......&.-0o:.i.0.|c ...-...... .00. ...! .90.. ..91335 .. ...-..0 ..0. .0 020-? .1000. . ... .. .64.. ...-...}...60 .. ..C .....1..0...0..l... .... . .00 -. . 03.6....119760... ...... 10.0
. ..0-0 00.6. .0 :00. 6.6 30......9000.0 . n 00 .: 0 0. 1.13.0- -...-.- - ..00- 0 66.. n . Q ..0; .. 1.6.1.0.... 0.1010. 0 . 0 0... .0.. .00... .0.
.0.-0L: 0. . 63.6! 0.00- 200. . .0. 9.... 0.. . ......9 000.00. .00.. .0 02...: .. .0. .0.. 0.01M.... 0. 400.. .. . ...... 00. K... .... ...0. - 00.1.6.6. 0.7 .. 01 ..
A 1 .0 I .0 0.. 0. .0. 0 0 26.0 19’ ... ..0- ..00. ... .6 . a. 060 0c. 19......0 . 1... . . . .I.‘ 9.6.20.1! v. 60.1.. ..0
. . .1263 ...-96061.11... .... .... .. ... v.- ....b1.06.!0..0..|- 006. . .. . . ... 0: .1D6.. . 0.3.0.00217:
- ..3. 0 ‘60... 000 0... I. ..6 0 .....6 0Q .0. 06 50.1. . . . . . . . . . . .. .. . 0 6000 ..0.- .. . . 6 60 .6... .... 6 .0. .0 .
. ..0-00190000....t0 301.001.000.000 ...: . . . .. . .. . . . .. .. 0.001.013.0109. . 3.3.110... ..-.... .0.... “0.000. 000...} 0
- 0.... 33!, I 6 0 . 0 000.4. 0.000 . .0 8.0 .. -00. ..00 0 . . .... .9 :60 .0. ..00600. 00-0 .9. . . ..0 0... . 0. . .060 0.0.70.0 .. ... . .

 

1.0.21.0...) 30.01.13: ... ... 00 .0. .... .0. . .. 0600.0. 3.0021...
..0096‘.J.|... 0 60.. .0.-0.6.1.0300 Q1109 0. 6-.
- ...

 

..Q-uor‘ 0 1.1. .00! ..0. .0 ...IQPD-ot. ,. .
..0... u. ..l...0. 0. .. ... - ‘0. 6.6 0000 0 - ...-.0 10.0....- .ioo‘.00.00vl£0$\‘06|1990.f008 0.00.0638”.
. I . _

    
 
  

     

...0.60.a00.0-_.1 .
.l0.|06 ..

  
 

-.Q. 0.00)...

  

         

 

 

       
 

           

       

     

              

    

                

   
       

.390.
-..0-0. 0-‘00 -Q.
. 0:. 0. 060 s 0 . . .0 0. ~ 0.. O. .0. .00 0 . 1-0
.. 0000. 6. ”0. . .0... ..‘00606. Q0'10...I0 160.010.01; 0. .0.. 0.0..|. ’Q-v. 003.130.150 .Q...000-7.u
. .006 - 0....|....o. 00 60...... . . 1000-00-0 6,. . 0000. .. .021: 6. ... 1.1.... 00.. . . 0.1100. .00 ...
. ,- 0- 9. . . .66 - 6. .030_0....000.0.0-0< 00.7,6060 . . . . . 0 .00-0 0 .n0 0-1 ... 16-6. .. . . . .
.60.. 0.. . 3.4.11. 16.....- . ...-..0 ...6. 0.... . 90.00. . . ... . ... . a... . . . 0 .600 ‘10-..016) . 00... 0030 0rl0. 0 3‘ .0606 V0 0 106.9992!
. ... . .0 0. ..0... ......70- 0 . 209.... . . . . . I... .~ ..0 .0.... ...-6.0.. . . S 6.000.- s. 6 .0 00.. 1|. 0.0.5.6. .1004. 01130.1
. .. ......2. ...-......t... ._ ...... ..0. ....03. o... .6 . .01... .00... .... . ......083..\-6.6.l.0000..60.60.|.. 11.00.060.916...-
.. 03...!!13t01‘ . ... ... : . ... ...3-P3131 .... . .51... 0.0.10.3. 3.00.1. . .0.-0.000001”- .110101 16.0. 6’00 )0 1‘6
. . . ..- ... . .... .... ....- ...... ... . . 03.1.3.6: .3091}: aka)... ...3-I031
. .u‘. ..0 ...3-.06 . . ... 2 : M6 .... . 0- .0.1.016. 000.- 01.. 0 0600. :61... . .... :10... 3..... 0.. . . . .. .0 0.0.0.919? 01199-010000
.. .1 . .6... .01.... . .. .0400... .. .0... 0 6 6-. 6. ‘0... .... ... “...? z- . . . . . . .. .... 097. . . .. . . . . .. . . ..9 ...?90220 002...; . a.
. .61. 0.0.1.0 0 910...: . . -.. 01 I ..1 I 0 0:080 .. . 6 . . . .. . . . . . . . . . . . ,wd. . .. . 5.900.111.10..00Q- .0.!9.00028100r|.t$9
.. 3... l7 , L o. ....6. . . . . . ...: .... . . . . . _ . . . . ...q ..0. . . . . . .. . .1003.- 0 ....IA 3.3... 09.0...3.9..10000. .10;
. . . . . . .....T . . .... .. . u. o... . .....2 -. -. . -. . .. 3 {0160.0 -610... \....£0.l (.0. .0.-9....31106993
...-.0600 0 0 .9 .0. . . . 0 . 0.. 0 . . . . 5". .0....-l |.03 0.01.0.- 10.¥Tl.' ‘10.! 1530;000:0060!
\.. 0.6-0 . 8 .. 0 - . ..0 .0.. .0. .. . . . . . . .... 60. . . . 1101 000. 0.0...0. 6.. 0 1.: 0 .00- 00... 1.
...0.. l ....0 . . . .. . . . . . . ...... . ..k .... .. . . . . . :P.’ 3.0... ..... l1....h£.l-«nu-00.00F9. 0.30.00.66.06.
i . . . I 2.0.. .. ....Hasn 0.0.36.1... .010..‘0£..‘0| .0...|-6|.6|0¢..0u ‘0.
.000... . . .0... ..000. .. ”g. 0 . . .00.?! A “0:6 60. #39,}! ..0". 300001301:
.- ... . 1... 26.. 1-0... . 6. .7? . .0306). 20.1.1. 0.0.0.0...1-5. 09.! 0.3. 00 .01,- 0.90””
: 6 .. -.. . .. to... .. ... .... .0. ... .0... .. .51.:(0..0008.|06:}I0.0|.f 3..-191.00.103.00
00 . . l 6 300.00.006.2011-‘0‘ 3..-00.0.3.1... 0030-11-00- 0
0- - 0 a. . .0 . 90‘... .-.-Q... . . . .

...- ....1.Q0.0 .00000. ...

        

 

0n.- 0 D . f 0.10 Q6. 0. . 0.0 0‘ 0. I.- [0.20 0 000. I
. c.3393. 0.. 0-0}. . ... . ....0’2 - .....S. .. . 76.3-6.0 09
a 0 00010.00. 0 0 0001.0..l. 0. ..Q Q .0.. . . . .. 00 ...-10.0000! . .
.1 10.10 0. 0. . . . . . ..
. .. 0 .009 . 00 . .. 0 90:36-12... .0 00- 0 0.0. . ...... .
..0 6.0!. 001 0.0.13. 0’ .0

   

: 00 ..3. . .thuo

 

... 9
......I

 

..Q- ..0-0 .6..I!0.\.0 ...7. .
Q. 000.1 v.0Q0.. 6.0. .0 .1900...

 

...0. . ... . . ...0. ... ..0 -|....
. .‘.0 .0 ..I 0

          

    

 

  
            

          

.00.. 6.0.. A- 0... 000 1‘0.
00. u... ....6. .0..00.‘00...00..
. . .00.6|... ....06006 10.1 0 o. -|. . . .0. . :73... ... 3. .. . . L. .0060... 0 00......» ...J I... ... .0. .30.: 000., .... ..0... _ .0 ..0 060 ..0: 9 0.
.30..V0-01-. ..l. ... .0. . . 0 .. .... . _ . .. - 14.0....0-‘00. ..I .0. . .. .... .o .0. 0.0“... . . 00. .n. . 6 ... .60....
00.0 ..I 00 0 00.... .60 0

 

.\..

       

   

   

.0 .6. Q..‘
..00 .00... 5‘” ”5.0 .. 10.... ..0. 6.1 ... .6 .0. .
0-0 6...0. . . 0.0-. . O. . .
run“. I. L... ..0. .....0. . .0 ..0 ans... 0. .60.

                                   

                         

 

   

        

 

     

              

          

 

         

              

 

 

 

 

   

 

10.20. 0600-3... 0.0.9160. 0.26.! ..u.0»-J0.00...-00100.6.
. . . _. 000 0- X - .16.. .0.-.0391, .21”, 16. 0.0. ...-360.001.00.00 0.0!? O
... .0 .0. . .6 .6 6 .0 v . . 06000 '. 0 0.0.00 . .
X0 - :0 . ..0 02.0. 0. 6 . t 0 ...0- | .- . .6. . . ... . .. . .. . . . (-91.111630. . . 66 0.0.10. .0.1.11 .9 .006...-.’. 0 -\‘)00.’90)I§1\‘|6000“§0§0.
......- 6- 0. .0.... .Q. ’0... 00.0.0000 .. | .0 . . . . . . . . . . . . . .. . 11...... 106 .0900. .....91 6.06.000? C. 003.311.912.917: 6 .‘010'3'6’000 6:69.01991‘31
.... 50...? .0 ... 0. ..0. 6.. . . . . . . . . . . . . . . . .. .. . . .. . . . . . ... . : .10....6)! . 2”... 6.0.3.1.]... I. ...v .210. .ﬁ70601620M0-6‘J06'!|.‘01 .9300. 01.090.
-. ...0 . . .0 . .6. v. . u . . . . . . . . .. . . , . . . . . . . . . . . . . . ....33-1-906: 01.696.01.300.)- .6.H 16.0 0 .0....nl900-019.‘
.0. . .. 3.00 . 000 . .. ..0 91 . . 0 0 0 ... . ...-1.6 - 0 .7. 0 ..66 61 . ...... . 9 || 10.00 000.0,... “.0.-.1030”. 0" (0.0.9191-00!
. . . .6 n0 “.0..." 6 M3 0 ...-0. 0.0.... .06. 9 J... . ...10}M.L.».“.. ...». -. 0-... - 0. Q.. 0.. .0. . .. 0. 0.. .000. 020.16.0-‘6‘ N "who. lrﬁ1‘uouvi‘o 900.
0. .0 - 0 ... 6 00’ 9 . . . '
”0-.- .. .3001... 6. ...-.09.. ...“.X. ... 6.70 .0 .. . ..0. . .3“... . . 6.... ..-. . . 0...? .. 6100060 06.14 10312001519090.0001. 901,665."..rah‘ I ‘3...
.2000 1‘ 1.0. ... '0. .0 0 ..0 a .0 . 3.7.. 0. . 1.65 - .. .. ...... 0 ,0. 05-11-0000. . 0161961000.:0000’ ".900112!00¢0i92930 609.6-
161..- . .0.. .6 . .7 ....... p 106;. - .. . 6a.. .08.. 19.0.. ...-0. 1.06-1006...:6111\.\2.-.306 totittglﬁi
. .. .. .. - .. 9 . ...-1.106. . ...- 701 0. 1.,0.0.0.\0h.- 66.0 .493. 619.0060|0000 1.0006019001996-060110190‘.
— .0"... .anvﬂdn . . . . .9 . "H0... 0. 0.00. ..0..- .. .- ..0.Q£.0....~u.6-.0 0... . 0. 0-1.0 A. 0... -0 00.060. ...... . ..1. 9 0. 0.. . 0 .-. 0 0. ..100‘0w06- .‘60ﬂkﬂ‘hﬂﬁ6‘gﬂu 0 IO-il’. . 1 0.16006IONL4 ‘0”
O 00 - . 1 . .. - . 62669 .06 0 .0‘.0- 0100. 60. . . .. . . . v . 0‘- 00... 11010 01 1 . 0 0' 00 ‘

.1.... .. ... l.0.t. . . $.31... . ....10. -. ..0. ..6 . ... . . .... .. .. . . . . 3..... . ...0. .01.... - ..0; io- .. . . . . . ... o . .. ...-.... «3.. .0. . ,_ .. . . .. . .. ...? . 1. . .. . . . ... v.4... .. ...... ... 7- . .1. .910... .J.....o|..v.00 .... . 9 6.060100173009006. .l..0-0.000663‘ 1. 36%.

0.1 4.. 00C. 7. .. J... . . ... .. . 0-1.. .. - .. ...: . .60 . .. . .0. . . . . . .... \. .. . ...606 ....0. .. .0. . .6! 5-6 I. .90 -.. l

. 0 . -..|. 0.. 0 ..”0.uu0....d 16.3”” .0.“.. 06:. n-.n.:....-00. 01. ... .0 0a.... .0.“. ....00 .6.— -. ..0 6. 01'."- .l.6.0.0.0.:00. . ..00 . ”.02.... J0 . .1 . .. , ..l. . .0... .0 .. . .1.- 0 {-06.6 0a 0.
- . .. 6 . . . 66.. .. . .00 ... .. 0.. .0. .060 .. . I. . ...0.!.. ... . . . .6 . . . ...0.
w.... ... ...v ....nml.400. 0.. 31.13.. 5? .15.. . .X. .... .60 00...}. 06 ...3. 0.. .0. . .. ...61 3..-... .06 6... . . . - 0.3016 .. x 1.- ..3! . ... 6 . .66.. . . . . .. ...!- 61. 0;. n... . 6. .0 ... . .. 0.
0.. ...? . . . . 0 ..0. . .00 .0. ..L . - . o ..9. . .0 - ... 20.6.1... . . . 0-90. . ..- ..0o-0.0 .. 0 . .
. 1.0.... ..u $6.0 .. ... . . .. . 0... 0:0: 00.. .6. 0. .. .
.v 6 i H . .

 

  

.60...6000 .0..00-
k. u: 0.000.. . ..00. . 00.
0-. .

. 0.. .. ...»..0 .0 ..00. . .0Q .0 o. .
.. ... 0. 0 .v ..

...: . ... ... .... ...

up . p .0 ..L..‘60. 0. .1‘01 . . . . . . . 0....00. .. .10.. ....0... 66 0.016....-

.0 . 00 . . I110 . . . . . . 1.0.. . 5.61:0...- .0. ..0. ..

69. ......6... 06“. .II.J. . . .. . ..

J 0 .0011... . O . 7.0.. .. 0.. Q".
96 1.66. .3662... 6. . .. (..0...-
. 00.0.5 0. ..10 00- 9- .

..00H0....0000 06.. 0 0 .. . 1. 9.. 6 0.
. 076-6 60.... 001...... Z .... 090.. .

 

     
    

. 6.... ... V 0 .. o ...9: .
.‘6 . .. ..3-0...! . . . . -0.- . .. 0 64.0... . . . ..‘l0000 . . 00.0 .v - . .1 .09 o 6.. 0 .0 .. a. 06- 0.
.lo! I 00. . r: ’- ... .. 6. ... . 6. V .0 .0... . c - ... 0. 6. 0O. 000 -\ u.
.0 .0 t 0- . .. .6 6000. . L. '60. . ... 6 6 I 7-0. 66.0.00 0 .. 0 0
01.6.1 , ... to. .6...0... .-. 0
...-...0 ,0 o . .0 .
..0. _

1 :0 ..HI “uﬁ _ L “0.0 9 .J Q.6v “€010 01.0 ..0-M. .0.-00w 0.. .X‘A - ..0-6L . L.f‘.‘~. *Wﬁoku‘.” ’0'.uu..000-..0‘.-u|.&- &... .0 PW“?
0|! . . - 0h:- 0 . . - :o . 0.0 3 . . .
. 0|»..0...0.....6_mu 1‘ .u .0 1‘ him-16"....u. ...“. %.h.0.l0 .0. “I I ._ $1. om..0~.\.0001 . Qua-ymﬁt-F 0.". . ...:w.-Qu.. L02» . AZWMTM. u.“ 3. .00.w.0l|..014«\m0n.0'00.5..£.
l. .... . .01.». .. ... . r 4". ... 11.6.. Mal»: . ...... .. .... ...-10.10.343.19 6. 1.. 3:... 1...... w...- .... . #0.. 1.5%.... 0.... . 0. t
. .0 . at: 0.. 0;“. ﬁt‘“? 5...). Hum-... 3.06.00. ..0.. 0...... 001.031. ...-0 do. .u... ... “...... .10... . .0.-03.002090. .. ...! 3.10.1119 ... b. . .00. .. 6
”6L ..
. .

1:0.....0J:.0 .... ... 0..
.....6 H... 0hr ..h. 0PM:. .0..." . .001“ 0. A 00 0-... iv0....| ..0 .00..

     

      

..th watt... 1...}.

.0.. ..0"

0... . ..
....L: r...r._.
06.. .w_

  

1.0.0.0100... 0.20.0.1..- . . .66 I ...-00.. . .7. 0.-0 ....Q . .. '1'6 0.0. n .. .0

 

 

 

 

 

 

 

 

 

This is to certify that the
dissertation entitled

BOOSTING AND ONLINE LEARNING FOR
CLASSIFICATION AND RANKING

presented by
HAMED VALIZADEGAN

has been accepted towards fulﬁllment
of the requirements for the

Ph.D degree in Computer Science

9
4

/ Major Professor’s Signature
09/27 /2o\ 0.

Date

 

 

MSU is an Afﬁnnative Action/Equal Opportunity Employer

 

LIBRARY A“
Michigan State
University__

——.——u——,

 

 

 

 

PLACE IN RETURN BOX to remove this checkout from your record.
TO AVOID FINES return on or before date due.
MAY BE RECALLED with earlier due date if requested.

 

DATE DUE DATE DUE DATE DUE

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

5/08 K:IProj/Aoo&Pres/CIRC/DateDue.indd

 

 

 

BOOSTING AND ONLINE LEARNING FOR CLASSIFICATION
AND RANKING

By

Hamed Valizadegan

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY
Computer Science

2010

ABSTRACT

BOOSTING AND ONLINE LEARNING FOR CLASSIFICATION AND
RANKING

By
Hamed Valizadegan

This dissertation utilizes boosting and online learning techniques to address several
real-world problems in ranking and classiﬁcation. Boosting is an optimization tool that
works in the function space (as opposed to parameter space) and aims to ﬁnd a model
in batch mode. Typically, boosting iteratively constructs weak hypotheses with respect
to different distributions over a ﬁxed set of training instances and adds them to a ﬁnal
hypothesis. Online learning is the problem of learning a model when the instances are
provided over trials. In each trial, a new sample is presented to the learner, the learner
predicts its class label and then receives some feedback (partial or complete). The learner
updates its model by utilizing the feedback and then a new trial starts.

We consider several learning problems, including the usage of side information in rank-
ing and classiﬁcation, learning to rank by optimizing a well-known information retrieval
measure called NDCG, and online classiﬁcation with partial feedback.

Using side information to improve the performance of learning techniques has been
one research focus of machine learning community for the last decade. In this dissertation,
we utilize the abundance of unlabeled instances to improve the performance of multi-class
classiﬁcation, and exploit the existence of a base ranker to improve the performance of
learning to rank, both using the boosting technique.

Direct optimization of information retrieval evaluation measures such as NDCG and
MAP has received increasing attention in the recent years. It is a difﬁcult task because these
measures evaluate the retrieval performance based on the ranking list of documents induced

by the ranking function, and therefore they are non-continuous and non-differentiable. To

overcome this difﬁculty, we propose to optimize the expected value of NDCG and utilize
boosting technique as the optimization tool.

Online classiﬁcation with partial feedback is recently introduced and has applications
in contextual advertisement and recommender systems. We propose a general framework
for this problem based on exploration vs. exploitation tradeoff technique and introduce
effective approaches to automatically tune the exploration vs. exploitation tradeoff param-

eter.

© Copyright by
HAMED VALIZADEGAN
2010

To my loving parents, Simin Rahimi and Reza Valizadegan, for their

unlimited and unconditional encouragement, support, and love.

ACKNOWLEDGMENTS

During my Ph.D, I have received support from a number of people without whom the
completion of this thesis was not possible.

First of all, I would like to express my deepest gratitude to my thesis advisor, Dr. Rong
Jin, for his unique supervision and guidance. He motivated me to work on a diverse set
of problems in machine learning and provided me with an excellent mathematical and
optimization knowledge support. Under his supervision, I have learned different aspects of
conducting high-quality research and become capable of publishing papers in prestigious
research venues such as NIPS and WW.

For a number of years, I have also worked closely with Dr. Pang-Ning Tan with whom
I published a few papers in data mining. I would like to present my sincere appreciation
for his valuable support during those years. I will never forget his kindness and help.

I would also like to thank my committee members, Dr. Anil K. Jain, Dr. Joyce Chai,
and Dr. Selin Aviyente for their valuable feedback and discussions during my compressive
and thesis exams.

I want also thank the Department of Computer Science and Engineering at Michigan
State University that provides me with the ﬁnancial support in terms of teaching assistant
for a number of semesters. I would like to particularly thank Dr. Abdol-hossein Esfahnian,
Dr. Eric Tomg and Linda Moore for their amazing attitude in helping graduate students in
the department.

The contextual advertisement group of Yahoo! kindly provided me with an exceptional
work atmosphere during Summer and Fall 2008. I would like to thank everyone in their
group, particularly Dr. Jianchang Mao, the head of contextual and display advertisement
science and Ruofei Zhang, my direct mentor.

It has been a great pleasure to collaborate with Dr. Hang Li, the research manager of

Information Retrieval and Mining Group at Microsoft Research Asia, and Dr. Shijun Wang

vi

from National Institute of Health with whom I co-authored research papers in ranking and
online learning, respectively.

Finally, I should thank the members of LINKS and PREP labs for all the great supports
they have provided me with during my Ph.D. Particularly, I would like to thank Wei Tong,
Fenhjie Li, Yang Zhou, Pavan Mallapragada, and Matthew Gerber.

vii

TABLE OF CONTENTS

LIST OF TABLES xi
LIST OF FIGURES xii
1 Introduction 1
1.1 Classiﬁcation ................................... 2
1.2 Learning to Rank ................................. 3
1.2.1 Training set ................................... 5
l .2.2 Evaluation ................................... 6
1.2.3 Learning .................................... 6
1.3 Batch Learning .................................. 7
1.3.1 Boosting .................................... 8
1.4 Online Learning .................................. 11
1.5 Contribution of This Dissertation ......................... 13
1.6 Benchmark Data Sets ............................... 15
1.6.1 Classiﬁcation Data Sets ............................. 15
1.6.2 Ranking Data Sets ............................... 16
2 Semi-Supervised Multi-Class Boosting 18
2.1 Introduction .................................... 19
2.2 Related Work ................................... 22
2.3 Multi-Class Semi-supervised Learning ...................... 23
2.3.1 Problem Deﬁnition ............................... 23
2.3.2 Assemble Algorithm .............................. 23
2.3.3 Design of Objective Function ......................... 25
2.3.4 Multi-Class Boosting Algorithm ........................ 27
2.4 Experiments .................................... 3 1
2.4.1 Experimental Setup ............................... 32
2.4.2 Evaluation of Classiﬁcation Performance ................... 33
2.4.3 Sensitivity to the Combination Parameter C .................. 36
2.4.4 Sensitivity to Base Classiﬁer .......................... 36
3 Optimizing NDCG Measure by Boosting 40
3. 1 Introduction .................................... 41
3.2 Related Work ................................... 43
3.3 Optimizing NDCG Measure ........................... 44
3.3.1 Notation ..................................... 44
3.3.2 AdaRank Algorithm .............................. 45

viii

3.3.3 A Probabilistic Framework ........................... 46

3.3.4 Objective Function ............................... 48
3.3.5 Algorithm .................................... 50
3.4 Experiments .................................... 54
3.4.1 Experimental setup ............................... 55
3.4.2 Results ..................................... 56
4 Ranking Reﬁnement by Boosting 58
4.1 Introduction .................................... 58
4.2 Related Work ................................... 61
4.3 Ranking Reﬁnement ............................... 62
4.3.1 Problem Deﬁnition ............................... 62
4.3.2 Encoding Ranking Information ......................... 63
4.3.3 Objective Function ............................... 64
4.3.4 Boosting Algorithm for Ranking Reﬁnement ................. 69
4.4 Experiments .................................... 74
4.4.1 Experimental Setup ............................... 74
4.4.2 Results for Relevance Feedback ........................ 77
4.4.3 Effect of Base Ranker ............................. 78
4.4.4 Effect of Size of Feedback Data ........................ 79
4.4.5 Results for Recommender System ....................... 79
4.4.6 Time Efﬁciency of Ranking Reﬁnement .................... 80
5 Online Classiﬁcation with Bandit Feedback 85
5. 1 Introduction .................................... 86
5.2 Related Work ................................... 87
5.3 A Potential-based Framework for Classiﬁcation with Partial Feedback ..... 88
5.3.1 Problem Deﬁnition ............................... 88
5.3.2 Banditron .................................... 90
5.3.3 Potential-based Online Classiﬁcation for Partial Feedback .......... 90
5.3.4 Exponential Gradient for Online Classiﬁcation with Partial Feedback . . . . 95
5.4 Experiments .................................... 97
5.4.1 Experimental results .............................. 100
6 Robust Online Classiﬁcation With Bandit Feedback 102
6. 1 Introduction .................................... 102
6.2 Related Work ................................... 105
6.3 Balancing between Exploration and Exploitation ................ 106
6.3.1 Preliminary ................................... 106
6.3.2 Finding Optimal 7 using [’y‘t aé gt] 3 rt and [9} = gt] 5 pt .......... 108
6.3.3 Finding Optimal 7 using [37¢ aé gt] 3 1 and {Q} = yt] _<_ pt .......... 110
6.3.4 Finding Optimal 7 using [3} 76 gt] 3 rt and {1;} = gt] 3 1 ........... 111
6.4 Experiments .................................... l 12
6.4.1 Experimental Settings ............................. 112
6.4.2 Experimental results .............................. 1 l3

ix

7 Conclusion and Future Work 116

7.1 Summary and Conclusions ............................ 116
7.1.1 Boosting .................................... 116
7.1.2 Online Learning ................................ 118
7.2 Future Work .................................... 119
7.2.1 Boosting ........... ' ......................... 119
7.2.2 Online learning ................................. 120
APPENDICES 122
A APPENDIX 123
A] Proof of Lemma 1, Chapter 2 ........................... 123
A2 Proof of Lemma 2, Chapter 2 ........................... 124
A3 Proof of Theorem 4, Chapter 2 .......................... 125
A4 Proof of Proposition 2, Chapter 3 ......................... 126
A5 Proof of Lemma 4, Chapter 3 ........................... 126
A6 Proof of Theorem 5, Chapter 3 .......................... 127
A.7 Proof of Theorem 6, Chapter 3 .......................... 127
A8 Proof of Theorem 7, Chapter 3 .......................... 128
A9 Proof of Theorem 8, Chapter 4 .......................... 130
A.10 Proof of Lemma 5, Chapter 4 ........................... 131
All Proof of Theorem 9, Chapter 4 .......................... 131
A.12 Proof of Theorem 10, Chapter 4 ......................... 132
A.13 Proof of Proposition 5, Chapter 5 ......................... 133
A.14 Proof of Theorem 11, Chapter 5 ......................... 133
A.15 Proof of Lemma 7, Chapter 5 ........................... 135
A.16 Proof of Theorem 14, Chapter 6 ......................... 135
A.17 Proof of Proposition 6, Chapter 6 ......................... 136
A.18 Proof of Proposition 7, Chapter 6 ......................... 136
A.19 Proof of Proposition 8, Chapter 6 ......................... 137
BIBLIOGRAPHY 138

LIST OF TABLES

1.1 Description of the classiﬁcation data sets used in this dissertation ...... 16

1.2 Description of data sets in Letor 3.0 ...................... 17

xi

2.1
2.2
2.3

3.1

4.1
4.2
4.3
4.4
4.5
4.6

5.1
5.2

6.1
6.2

LIST OF FIGURES

Performance comparision ........................... 35
Sensitivity to parameter 0 .......................... 37
Sensitivity to the base ranker ......................... 39
The experimental results in terms of NDCG for Letor 3.0 data sets ..... 57
Reduction of the objective function Lp using the OHSUMED Data Set . . . 71
NDCG of relevance feedback for different algorithms ............ 81
NDCG of MRR with different base rankers for relevance feedback ..... 82
NDCG of MR with different numbers of feedback ............. 83
The ranking result for recommender system ................. 84
Running time of MR for different numbers of movies ........... 84
Performance comparisons of different methods ............... 98
Performance comparisons of different methods with varied 7 ........ 99
The error rates of Banditron with different choice of 7 ............ 104

The error rates of different methods over trials ................ 114

xii

Chapter 1

Introduction

Learning is the task of constructing a prediction model using training data. A learning task
is deﬁned by an objective function that evaluates the performance of each model in the do-
main. A variety of objective functions for learning are deﬁned for different learning tasks.
These learning tasks differ in I) their type of prediction, H) the type of feedback/labeling
for training data, and III) the way training data are presented to them.

Based on the type of prediction, the learning algorithms can be classiﬁed into three
major groups: classiﬁcation, regression, and learning to rank. A regression model aims
to map an instance to a numerical value. A classiﬁcation model (classiﬁer) categorizes
instances into predeﬁned classes and a ranking model (ranker) orders a series of items
based on a given request.

Training instances can be presented to the learner in two different ways: batch mode
and online mode. In batch mode, a set of training instances are provided to the learner and
the learner trains a model off-line. The learned model is evaluated based on the prediction
made for unseen test instances. We usually assume the training instances are i.i.d samples
from an unknown distribution and the objective is to learn a statistical model that is able to
make accurate prediction for unseen instances sampled from the same distribution of the

training data. In online mode, the task of learning and making prediction are performed

at the same time; i.e. the learner applies the current model to each received instance, and
then receives the feedback for that instance and consequently updates the model based on
the instance and the feedback. In online mode, we do not have to make the i.i.d assumption
regarding the received instances and the data generator produces instances arbitrarily [l].
The feedback for the training instances can be either partial or full in online mode
and the label for training instances can be either present or absent in batch mode. Each
of these combination results in different learning tasks. When we discuss batch learning
in more details in section 1.3, we cover a brief description of semi-supervised learning,
in which part of training instances are unlabeled; we discuss online learning with partial
feedback in Section 1.4 where the feedback only indicates if the predicted class is correct.
In the following sections, we focus on classiﬁcation, learning to rank, batch and online

learning to draw the direction of materials in the future chapters of this thesis.

1.1 Classiﬁcation

Classiﬁcation is the task of categorizing instances into predeﬁned classes and has found
countless number of applications. In the fully supervised mode, the learning algorithm re-
ceives a set of labeled instances, each represented by a vector of features and a label that
shows its class assignment. The objective of the learning algorithm is to learn a classiﬁer
that is able to make accurate prediction for unseen examples, generated by the same distri-
bution for training instances. The ability of a learner in producing models that perform well
for unseen instances is called generalization ability [2] in the machine learning literature.
Many effective algorithms have been proposed for the task of supervised classiﬁcation,
such as Support Vector Machines (SVMS) [3], logistic regression [2], and boosting [4].
Classiﬁcation is one of the oldest machine learning tasks. Nonetheless, it still ﬁnds
applications that demands developing new techniques. One of the major challenges we

address in this dissertation is to learn a classiﬁcation model from partial feedbacks. As

an example, consider the problem of contextual advertisement that chooses advertisements
to display on a web page for a speciﬁc user [5]. The contextual advertisement algorithms
are usually based on this assumption that users provide feedback by clicking on relevant
advertisements [5]. However, if none of the displayed advertisements are relevant to the
user’s information needs, they will not be clicked and consequently the algorithm does not
know which advertisements are relevant for the user. We refer to this scenario as partial
feedback as opposed to the case of full feedback where the correct output (i.e., the relevant
advertisement) is provided for each instance. This task demands new online learning algo-
rithms that are able to learn over the trials in the partial feedback setting. In particular, the
online algorithms need to explore the exploration vs. exploitation trade-off techniques that
are primarily developed for multi-armed bandit problem [6].

The performance of a classiﬁcation algorithm is usually evaluated by the classiﬁcation
accuracy. For the evaluation of multi-class or multi-label learning, the classiﬁcation ac-
curacy may not be sufﬁcient, particularly when the number of classes is large or classes
are unbalanced. In those cases, the most commonly measures used for classiﬁcation are

precision, recall or a combination of these two, such as F1 measure and ROC curve.

1.2 Learning to Rank

Ranking is the task of ordering a list of offerings for a given request. It receives a set of
offerings and a request as input and outputs the list of offerings sorted according to their
relevancy to the request. The performance of a ranking algorithm is evaluated based on
how well it sorts the offering according to their relevancy to the request. Learning to rank
is the task of learning a ranking function that can order the offerings for unseen requests. It
receives a set of requests, each with a sorted list of offerings as the training set and produces
a ranking function to sort offerings for new requests. Learning to rank is a relatively new

area of study in machine learning that has received much attention in recent years because

of its important role in a variety of applications including:

0 Document Retrieval: In document retrieval, the request is a textual query (a set of
keywords) and the offerings are documents. Users provide a set of keywords to the
system and the ranking system should retrieve the most relevant documents to those

keywords.

o Recommender Systems: In recommender systems, the request is a user and the
offerings are the items to be recommended. For example, in movie recommenda-
tion system, a ranking system aims to recommend the most interesting movies to a

particular user based on the history of users and movies information.

o Sentiment Analysis: In sentiment analysis, the request is a text and the offerings are

the attitudes of the author regarding to a particular subject.

9 Computational Biology: In computational biology, a request is a protein and the
offerings are the list of different 3d structures. The objective is to provide a sorted

list of 3d structures for a given protein.

0 Online Advertisement Placement: In online advertisement placement, the request
is a user visiting a web page and the offerings are the advertisements. Online adver-
tisement systems should rank the relevancy of different advertisements to that user
and display the most relevant advertisement on the web page in order to maximize

the number of clicks on the advertisements.

Throughout this thesis, we use the document retrieval terminology (e.g. query for request,
document for offering) when talking about ranking although the material are applicable
to other domains. Since learning to rank is a relatively new problem, we describe it in
more details here. A learning to rank system usually consists of three components that

distinguish it from classiﬁcation and regression.

1.2.1 Training set

The training set for learning to rank consists of a set of queries. For each query, a list of
documents and their relevancy to the query are provided. The common practice in learning
to rank is to assume the existence of a set of base rankers that can be considered the feature
generators for query-document pairs. PageRank [7], vector space model [8], and statistical
language models [9] such as BM25 are some example base rankers. These base tankers are
basically unsupervised models that measure the relevancy of each document to a query. The
value produced by each base ranker is considered a feature for a query-document pair and
the learning to rank algorithm aims to combine these feature values to produce a ranking
function.

The label information in learning to rank is in form of relevancy judgments that can be
of three different types: relevancy scores, pairwise relevancy information (partial ordering)
and a complete ordering. A relevancy score is a numerical value (e.g. 1,2,..) that shows
the level of relevancy of documents to a given query [10]. Relevancy scores are the most
widely used relevancy information. The pairwise relevancy information is the relative
relevancy between two documents that indicates which document among the two is more
relevant. The pairwise relevancy can often be derived from the implicit feedbacks from
users. For example, in search engines, when a user clicks on one of the ranked documents,
it is safe to infer that the clicked document is more relevant than the documents that are
ranked before the clicked one. This type of click-through feedback provides the relative
relevancy for pairs of documents [11]. A less commonly used relevancy information is a
complete relevancy ordering of documents to a given query [12] in which documents are
ordered in the descending relevancy. Notice that the relevancy scores can be converted to a

pairwise ordering and complete ordering but the Opposite is not true.

1.2.2 Evaluation

The performance of a ranking system is evaluated based on how well it predicts the rele-
vancy of documents to a query. Several evaluation measures are introduced in the literature.
Area under the ROC Curve (AUC), Mean Average Precision (MAP), and Normalized Dis-
counted Cumulative Gain (NDCG) are some of the most-widely used measures. AUC is
based on the Wilcoxon test, a nonparametric statistical test to measure the distributional
difference between two sets of numbers. AUC works only for two levels of relevancy judg-
ments and measures how well a ranking function places the relevant documents on the top
of the irrelevant documents. AUC treats documents similarly regardless of their position in
the ordered list. However, the top retrieved documents are more important because users
only look for the relevant documents at the top of the list (e.g. consider a search engine in
which users only look at the ﬁrst few pages of retrieved links). Based on this observation,
MAP [13] and NDCG [14] are constructed to put more weight on the documents at the
top of the list. Similar to AUC, MAP only works for binary relevancy judgment. On the
other hand, NDCG is a general evaluation measure that can handle ranking problems with

multiple levels of relevancy judgements.

1.2.3 Learning

Three types of learning to rank algorithms can be found in the literature: Pointwise, pair-
wise and listwise approaches. Pointwise approaches [15-17] can be applied when the
relevancy scores of documents are available. In this case, the relevancy scores are consid-
ered as absolute quantities and a classiﬁcation or regression technique is applied by treating
the relevancy scores as class labels or numerical values. The pairwise approaches are the
only group of techniques that can handle the pairwise relevancy information. They ap-
ply a classiﬁcation or regression technique to learn the ordering information of pairs of
documents [18-23]. The third group of algorithms, the listwise approaches, are the most

effective learning to rank techniques that have been studied in the last few years. They are

motivated by this observation that most evaluation metrics of information retrieval measure
the ranking quality for individual queries, not documents. These approaches consider the
ranking list of documents for every query as a training instance [13, 24—29] by optimizing

a listwise loss function. We describe these techniques in more details in Chapter 3.

1.3 Batch Learning

In batch learning, a set of training instances are provided that are generated by an unknown
distribution. The goal is to train a model off-line that is capable of making accurate predic-
tion for unseen instances. As mentioned before, dependent on the type of training instances
and their labels, different learning tasks can be deﬁned. For example, in classiﬁcation, each
instance is a vector of features and the label is the class assignment. And in the listwise
approach to learning to rank, each instance consists of a query, the list of its documents,
and the relevancy of documents to the query.

Training instances can be either all labeled or partially labeled that results in two dif-
ferent modes of learning: supervised and semi-supervised learning. All training instances
are labeled in supervised learning and plenty of unlabeled instances are provided in case of
semi-supervised learning to help the process of learning. The usage of unlabeled instances
are based on some assumptions about the data generating process such as manifold and
cluster assumption [30—35]. We return to these assumptions in Chapter 2.

In most studies of batch learning, an objective function is designed to measure the
performance of a given model (function) on instances. Different learning algorithms can
be designed by deﬁning different objective function for the same task. For example, in case
of classiﬁcation, the negative log-likelihood function is used in logistic regression, a hinge
loss leads to support vector machines, and etc. In case of learning to rank, the pointwise
approaches utilize a classiﬁcation or regression model, i.e. they utilize a classiﬁcation or

regression loss function. Similarly, pairwise approaches are concluded from designing a

classiﬁcation or regression model on pair of documents and a listwise learning to rank
algorithm results from utilizing a loss function in the level of query.

Given an objective function (loss function) L(F) to measure the performance of a given
model F, learning translates to the process of ﬁnding F that optimizes L(F). A common
approach is to restrict the model to a member of a parametric family F (w) (e.g. a linear
model). This constraint translates the objective function L(F) into an objective function of
parameters 212, i.e. L(w), and consequentially the optimal model is found by optimizing the
objective function with respect to w. In this case, L(w) is called a function in the param-
eter space. A different approach is to directly optimize L over function F. This approach
optimizes the objective function in the function space and is called boosting. Boosting is

the optimization technique we utilize in this thesis for the batch mode algorithms we cover.

1.3.1 Boosting

Boosting [4, 36] is a popular technique with a greedy nature designed to optimize a given
objective function in the Space of functions. This is very important because it allows to
boost the performance of any base function (weak learner) once the problem is written in
the function space. Boosting can be considered as a gradient descent algorithm applied
in the function space [37]; in each step i, it learns a new direction ft and a step size at
to move as much as possible toward the optimum point, which results in a ﬁnal solution
of Fn = 2L1 aifi. Instead of applying a direct optimization approach such as gradient
descent, bound Optimization strategies [38] may be used; this is because f,- and a,- are
dependent on each other and it is difﬁcult to decide the values for fz- and O, simultaneously.
The bound optimization strategy is often applied to decouple the dependency between f,-
and ai. We use this technique in different parts of this thesis.

First introduced by Schapire [4], boosting was initially designed to convert a weak
learner that performs just slightly better than random guessing into an accurate classiﬁer.

Here, by random guessing, we mean a classiﬁer with less than 50% classiﬁcation error.

However, as we will Show throughout this dissertation, the meaning of random guessing
can change from one problem to another. In this view, given a set of labeled training exam-
ples (xi, 3],), i = 1..n, a boosting algorithm provides the weak learner with a set of weighted
training examples at each round. The weak learner constructs a model by optimizing its
loss over the weighted training examples. In the new iteration, the boosting algorithm pro-
duces a new set of weighted examples by increasing the weights for the examples that are
misclassiﬁed in the previous round. The iterations are repeated till the algorithm converges.

One well-known boosting algorithm is AdaBoost [39], developed based on an expo-
nential loss function for classiﬁcation. Algorithm 1 shows AdaBoost algorithm. At the
beginning of this algorithm, the booster chooses a uniform weighting over the examples
(Step 3). Given the weights produced by the booster, the weak learner constructs a bi-
nary classiﬁer that minimizes the loss Ct at Step 5. The booster then produces a new set of
weights for the examples in Step 8 by increasing the weights for the examples misclassiﬁed
in the previous round of learning (Steps 6 and 7). These steps are repeated for a number of
times. We have the following bound for the misclassiﬁcation error of the ﬁnal hypothesis

generated by AdaBoost algorithms:
6 g 2TH;1 et(1— at) (1.1)

where 6t is the classiﬁcation error for the hypothesis generated in round t. The above result
shows that, under the assumption of weak classiﬁer, the classiﬁcation error is guaranteed
to be reduced as the iteration proceeds.

Using the Minimax theorem, Freund et a1. [39] showed that there is a mixed strategy
over the space of hypotheses H that produces zero classiﬁcation error over the training
set if (H, X) is 7-learnable. The progress of a boosting algorithm is measured by how

much the classiﬁcation error (or a given loss) decreases at each iteration (or over time) and

 

For 7 > 0, a learning algorithm is 7-leamable if for any distribution Q over training examples X, the
algorithm can return h E H with at most % — 7 classiﬁcation error

 

Algorithm 1 AdaBoost Algorithm
1: Input
1. A weak learner, and a set of training examples
2. A set of training examples (x1,y1),...,(:rm,ym) where 1:,- E X and y,- 6

{—1,1}.

 

2: Initialize F(:r,-) = 0,2' = 1, .., m

3: Initialize 01(2) = 1/m,z‘ = 1, ...,m

4: repeat

5: Find the classiﬁer ft : X —> {—1,1} that minimizes 6t = 2&1 Dt(i)I(y,- 74
ft($i))

6: Compute at = %ln(l—:t—€t)

7 Compute F(a:,-) = F(:c,') + aft(:r,-), i: 1, ..,m

(2.) = Dt(i)exp(20tyift($i))

8: Compute the new weighting Dt+1 t

malization factor.
9: until reach the maximum number of iterations

where Zt is the nor-

 

deﬁned in the following form:
M(Pt1 Q0) S HtT=16(M(htr Qt)) (12)

where 6 is an increasing function of the loss, M (Pt, Q0) is the suffered loss when the
majority vote Pt is used over H and Q0 is the uniform distribution over X (i.e. M (Pt, Q0)
is the computed loss of weighted majority vote over the original samples), and M (ht, Qt)
is the computed loss at round t (i.e. the loss suffered when a single hypothesis ht is applied
over the weighted samples set Qt).

Beside classiﬁcation and regression, boosting has been applied to a wide range of ap-

plications including:

o Semi-Supervised Learning: Boosting can be utilized to adapt a supervised learner
to the problem of semi-supervised learning. For example, [40] used binary classiﬁer
as the weak learner and boosted it for the task of semi-supervised classiﬁcation and
[41] exploited a binary supervised learner as the weak learner and boosted it for

semi-supervised clustering.

10

0 Learning to Rank: Boosting is used to learn a ranking function to order the rel-
evancy of documents for a query. RankBoost [19] and AdaRank [42] are example
applications of boosting to ranking. RankBoost uses pairwise binary classiﬁer and
boost it for ranking and AdaRank adapts AdaBoost to optimize information retrieval
evaluation measures such as Normalized Discounted Cumulative Discount (NDCG)

and Mean Average Precision (MAP).

1.4 Online Learning

Online learning is the task of learning when the examples are provided sequentially (over
the trials). In each trial, the learning algorithm receives a new example, classiﬁes it and
then acquires some sort of feedback. Using this feedback, the online learning algorithm
updates the model in order to better classify the future examples. The feedback provided
to the online algorithm can be either full or partial. In the full feedback setting, after
classifying an instance, the algorithm receives its true class label. One well-known example
of such online learning algorithm is the well-known Perceptron algorithm [43]. In the
partial feedback or "Bandit" setting, the true label is not revealed and the feedback is
limited to whether or not the algorithm classiﬁed the instance correctly. Since the difference
between full and partial feedback in the above discussion only makes sense for the case of
multi-class classiﬁcation, the online classiﬁcation with partial feedback is called multi-class
bandit learning [5]. The objective of the learner is to generate a sequence of hypotheses that
guarantees a small cumulative loss in the long run when compared to the best hypothesis

in the space of hypothesis; i.e.

T T
1 1 .
5: 2:1: M (Pt. Qt) S ? mgn t§=1:M(P.Qt) + 5(T) (1-3)

11

where 6 is a decreasing function of T and should approaches zero when T approaches
inﬁnity.
The bandit feedback has several real-world applications such as online advertise-

ment [5] and recommender systems [5], as described in the following

0 Online Advertisement: In online advertisement, we often assume that a sponsored
ad is likely to be relevant to the user’s query if it is clicked by the user, and irrelevant
otherwise. In the case when the sponsored ad does not receive a click, the online
advertisement algorithm is unable to locate the advertisements that are relevant to

the given query, leading to the partial user feedback.

0 Recommender Systems: A recommender system recommends some items (e.g.
movies) to the user. The assumption is that if one of the recommended movies are
selected by user, that movie was a correct recommendation. However, if none of the
recommended movies are chosen by the user, the recommender system is not able to

discover the right set of movies for that user.

While the problem of online classiﬁcation with full feedback is well-studied, online clas-
siﬁcation with bandit feedback has received attention only recently [5]. Kakade et a1. [5]
introduced Banditron as an extension to Perceptron [43] to handle the partial feedback set-
ting. Online learning with bandit feedback can be regarded as the problem of multi-armed
bandit [44] when some side information (e.g. the feature vector of instances) is available.
Multi-armed bandit is the generalized version of one-armed bandit game (a traditional slot
machine) in which several levers are provided and the player aims to choose a lever that
maximizes the rewards in the long run. At each stage, the player only knows the reward
for the lever he chooses; the rewards for the remaining levers are unknown to the player.
In a more abstract level, multi-armed bandit problem refers to the problem of choosing an
action from a list of actions to maximize rewards given that the feedback is (bandit) partial.

The algorithms developed for this problem usually utilize the exploitation vs. exploitation

12

tradeoff strategy to handle the challenge arising from partial feedback [45—47].

1.5 Contribution of This Dissertation

We address several important ranking and classiﬁcation problems in this dissertation. Uti-
lizing side information in ranking and multi-class classiﬁcation, direct optimization of in-
formation retrieval measures such as NDCG, and online learning in the bandit setting are

the subjects we cover, as summarized here:

0 Semi-supervised Classiﬁcation: The focus of semi-supervised classiﬁcation is on
constructing better models by utilizing unlabeled instances when the number of la-
beled instances is small. Several semi-supervised classiﬁcation algorithms are devel-
oped based on manifold [32—35] and cluster [30, 48, 49] assumptions. Most of these
techniques work for binary problems and converting techniques such as one-versus-
one and one-versus-the-rest are applied to use them for multi-class problems [50].
This converting procedures has several well-known problems including imbalanced
classiﬁcation and different output scales of different binary classiﬁers. We utilize
both manifold and cluster assumptions in Chapter 2 and design an objective function
that directly addresses multi-class semi-supervised problem. We solve this objec-
tive function in the function space using boosting technique. Our empirical Study
Shows the superior performance of this boosting algorithm compared to the existing

boosting algorithms for multi-class problems.

0 Ranking by optimizing NDCG: The objective in this problem is to learn a ranking
function by maximizing Normalized Discounted Cumulative Gain (NDCG), the most
frequently used information retrieval evaluation measure for ranking problems with
multi level relevance judgement [10]. This is a difﬁcult problem because NDCG
is a non-differentiable and non-continuous loss function. In order to overcome this

difﬁculty, we introduce the expected value of NDCG and solve it in the function space

13

using the boosting technique. The detailed discussion of this boosting algorithm is

provided in Chapter 3.

Ranking Reﬁnement: In some real world applications, there are two complementary
sources of information for ranking, ranking information given by the existing ranking
function (i.e., the base ranker) and that obtained from users feedback. One example
of such applications is relevance feedback, where the two sources of information
are the relevance scores obtained from a ranking function like BM25 [51] and the
relevance judgments obtained by the users. The key challenge in combining the two
sources of information arises from the fact that the ranking information presented
by the base ranker tends to be imperfect and the ranking information obtained from
users’ feedbacks tends to be noisy. We encode these sources of relevancy information
in form of pairwise relevancy and design an objective function to combine them.
We also design a boosting algorithm to solve the resulting objective function. The
detailed discussion is provided in Chapter 4 where we perform extensive experiments

to Show the superiority of our proposed framework to several baselines.

Online Multi-class Learning with Partial Feedback: Unlike online learning with
complete feedback that has been extensively studied [52], the problem of online
multi-class learning with bandit feedback was introduced very recently [5]. Ban-
ditron, the ﬁrst introduced algorithm for multi-class learning with bandit feedback,
is a direct generalization of Perceptron to the case of partial feedback that uses ex-
ploration vs. exploitation tradeoff strategy to handle partial feedback [5]. Using
potential function and exploration vs. exploitation tradeoff technique, we develop a
general framework in Chapter 5, of which Banditron is a special case. The major
problem with Banditron is that its performance could be sensitive to the parameter
that trades off between exploration and exploitation [53]. We develop an effective

approach in Chapter 6 to reduce this dependency.

l4

1.6

Benchmark Data Sets

Throughout this dissertation, we use two sets of data to study the performance of the pro-

posed methods, one set for multi-class classiﬁcation and one set for learning to rank, as

described in the following subsections. We use 5 folds cross validation to run all the exper-

iments except for online learning.

1.6.1 Classiﬁcation Data Sets

Multiple benchmark data sets from UCI data repository [54] and LIBSVM web page [55]

are used in our study. Here is the list and a brief description of these data sets:

MNIST. MNIST is comprised of grey scale images of size 28 x 28 for hand written
digits. It contains 60000 training samples, each represented by 780 features.
Protein. Protein has 17766 samples, represented by 357 features and three classes.
Letter. Letter contains 15000 instances of 26 characters, each represented by 16
features.

optdigits. This data set consist of normalized bitmaps for handwritten digits from 30
people. It contains 3823 instances, each represented by 64 features.

pendigits. This is another collection of images for handwritten digits. It contains
7495 samples, each represented by 16 features.

Nursery. Originally developed to rank applications for nursery school, it has 12960
records, each represented by 8 features belonging to one of 4 classes (we removed
one class that only had two samples).

Isolet. Isolet contains 7797 Spoken alphabet that belong to 26 classes, with each

letter forming its own class. Every spoken alphabet is represented by 617 attributes.

Notice that for some of these data sets, there were two separate sets, one for training and

one for testing. We only used the training set in our experiments. The information related

to these data sets are summarized in Table 1.1.

15

Table 1.1: Description of the classiﬁcation data sets used in this dissertation

 

 

 

 

 

 

 

 

 

Instances Features Classes

Isolet 7797 617 26
MNIST 60000 784 10
Protein 17766 357 3

Optdigits 3823 64 26
Nursery 12960 8 3

Letter 15000 16 26
Pendigits 7495 16 26

 

 

 

 

 

1.6.2 Ranking Data Sets

We use data sets from information retrieval and recommender systems to study the per-
formance of ranking algorithm in our studies. For information retrieval, we use ver-
sion 3.0 of LETOR package provided by Microsoft Research Asia [56]. LETOR Pack-
age includes several benchmark data sets for ranking, along with the state-of-the-art algo-
rithms for learning to rank and tools for evaluation. There are seven data sets provided
in the LETOR package: OHSUMED, Top Distillation 2003 (TD2003), Top Distillation
2004 (TD2004), Homepage Finding 2003 (HP2003), Homepage Finding 2003 (HP2003),
Named Page Finding 2003 (NP2003) and Named Page Finding 2004 (NP2004). There are
106 queries in the OSHUMED data sets, with each query equipped with around 1000 man-
ually judged documents. The relevancy of each document in OHSUMED data set is scored
in three levels: 0 (irrelevant), 1 (possibly) or 2 (deﬁnitely). The total number of query-
document relevancy judgments provided in OHSUMED data set is 16140 and there are 45
features used to represent each document-query pair . For TD2003, TD2004, HP2003,
HP2004, NP2003, and NP2004 there are 50, 75, 150, 75 150 and 75 queries, respec-
tively, with about 1000 retrieved documents that are manually judged for each query. This
amounts to a total number of 49058, 74170, 147606, 148657 and 73834 query-document
pairs for TD2003, TD2004, HP2003, HP2004 and NP2003 respectively. For these data

 

Unlike the classical supervised learning, in Ieaming to rank,the representation of documents depends on
the given query. Hence, features are extracted for each document-query pair, not just for individual documents

16

Table 1.2: Description of data sets in Letor 3.0.

 

 

 

 

 

 

 

 

Query document pair Queries Relevancy level Features
OHSUMED 16140 106 3 45
TD2003 49058 50 binary 63
TD2004 74170 75 binary 63
HP2003 147606 150 binary 63
HP2004 74409 75 binary 63
NP2003 148657 150 binary 63
NP2004 73834 75 binary 63

 

 

 

 

 

 

 

sets, there are 63 features extracted for every query-document pair. A binary relevancy
judgment is provided for every query-document pair. This information is summarized in
Table 1.2.

For every data sets in LETOR, ﬁve partitions are provided to conduct the ﬁve-fold cross
validation, and each partition is further divided into the training set, testing set, and vali-
dation set. The retrieval results for a number of state-of—the-art learning to rank algorithms
are also provided in the LETOR package. We will describe these algorithms in details in
Chapter 3.

In order to evaluate the performance of the proposed ranking algorithms for Recom-
mender System, we use the MovieLens dataset, available at [57], which is one of the most
popular data sets for the evaluation of information ﬁltering. It contains 100, 000 ratings
ranging from 1 to 5, with l as the best rating and 5 as the worst rating for 1682 movies
given by 943 users. Each movie is represented by 51 binary features: 19 features are de-
rived from the genres of movies and the rest 32 features are derived from the keywords
that are used to describe the content of movies. To extract the content features, we down-
loaded the keywords of each movie from the online movie database IMBD and selected the

keywords that are mostly used by the 1682 movies.

17

Chapter 2

Semi-Supervised Multi-Class Boosting

Most semi-supervised Ieaming algorithms are designed for binary classiﬁcation. They are
extended to multi-class classiﬁcation by approaches such as one-against-the-rest. The main
shortcoming of these approaches is that they are unable to exploit the fact that each exam-
ple is only assigned to one class in the case of multi-class Ieaming. Additional problems
with extending semi-supervised binary classiﬁers to multi-class classiﬁcation include im-
balanced classiﬁcation and different output scales of different binary classiﬁers. Given that
there are well-known multi-class classiﬁcation techniques such as decision tree and multi-
layer perceptron, the research question is whether it is possible to use these techniques as
weak learner and boost their performance for the task of semi-supervised Ieaming. The
main challenge in designing such boosting algorithms is that the deﬁnition of the loss for
unlabeled exampels is not clear. One approach is to generalize the notion of margin for
labeled instances to unlabeled instances. This approach computes the margin for unlabeled
examples by considering their assigned labels at the current iteration of the algorithm.
However, Since the labels computed in the early iterations is likely to be inaccurae, this
strategy produces undesireable results.

Unlike the exising boosting algorithms for semi-supervised Ieaming which are only

based on the classiﬁcation conﬁdence (margin) of the exampels (i.e. cluster assumption),

18

we utilize both the classiﬁcation conﬁdence and the similarity among examples (i.e. the
manifold assumptions) to design a loss function for multi—class semi-supervised learning.
We further develop a boosting algorithm for efﬁcient computation. Empirical study with the
multiple benchmark datasets shows that the proposed MCSSB algorithm performs better

than the state-of-the-art boosting algorithms for semi-supervised learning.

2.1 Introduction

Semi-supervised classiﬁcation combines the hidden structural information in the unlabeled
examples with the explicit classiﬁcation information of labeled examples to improve the
classiﬁcation performance. Many semi-supervised Ieaming algorithms have been studied in
the literature. Examples are density based methods [30, 31], graph-based algorithms [32—
35], and boosting techniques [40, 48, 49]. Most of these methods are based on either
manifold assumption [32—35] or cluster assumption [30, 48, 49]. Under the manifold as-
sumption, the data is assumed to reside on a low dimensional manifold within the original
high dimensional space and the class assignment of unlabeled examples can be derived
from a classiﬁcation function that lives in this low dimensional manifold. Under the cluster
assumption, the examples of the same class tends to be closer to each other than those of
different classes. As a result of this assumption, the decision boundary is expected to pass
through the low density regions. Thus, a given semi-supervised Ieaming is usually speci-
ﬁed by a combination of two terms, with one term related to the classiﬁcation error on the
training examples and the other term related to how well the model satisﬁes the assumption
(either manifold or cluster assumption).

While most of semi-supervised classiﬁcation approaches were originally designed for
two class problems, many real-world applications, such as speech recognition and object
recognition, require multi-class categorization. To adopt a binary (semi-supervised) Ieam-

ing algorithm to problems with more than two classes, a common practice is to divide a

19

multi-class learning problem into a number of independent binary classiﬁcation problems
using techniques such as one-versus—the-rest, one-versuS-one, and error-correcting output
coding [58]. The main shortcoming with these approaches is that the resulting binary clas-
siﬁcation problems are independent. As a result, these approaches are unable to exploit the
fact that each example can only be assigned to one class. This issue was already pointed
out in the study of multi-class boosting [59]. In addition, since every binary classiﬁer is
trained independently, their Outputs may be on different scales, making it difﬁcult to iden-
tify the most likely class assignment based on the classiﬁcation scores [60]. Though cali-
bration techniques [61] can be used to alleviate this problem in supervised classiﬁcation, it
is rarely used in semi-supervised Ieaming due to the small number of labeled training ex-
amples. Moreover, techniques like one-versus-the-rest, where the examples of one class are
considered against the examples of all the other classes, could lead to the imbalanced clas-
siﬁcation problem. Although a number of techniques have been proposed for supervised
learning in multi-class problems [59, 62, 63], none of them addressed semi-supervised
multi-class learning problems, which is the focus of this chapter.

Given that the supervised multi-class classiﬁcation is a well-studied subject, an im-
portant research question is whether it is possible to develop a general semi-supervised
framework that is able to improve the accuracy of a given supervised multi-class Ieaming
algorithm by effectively exploring the abundance of unlabeled data. The immediate answer
to this question is boosting technique. The objective of semi-supervised classiﬁcation is to
learn a hypothesis that makes minimum number of misclassiﬁcation on the labeled exam-
ples and utilizes the unlabeled data for a better generalization. Given a loss function for the
labeled and unlabeled examples, a boosting algorithms can be deﬁned by reweighting each
instance based on the current value of the loss.

One straightforward approach to deﬁne the loss for unlabeled examples is to consider
the classiﬁcation conﬁdence as the loss for unlabeled instances. The difﬁculty comes from

the fact that the classiﬁcation conﬁdence related to the unlabeled examples are unknown.

20

One approach to address this problem is to use the class labels predicted by the current
model as the pseudo-labels for the unlabeled examples and utilize them to obtain the clas-
siﬁcation conﬁdence (or margin). Assemble [48], as described in Section 2.3.2, is con-
strucuted based on the idea of pseodu-labels. The problem with utilizing pseudo-labels
to compute the loss for unlabeled examples is that the pseudo-labels assigned in the early
steps of the algorithm is not precise and can lead to undesireable result of the boosting
algorithm. Particularly, this approach does not directly utilize the underlying properies of
data described as a manifold or cluster assumption. Moreover, since all the existing semi-
supervised boosting algorithms are designed for binary classiﬁcation, they will still suffer
from the aforementioned problems when applied to multi-class problems.

To avoid the above problems, we design a boosting algorithm in this chapter by con-
sidering a multi-class loss function that utilizes both the manifold and cluster assumption;
i.e. it consists of two terms, one releated to the consistency of the predicted labels and
similarity between the examples, and one related to the consistency between the predicted
labels and the true labels of labeled examples. To minimize this loss function, we develop a
semi-supervised boosting framework, termed Multi-Class Semi-Supervised Boosting (MC-
SSB), that is designed for multi-class semi-supervised learning problems. By directly solv-
ing a multi-class problem, we avoid the problems that arise when converting a multi-class
classiﬁcation problem into a number of binary ones. Moreover, unlike the existing senti-
supervised boosting methods that only assign pseudo-labels to the unlabeled examples with
high classiﬁcation conﬁdence, the proposed framework decides the pseudo labels for un-
labeled examples based on both the classiﬁcation conﬁdence and the similarities among
examples. It therefore effectively explores both the manifold assumption and the cluster-
ing assumption for semi-supervised learning. Empirical study with UCI datasets shows the
proposed algorithm performs better than the state-of—the-art algorithms for semi-supervised

learning.

21

2.2 Related Work

Most semi-supervised Ieaming algorithms can be classiﬁed into three categories: density
based methods [30, 31], graph-based algorithms [32—35], and boosting techniques [40, 48,
49]. As mentined in Section 2.1, these methods are based on either cluster or manfold
assumption, dependent on how they utilize the unlabeled examples. Denisty-based meth-
ods are usually based on ﬁnding a decision boundary that passes through sparse regions
and have the maximum margin to both labeled and unlabeled examples [30, 31, 48, 49].
Cluster-based learners utilize a similarity measure between examples and construct a graph
to propagate the labeling information to the unlabeled instances [32—35].

Semi-supervised learning algorithms can be also categorized into inductive and trans-
ductive learner based on their functionality. A semi-supervised learner is called trans-
ductive if it does not produce a classiﬁer and cannnot operate on the unseen exampels.
Otherwise, it is called inductive. The algorithm we developed in this chapter works in the
inductive mode.

Semi-supervised SVMS (S3VMS) or Transductive SVMS (T SVMS) are the semi-
supervised extensions to Support Vector Machines (SVM). They are essentially density-
based methods and assume that decision boundaries should lie in the sparse regions. Un-
like their name, TSVMS can work in inductive mode. Although ﬁnding an exact S3VM
is NP-complete [64], there are many approximate solutions for it [30, 31, 65-67]. Ex-
cept for [67], these methods are designed for binary semi-supervised Ieaming. The main
drawback with [67] is its high computational cost due to the semi-deﬁnite programming
formulation.

Graph-based methods are usually transdactive learner that aims to predict the class
labels that are smooth on the graph of unlabeled examples. These algorithms differ in
how they deﬁne the smoothness of class labels over a graph. Example graph-based senti-
supervised learning approaches include Mincut [32], Harmonic function [33], local and

global consistency [34], and manifold regularization [35]. Similar to density based meth-

22

ods, most graph—based methods are mainly designed for binary classiﬁcation.
Semi-supervised boosting methods such as SSMBoost [68] and Assemble [48] are di-
rect extensions of Adaboost [39]. In [49], a local smoothness regularizer is introduced
to improve the reliability of semi-supervised boosting. Unlike the existing approaches
for semi-supervised boosting that solve 2-class problems, we focused on semi-supervised

boosting for multi-class classiﬁcation.

2.3 Multi-Class Semi-supervised Learning

2.3.1 Problem Deﬁnition

Let D = (.731, .., xN) denote the collection of N examples. Assume that the ﬁrst N1 exam-
ples are labeled by y1,..., le. Each y,- = (yz-1,..., yzm) E {0, +1}m is a binary vector that
indicates the assignment of 2:,- to m different classes, where. yf = +1 when 2:,- is assigned
to the kth class, and yf = 0, otherwise. Since we are dealing with a multi-class problem,
we have 2254 yf = 1, i.e., each example 2:,- is assigned to one and only one class. We
denote by g, = (3211,. . . ,y‘im) e Rm the predicted class labels (or conﬁdence) for exam-
ple mi, and by I? = (Qir,...,3)1-[,)T the predicted class labels for all the examples. Let
S = [Sm-M,-x N be the similarity matrix where Sm- = SJ},- 2 0 is the similarity between 2:,-
and xj. For the convenience of discussion, we set 3,3,; = 0' for any x,- G D, a convention
that is commonly used by many graph-based approaches. Our goal is to compute 37,- for the

unlabeled examples with the assistance of Similarity matrix S and Y = (y;- , . . . , 311-51)?

2.3.2 Assemble Algorithm

Assemble [48], a boosting algorithm for semi-supervsed classiﬁcation as depicted in Al-

gorithm 2, is construcuted based on the idea of pseodu-labels. At each boosting iteration,

 

xT is the transpose of matrix(vector) 3:.

23

 

Algorithm 2 Assemble: Adaptive Semi-Supervised Ensemble Algorithm
1: Input:
0 D = (x1, .., xN): The set of examples; the ﬁrst N, examples are labeled.
0 s: The number of sampled examples
2: Initialize F(i) = 0,i = 1, ..., |D|
3: Initialize w1(z') = l/Nl,’i = 1, ...,Nl and w1(i) = 0,1 = N1 + 1, ..., IDI
4: repeat
Set y,- = F($i),i = N1 + 1, ..., IDI
Find a multi—class classiﬁer ht that minimizes 6t = 2:131 wt (2)] (y,- 75 ft(a:,-))
Compute at = $11K?)
Compute F(a:,-) = F(a:,-) + (112(2),), 2': 1, .., |D|
wt(i) exp(atz{(yi#ft (1%)»

normalization factor and [(12) outputs 1 if a: is true, and 0 otherwise.
10: until reach the maximum number of iterations

 

999$???

Compute the new weighting wt+1(i) = where Zt is the

 

the boosting algorithm creates a new classiﬁer and redisributes the weights by emphasizing
more on the less-conﬁdent instances.

Beside Assemble, several other boosting algorithms have been proposed for senti-
supervised Ieaming based on the idea of using pseudo-labels [49, 68]. They essentially
operate like self-training where the class labels of unlabeled examples are updated itera-
tively: a classiﬁer trained by a small number of labeled examples is initially used to predict
the pseudo-labels for unlabeled examples; a new classiﬁer is then trained by both labeled
and pseudo-labeled examples; the processes of training classiﬁers and predicting pseudo—
labels are altered iteratively till stopping criterion is reached. The main drawback with this
approach is that it relies solely on the pseudo-labels predicted by the classiﬁers learned so
far when generating new classiﬁers. Given the possibility that pseudo-labels predicted in
the ﬁrst few steps of boosting could be inaccurate, the resulting new classiﬁers may also
be unreliable. This problem was addressed in [49] by the introduction of a local smooth-
ness regularizer. However, these approaches do not utilize the underlying properies of data
described as a manifold or cluster assumption. In what follows, we design a boosting algo-
rithm for the problem of multi-class semi-supervised classiﬁcation based on manifold and

cluster assumpption.

24

2.3.3 Design of Objective Function

The goal of semi-supervised Ieaming is to combine labeled and unlabeled examples to
improve the classiﬁcation performance. Therefore, we design an objective function that
consists of two terms: (a) Fu that measures the inconsistency between the predicted class
labels 17 of unlabeled examples and the similarity matrix S, and (b) F, that measures the
inconsistency between the predicted class labels I’ and true labels Y. Below we discuss
these two terms in detail.

Given two examples 3:,- and xj, we ﬁrst deﬁne the similarity ngj based on their pre-

dicted conﬁdence score 3),- and 373-:

m “l9 exp 3}]? m
Z3]- : Z: mexp(yz )Ak, m ( JlAk’ = Z b15719? = bTb- (2.1)

 

where bf = exp(37£°)/ (273:1 exp(3)z’-°’)) and b,- = (b}, . . . , bl”). Note that bf can be inter-
preted as the probability of assigning cc,- to class k, and Zz’fj, the cosine similarity between
b,- and bj, can be interpreted as the probability of assigning 3:,- and :cj to the same class.
We emphasize it is important to use bf, instead of exp(§/f), for computing Z3]. because the
normalization in bf allows us to enforce the requirement that each example is assigned to a
Single class, a key feature of multi-class learning.

Let Z“ = [2sz be the similarity matrix based on the predicted labels. To measure
the inconsistency between this similarity and the similarity matrix S, we deﬁne Fu as the

distance between the matrices Z“ and 3 using the Bregman matrix divergence [69], i.e.,
F. = 90(2“) — MS) — tr((Z’“ — S)TV<p(S)). (2.2)

where (,0 : RNXN -—> R is a convex matrix function. By choosing 90(X) =

25

293:1 Xi,j(10g Xi,j — 1) [69], Pa is written as

N

S,-
u = Z (Sijlog-Z——:’: +Zuj— —S,-,j) (2.3)
i,j=1

By assuming that 229% Zl‘j z 2211 N]? and log a: z a: — 1, where Nk is the number of
examples assigned to class 1:, we simplify the above expression as Fu~ ~ 2, j— _1 82-2 ,j/Zg‘
Since S 2 Jcould be viewed as a general Similarity measurement, we replace 32-2, j with Sid
and simplify Fu as
N 5,, N s,- ,
Fu z ”2:1 ~55 = ”:21 2k: —-—1———jbkbk (2.4)

Remark I We did not use <p(X) = 293% X .2 - [69], which will result in Pa =
2, j— 1_(ng Sid): This rs because the value of Z"j and S,- Jmay be on different scales
which makes it inappropriate to measure the difference between two matrices directly by
subtracting their corresponding elements.

Similarly, we deﬁne the similarity between a labeled example as,- and an unlabeled ex-

ample xj based on their class assignments as follows
m
= Z yfbf. (2.5)
k=1

and the label inconsistency measure Fl between the labeled and unlabeled examples as

follows:

£2=zzjt= zzzmﬂ. am

i=1j=Zi1,ji=1j=31

 

We can only consider the sub-matrices related to unlabeled examples when deﬁning F u.

26

Finally, we linearly combine F1 and Fu to form the objective function:
F = Fu + CF; (2.7)

where C weights the importance of F1. It is set to 10, 000 in our experiments to emphasize
Fl. Given the Objective function F in (2.7), our goal is to ﬁnd solution I? that minimizes

F.

2.3.4 Multi-Class Boosting Algorithm

In this section, we present a boosting algorithm to solve the optimization problem in (2.7).
Following the general boosting strategy, we incrementally add weak learners to obtain a
better classiﬁcation model. We denote by Hf the solution that is obtained for 37:“ so far,
and by hf E {0, 1} the prediction made by the incremental weak classiﬁer that needs to
be learned. Then, our goal is to ﬁnd him = 1,... ,N,k = 1,. . . ,m and a combination
weight a such that the new solution H 2’“ = Hf + 01h:c signiﬁcantly reduces the objective
function F in Equation 2.7. For the convenience of discussion, we use symbol ~ to denote
the quantities (e.g., F) associated with the new solution H.

The key challenge in optimizing F with respect to hf and a is that these two quantities
are coupled with each other and therefore the solution of one variable depends on the solu-
tion of the other. Our strategy to solve the optimization problem is to ﬁrst upper bound F
with a simple convex function in which the optimal solution for hf can be obtained without
knowing the solution to a. Given the solution to hf, we compute the optimal solution for
a. Below we give details for these two steps.

First, the following lemma allows us to decouple the interaction between a and hf

. . u l
wrthrn Zi, j and Zi, j

 

The algorithm is quite stable with different values Of C bigger than 1000 according to our experiment.

27

Lemma 1.

 

 

 

 

1 1+eﬁa+e—60 660—1(m k k k)
:— S + -——- (b- —- T- -)h- (2.8)
2,2,- 32332‘ 3353' 1.2:; . 2’] .
1 1+660+e—6C' e6“ 1 m k k
~ h 45' - (2-9)
1 I Z J 2.]
ZR] 3Zi9j 6 k=1
where
btbt In ,bt k
k_ w k_ k4_ﬁ
Ti,j - m b’F’b’?” ¢,,- _ 2 y,- b’?’ b’? (2.10)
k’=1 I J k’=1 J J

The proof of Lemma 1 can be found in Appendix A.l. Using Lemma 1, we derive an

upper bound for F in the following theorem.

Theorem 1.

 

N m
~ 1 + exp(6a) + exp(—6a) exp(6a) -— 1 k k k
FgF +—-§——§:§WN%+CA) QM)

 

3 i=1 k=1
where of and [if are deﬁned as follows:
N S- -(b’.° _ TIC.) NI
k m z 2,3 k 1 k
oz,- = Z 21‘. r 52' = § Zsi,j¢j,i (2.12)
j=1 m j=1

Theorem 1 can be directly veriﬁed by replacing l/Zz'fj and 1 / if j in (2.7) with (2.8)
and (2.9). Note that the bound in Theorem 1 is tight because by setting a = 0, we have
H = H and the inequality in Equation 2.11 is reduced to an equality. The key feature
of the bound in Equation 2.1] is that the optimal solution for hf can be obtained without

knowing the solution for a. This is summarized by the following theorem.

Theorem 2. The optimal solution for hf that minimizes the upper bound of F in Equa-

28

tion 2.11 is

,r I
hk 1 k = arg mink/(a? + C5?)

(2. l 3)
0 otherwise

It is straightforward to verify the result in Theorem 2.
We then proceed to ﬁnd solution for a given the solution for hf. The following lemma

provides a tighter bound for solving a in F.

Lemma 2.
F — F g (e20 —1)(Au + CA1) + (ta—2" — 1)(B., + 03,) (2.14)
where
A, = 27.521 2 hfbf (2.15)
i,j=—1Ziv.7 k=1
Nl N m

A, = -ZZS,-,j Z -‘Z—;:b§’hfl (2.16)

i=lj=1 k,k’=1j

Bu = 227,3”- 2,127sz (2.17)

’i,j=—1Zivjk=1
NI N

Bl = 5:25 ,1 :31. bk (2°18)
i=1j=1 J

The proof of Lemma 2 can be found in Appendix B. Using Lemma 2, Theorem 3 gives

the optimal solution for (1.

Theorem 3. The optimal a that minimizes the upper bound of F in Equation 2.14 is

a — l—Og (Au+CAl) (2.19)

 

Note that this tighter bound can not be used to derive hf

29

Remark 11: Notice that in order to have this boosting algorithm continue working, the

weak learner needs to produce models better than random guessing in the following sense.

 

 

Writing a in the form
1 1 —
a = — log ( 6) (2.20)
4 e
where
f _ Au + CAl
_ Au + 8,, + C(Al + Bu)

can be interpreted as some kind of classiﬁcation error. Hence, better than random guessing
implies e S 0.5 in this case.

Algorithm 3 summarizes the proposed boosting algorithm for multi-class semi-
supervised learning. In each iteration, MCSSB produces a weighting on the set of training
examples based on the evaluation of F. Notice that 10,-, the weight for the ith unlabeled
example, is guaranteed to be non-negative. This is because 221:1 of + 06f = 0 and
therefore to, = maxk(af -l- 06;“) _>_ 0; w,- is a measure of the failure of the algorithm on
example 50,-. Using the new weighting on the training examples, MCSSB learns a multi-
class model that minimizes the loss on the weighted training examples, by adopting the
sampling approach as described in the following: MCSSB samples 3 instances by replace-
ment, with probability of each sample proportional to its weight. 3 sampled instances are
passed to the weak learner to obtain a multi-class hypothesis. In our experiments, the num-
ber of sampled examples at each iteration is set as s = max(20, N / 5). After creating a
weak classiﬁer at this round, MCSSB adds it to the current classiﬁers to reduce the value
of the objective function.

For the experments, we ran the algorithm with different numbers of iterations and ﬁnd
that both the objective function and the classiﬁcation accuracy remains essentially the same
after 50 iterations. We, therefore, set the number of iterations to be 50 to save the compu-

tational cost.

30

 

Algorithm 3 MCSSB: Multi-Class Semi-Supervised Boosting Algorithm
1: Input:
0 D: The set of examples; the ﬁrst N1 examples are labeled.
0 s: the number of sampled examples from (N — N1) unlabeled examples
0 T: the maximum number of iterations
2: Set F(z') = 0,2' = 1, .., |D|
3: repeat
4: Compute of and ﬁll“ for every example as given in Equation 2.12.
5: Assign each unlabeled example 2:, to class k; = arg minjc(oz£c + 0511“) and weight

k’I‘ k’l‘

6: Sample .9 examples using a distribution that is proportional to w,-
. Train a multi-class classiﬁer h(:r) using the 3 samples examples
8: Predict hf for all examples using h(:1:), and compute or using Equation 4.14. Exit
the loop if or g 0.
8: H(:1:) (— H(:z:) + ah(:r)
9: until reach the maximum number of iterations

 

 

Theorem 4 Shows that the proposed boosting algorithm reduces the objective function

F exponentially. The proof of this theorem is provided in Appendix A.3.

Theorem 4. The objective function after T iterations, denoted by F T, is bounded as fol-

lows:

 

T (,/At +oAt— ,/Bt +013“)2
FT 3 Foexp —Z " [Ft—1 “ l (2.21)

t=1

where Au, A), Bu and B1 are deﬁned in Lemma 2.

2.4 Experiments

In this section, we present our empirical study on the classiﬁcation data sets that were
described in Chapter 7. We refer to the proposed semi-supervised multi-class boosting
algorithm as MCSSB. In this study, we aim to Show that (1) MCSSB can improve the per-
formance of a given multi-class classiﬁer with unlabeled examples, (2) MCSSB is more

effective than the existing semi-supervised boosting algorithms, and (3) MCSSB is robust

31

to the model parameters and the number of labeled examples. It is important to note that it is
not our intention to show that the proposed senri-supervised multi-class boosting algorithm
always outperforms other semi-supervised Ieaming algorithms. Instead, our objective is
to demonstrate that the proposed semi-supervised boosting algorithm is able to effectively
improve the accuracy of different supervised multi-class learning algorithms using the un-
labeled examples. Hence, the empirical study is focused on a comparison with the existing
semi-supervised boosting algorithms, rather than a wide range of semi-supervised learning

algorithms.

2.4.1 Experimental Setup

For each classiﬁcation data sets, described Section 1.6.1, we split the examples into 5
partitions, with one partition used for training and the others used for testing. In each ex-
periment, we used a small percentage (between 2 to 10 percent) of training instances as
labeled examples and the remainding instances as unlabeled examples. We applied the pro-
posed algorithms and the baselines on the training examples to create a model and applied
it on the test examples and computed the accuracy on the test examples. We repeated each
experiment 10 times and reported the average.

We compare the proposed semi-supervised boosting algorithm to ASSEMBLE, a state-
of-the—art semi-supervised boosting algorithm. The main reason for this choice was be-
cause Assemble utlizes boosting technique and can exploit an existing supervised learning
technique. This makes the comparision fair and easy because it enables us to compare MC-
SSB and Assemble with base classiﬁeres that have different quality. Also notice that As-
semble is a powerful semi—supervised Ieaming technique that was the best semi-supervised
algorithm among 34 participants in NIPSS2001 workshop competition "Unlabeled Data for
Supervised Learning" [48].

 

Unlike the general setup introduced in 1.6.1, we used the test set for for mnist data set because of the
huge size of the training set in mnist and the memory problem.

32

A Gaussian kernel is used as the measure for similarity in the standard MCSSB algotihm
with kernel width set to be 15% of the range of the distance between examples for all the
experiments, as suggested in [70]. To verify the importance of using the Similarity measure
in the semi-supervised boosting algorithm and direct formulation of multi-class problem,
we use two other baselines: MCSSB-Uniform that uses similar similarity values for every
pair of examples (i.e. Sij = 1, 2', j = 1, .., N) that can be considered MCSSB with a
bad similarity measure, and MCSSB-Absolute that considers absolute similairy between
an example and itself (i.e. Si,- = 1,2' = 1, .., N) and absolute dissimilarity between two
different examples (i.e. Sij = 1, i, j = 1, .., N & i 75 j). MCSSB-Absolute can be
considered MCSSB that only exploits the advantage of using a direct formulation of the
multi-class problem.

We use decision tree with only two level of nodes, as the base classiﬁer for all the
methods in the standard setting . The combination paremeter C is set to 104 in all experi-
ments. To study the robustness of the proposed methods, we further investigate the effect
of the depth Of decision tree and combination parameter C on the performance of different

methods in Sections 2.4.4 and 2.4.3 respectively.

2.4.2 Evaluation of Classiﬁcation Performance

Figure 2.1 shows the result of different algorithms when the amount of labeled examples
is changed from 2% to 10%. First, notice that MCSSB signiﬁcantly improves the accuracy
of decision tree for 5 out of 7 data sets. For data set ’Nursery’, MCSSB performs worse
than the base classiﬁer and for data set ’Letter’, the result of MCSSB is not much different
than the base clasiﬁer. However, for both these cases, MCSSB-Absolute performs quite

good that indicates the direct formulation of multi-class problem is useful and the bad

 

i.e. 0.15 x (dmax — dmin)’ where dmin and dmax are minimum and maximum distance between
examples

Notice we also used neural network as another base classiﬁer to evaluate the performance of our algo-
rithm. Refer to [50] for the results on several benchmark datasets

33

performance is due to the utilization of a bad similartiy matrix. Note that for several data
sets, the improvement made by the MCSSB is dramatic. For instance, the classiﬁcation
accuracy of decision tree is improved from 33% to 48% for data set ’Pendigits’, and from
24% to 43% for data set ’Optdigits’ when there is 2% labeled examples; the classiﬁcation
accuracy of decision tree is improved from 13% to 17% for data set ’Isolet’, and from 46%
to 49% for data set ’Protein’ when there is 8% labeled examples.

Second, when compared to ASSEMBLE, we found that the proposed algorithm sig-
niﬁcantly outperforms ASSEMBLE for all the data sets. More interestingly, Assemble
reduces the performance of the base classiﬁer for most data sets that indicates the usage Of
pseodu-labelss can produce misleading results. The key differences between MCSSB and
ASSEMBLE is that MCSSB is not only specially designed for multi-class classiﬁcation, it
does not solely rely on the pseudo-labels obtained in the iterations of boosting algorithm.
Thus, the success of MCSSB indicates the importance of designing semi-supervised learn-
ing algorithms for multi-class problems.

Third, to verify that the outstanding performance of MCSSB is related to the direct
formulation of multi-class problem and the usage of similarity measure in the boosting
algorithm, we examine the results of MCSSB-Uniform and MCSSB-Absolute. Because
MCSSB-Uniform does not utilize an appropriate similarity measure, it performs very
poorry that emphasizes our effective approach in utilizing the similarity measure in the
boosting algorithm. On the other hand, MCSSB-Absolute is the second best method after
MCSSB. Because MCSSB-Absolute does not utilize any similary measure among exam-
ples, we believe that this superior performance is due to our approach in direct formulation
of multi-class problem. It is interesting to note that the performance of MCSSB-Absolute
on the ’Nursery’ and ’Letter’ data sets is better than other methods including MCSSB that
indicated the sensitivity of the proposed method to the choice of similarity method.

And ﬁnally, notice that as the number of labeled examples increases, the performance

of different methods improves. However MCSSB keeps its superiority for most of the cases

34

 

 

 

 

 

 

 

 

 

 

MNIST NURSERY
50 —Decision Stump 100 .
+Assemble

40 +MCSSB
5. +MCSSB_Uniform 5
g -e- MCSSB_Absolute g

0

< £— 2

20

1 A w . 1

. 2 0764 0.06 0.08 0.1 .02 0.04 0.06 0.03 0.1
Percentage of labeled examples Percentage of labeled examples
LETTER

 

PROTEIN

 

   

 

Accuracy

 

  

 

   

 

 

 

 

0.06 0.08 0.1 .02

0.04
Percentage of labeled examples
PENDIGITS

 

 

0.04 0.08
Percentage of labeled examples

0.06 0.1

OPTDIGITS

 

 

 

   

 

 

004 0.06 0.08 0.1

Percentage of labeled examples
ISOLET

  
   

 

0.04 0.06 0.08
Percentage of labeled examples

V

0.1

 

f

 

 

 

A A
V V

 

A
V V

 

 

06.02

0.04 0.06

0.08 0.1

Percentage of labeled examples

Figure 2.1: The error rates Of different methods with different amount of labeled examples.

35

when compared to both the base classiﬁer and the ASSEMBLE algorithm. We also observe
that overall ASSEMBLE is unable to make improvement over the base classiﬁer regardless
of the number of labeled examples. These results indicate the challenge in developing
boosting algorithms for semi-supervised multi-class Ieaming. Compared to ASSEMBLE
that relies on the classiﬁcation conﬁdence to decide the pseudo labels for unlabeled ex-
amples, MCSSB is more reliable since it exploits both the classiﬁcation conﬁdence and

similarities among examples when determining the pseudo labels.

2.4.3 Sensitivity to the Combination Parameter C

Figure 2.2 shows the performance of MCSSB when the combination parameter C changes
from 1 to 1010. It is clear that for large values of C, MCSSB is very stable. Notice the
improvement of MCSSB on the base classiﬁer for dataset ’Protein’ is very marginal for
some values of C. However if you look at Figure 2.1, you will notice that the result of
MCSSB for larger amount of labeled data (as large as 4%) is signiﬁcant for this data set
and not sensitive to the small changes of parameters C. We conclude that MCSSB is very

robust to the choice of parameter C.

2.4.4 Sensitivity to Base Classiﬁer

In this section, we focus on examining the sensitivity of MCSSB to the complexity of base
classiﬁers. This will allow us to understand the behavior of the proposed semi-supervised
boosting algorithm for both weak classiﬁers and strong classiﬁers. To this end, we use de-
cision tree with varying number of levels as the base classiﬁer. We used decision tree with
only one node (decision stump) up to fully-grown decision trees and plot the performance
result of different methods. Figure 2.3 shows the classiﬁcation accuracy of Tree, ASSEM-
BLE and MCSSB when we vary the number of levels in decision tree. Notice that in each
case, the maximum number of levels in the plot for each data set is set to the fully grown

tree for that data set. It is not surprising that overall the classiﬁcation accuracy is improved

36

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

   

 

   

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

MNIST NURSERY
40 . 100 .
5" 30' //W—. 5" BMW
(0 (U
5 6
0 O
< ZWZG—EP-a—a—e—a—e—e—HI <2 60W
10 r 40 a
10° 105 1010 10° 105 1010
C C
PROTEIN LETTER
45 W 15 '
640- ' : 5:10—
a a
8 8
2 35' : 2 5: —Decision Stump
‘3 a a E a a E E' a a 5' +Assemble
3O 0 +MCSSL
10° 105 10‘° 10° 105 101°
C C
PENDIGITS OPTDIGITS
60 . 5o .
5‘40- M ,>,~
a a
< 2 <
O .
100 185 1010
ISOLET
20 '
§
<
0 .
10° 105 10’°
C

Figure 2.2: The error rates of MCSSB with different C( 2% of labeled).

37

with increasing number of levels in decision tree for most data sets. We also Observe that
MCSSB is more effective than ASSEMBLE for decision trees with different complexity
and regardless of quality of the base classiﬁer, ASSEBLE is not able to improve the per-
formance of the supervised classiﬁer by utilizing unlabeled examples. Notice that for some
data sets, e.g. ’Protein’ data set, the performance decreases as the depth of tree increases.
This is because, unlike other data sets, ’Protein’ has only tinee classes and large tree can

lead to overﬁtting.

38

MNIST

 

 

——Tree

 

 

 

 

 

 

2 4
Depth of the Tree
PROTEIN

 

  

 

   
   

 

 

 

 

 

 

 

 

 

   

 

 

 

 

 

    
  

 

   

 

 

 

 

 

 

 

    
 
   
     
 
   

NURSERY

80 -

70-
6 60 .
E 1
8 l
0 50
< —Tree

40 +Assemble

+ MCSSB
3C0 1 2 3
Depth of the Tree
LETTER
25 . -
--Tree
20 +Assemble

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

   
 

> 45, +MCSSB
o a, .
E 1 E
6 61
< 40 --Tree <
+Assemble
+MCSSB
35o 2 I: 6 0o i 2 6 4 5
Depth of the Tree Depth of the Tree
PENDIGITS OPTDIGITS
60 . - 50 r
-—Tree
50’ 4o» +Assemble
+MCSSB
64° 63 .
E E
a 30- :3
8 8 2 .
< 20 —Tree <
1 +Assemble 1
+MCSSB
G 1 A r r G
0 1 2 3 4 5 0 1 3
Depth of the Tree Depth of the Tree
ISOLET
20 - .
—Tree
15 +Assemble f
+MCSSB
5‘
E 10
3
8
<
5 .

 

 

 

 

1 2
Depth of the Tree

1:“igure 2.3: The error rates of MCSSB with decision tree with different depth as the weak
learner. 2% of training examples are labeled in all the experiments.

39

Chapter 3

Optimizing NDCG Measure by Boosting

Learning to rank is a relatively new ﬁeld in machine Ieaming. It aims to learn a ranking
function from training examples with relevancy judgements. The learning to rank algo-
rithms are often evaluated using information retrieval measures, such as Normalized Dis-
counted Cumulative Gain (NDCG) [14] and Mean Average Precision (MAP) [13]. Until
recently, most learning to rank algorithms were not able to directly optimize a loss function
related to the IR evaluation measures, such as NDCG and MAP. The main difﬁculty in di-
rect optimization of these measures is that they are non-continuous and non-differentiable.
In this chapter, we discuss how boosting can be applied to optimize Normalized Discounted
Cumulative Gain (NDCG) which is the most commonly used multi-level evaluation mea-
Sure for Ieaming to rank. We start with a detailed description of AdaRank [42], one of the
ﬁrst algorithms designed to directly maximize IR measures. We further develop a learning
t0 rank algorithm, termed NDCG_Boost, for optimizing NDCG metric. Unlike AdaRank
that weights all the documents related to each query equally when optimizing the NDCG
measure, NDCG_Boost weights individual documents differently even if they are all re-
lated to the same query, leading to more effective Optimization of the NDCG measure. In
Order to deal with the non-smooth nature of the NDCG measure, in the NDCG_Boost al-

gOIithm, we propose to optimize the expectation of NDCG over the distribution induced

40

by a ranking function. We then present a relaxation strategy that approximates the average
of NDCG value, and an optimization strategy to make the computation efﬁcient. Extensive
experiments Show that the proposed algorithm outperforms state-of-the-art ranking algo-

rithms on several benchmark data sets.

3.1 Introduction

Learning to rank has attracted many machine Ieaming researchers in the last decade because
of its growing importance in the areas like information retrieval (IR) and recommender

systems. Three types of learning to rank algorithms can be found in the literature.

0 Pointwise approaches: AS the simplest form, these approaches [15, 16] treat rank-
ing as a classiﬁcation or regression problem that learns a ranking function in order to
ﬁt the relevance judgments for given retrieved documents [16, 17]. However classi-
ﬁcation and regression may not be the best for the task of ranking. This is because
(i) classiﬁcation problems are usually associated with unordered class labels where
there is an intrinsic order among the levels of relevance judgments provided by the
user, and (ii) the target variables in regression problems are assumed to be numerical

values while the relevance judgments are only ordinary variables.

0 Pairwise approaches: These approaches are motivated by the fact that the rele-
vancy scores in ranking are relative to each other. This group considers the pairs of
documents as independent variables and learns a classiﬁcation (regression) model to
correctly order the training pairs [18—23], namely document da is ranked above db if
the relevance score of da is larger than db. One major problem with the pairwise ap-
proaches is that they assume pairs of documents are independent random variables,

which is often violated in real world applications. .

41

o Listwise approaches: The listwise approaches are motivated by this observation that
most evaluation metrics of information retrieval measure the ranking quality for indi-
vidual queries, not documents. These approaches treat the ranking list of documents
for every query as a training instance [13, 24—29], either by direct optimization of an
information retrieval evaluation measure [13, 25, 28, 29] or by optimizing a listwise
loss function [24, 26, 27]. Empirical studies have shown that the listwise approaches
are more effective than both pointwise and listwise approaches because they utilize

the query-document group structure which is a unique and useful characteristic in

ranking.

The main difﬁculty in optimizing the listwise loss functions is that they are non-
continuous and non-differentiable. This is because these loss functions measure the re-
trieval performance based on the ranking list of documents induced by the ranking function,
and therefore their dependence on ranking functions is implicit. Given that classiﬁcation
is a well-studied subject in machine Ieaming, the research question is whether it is pos-
sible to design a boosting algorithm that utilizes a classiﬁcation algorithm to optimize an
information retrieval measure such as NDCG. The easiest way to design such a boosting
algorithm is the approach taken by Xu et al. in the design of AdaRank [42]. In each trial of
a boosting algorithm, AdaRank re-weights the queries based on their NDCG values (com-
Pared to AdaBoost that re-weights the examples based on their conﬁdence in prediction).
As we see in more details in Section 3.3.2, AdaRank treats all the documents related to

each query equally when trying to improve the NDCG metric, which could signiﬁcantly
liInits the choice of ranking functions for optimizing the NDCG metric. In this chapter, we
introduce a better boosting algorithm for optimizing NDCG metric that weights documents
differently even if they are associated with the same query. In each iteration, the boosting
algorithm provides a weighting as well as binary class assignments for given documents;

the weak learner constructs a binary classiﬁer from the weighted documents that are labeled

\
It is important to distinguish the binary class assignment from the relevance judgments for documents

42

by the boosting algorithm.

3.2 Related Work

We focus on reviewing the listwise approaches that are closely related to the theme of this
chapter. The listwise approaches can be classiﬁed into two categories. The ﬁrst group
of approaches directly optimizes the IR evaluation metrics. Most IR evaluation metrics,
however, depend on the sorted order of documents, and are non-convex in the target rank-
ing function. To avoid the computational difﬁculty, these approaches either approximate
the metrics with some convex functions or deploy methods (e.g., genetic algorithm [71])
for non-convex optimization. In [25], the authors introduced LambdaRank that addresses
the difﬁculty in optimizing IR metrics by deﬁning a virtual gradient on each document af-
ter the sorting. While [25] provided a simple test to determine if there exists an implicit
cost function for the virtual gradient, theoretical justiﬁcation for the relation between the
implicit cost function and the IR evaluation metric is incomplete. This may partially ex-
plain why LambdaRank performs very poor when compared to MCRank [16], a simple
adjustment of classiﬁcation for ranking (a pointwise approach). The authors of MCRank
paper even claimed that a boosting model for regression produces better results than Lamb-
daRank. Volkovs and Zemel [29] proposed optimizing the expectation of IR measures to
Overcome the sorting problem, similar to the approach taken in this paper. However they
use monte carlo sampling to address the intractable task of computing the expectation in
the permutation space which could be a bad approximation for the queries with large num-
ber of documents. AdaRank [42], as was described earlier in this chapter, uses boosting to
Optimize NDCG, similar to our optimization strategy. However they deploy heuristics to
elubed the IR evaluation metrics in computing the weights of queries and the importance
of weak tankers; i.e. it uses NDCG value of each query in the current iteration as the

Weight for that query in constructing the weak ranker (the documents of each query have

43

similar weight). This is unlike our approach that the contribution of each single document
to the ﬁnal NDCG score is considered. Moreover, unlike our method, the convergence of
AdaRank is conditional and not guaranteed. Sun et al. [72] reduced the ranking, as mea-
sured by NDCG, to pairwise classiﬁcation and applied alternating optimization strategy to
address the sorting problem by ﬁxing the rank position in getting the derivative. SVM-
MAP [13] relaxes the MAP metric by incorporating it into the constrains of SVM. Since
SVM-MAP is designed to optimize MAP, it only considers the binary relevancy and cannot
be applied to the data sets that have more than two levels of relevance judgements.

The second group of listwise algorithms deﬁnes a listwise loss function as an indirect
way to optimize the IR evaluation metrics. RankCosine [24] uses cosine similarity between
the ranking list and the ground truth as a query level loss function. ListNet [26] adopts the
KL divergence for loss function by deﬁning a probabilistic distribution in the space of
permutation for Ieaming to rank. FRank [22] uses a new loss function called ﬁdelity loss
on the probability framework introduced in ListNet. ListMLE [27] employs the likelihood
loss as the surrogate for the IR evaluation metrics. The main problem with this group of
approaches is that the connection between the listwise loss function and the targeted IR
evaluation metric is unclear, and therefore optimizing the listwise loss function may not

necessme result in the optimization of the IR metrics.

3.3 Optimizing NDCG Measure

3.3.1 Notation

Assume that we have a collection of n queries for training, denoted by Q = {q1, . . . ,qn}.
F0r each query qk, we have a collection of mk documents Dk = {dz-“,7: = 1, . . . ,mk},
Whose relevance to qk is given by a vector rk = (rf, . . . ,rfnk) E ka. We denote by
F (d, q) the ranking function that takes a document-query pair (d, q) and outputs a real

number score, and by jg“ the rank of document (if within the collection ’Dk for query qk.

44

 

The NDCG value for ranking function F (d, q) is then computed as following:

mk 2i—1

won 12;;—

(3.1)
log( (1 + 32')
where Z k is the normalization factor [14]. NDCG is usually truncated at a particular rank

level (e.g. the ﬁrst 10 retrieved documents) to emphasize the importance of the ﬁrst re-

trieved documents.

3.3.2 AdaRank Algorithm

The easiest way to design a boosting algorithm for Optimizing a given IR evaluation mea-
sure is what AdaRank algorithm [42] performs. AdaRank uses an exponential loss function
similar to AdaBoost. However, unlike the loss function of AdaBoost which is constructed
based on the classiﬁcation margin, AdaRank utilizes information retrieval measures such

as NDCG to construct the exponential loss. To optimize NDCG, for example, AdaRank

uses the following exponential loss function:

Zexp(-£(qr, F ))

k=1

Where £(qk, F) is the NDCG value for query h when ranking the documents for query qk
by function F. The steps of AdaRank are given in Algorithm 4. In each iteration, AdaRank
ﬁI‘lds a weak tanker ft that maximizes quantity m at Step 4, i.e. NDCG weighted by p.
Then, it computes the combination weight for ft and adds it to the current set of classiﬁers
in Steps 5 and 6 respectively. The authors of AdaRank paper [42] suggest using the ranking
features (e.g. BM25) as the weak ranker. However, a (multi-class) classiﬁer can also be
uSed as the weak tanker. To construct a classiﬁer that maximizes qt, AdaRank distributes
the weight 12,500) to all documents of query k equally, and constructs a classiﬁer based on

the documents that are sampled according to the weights. To redistributes the weights to

45

 

Algorithm 4 AdaRank Algorithm

1: Input:
0 Q = {q1, . . . ,q"}: The set of queries

0 Dk = {(df,rf),i = 1,...,mk}: The set of documents and their relevancy

scores for query qk.
2: Initialize p1(qk) = 1/n,k = 1, ...,n
3: repeat
Find ft by maximizing weighted NDCG; i.e. 77t = 2:le pt(qk)£(qk, F)

Compute at = 211101233?)

4

5

6: Compute F(df) = 2L1 alfl(df), k = 1, ..,n, 2': 1, ..,mk
- - _ exp(-£(Qk,F))

7 Compute the new werghtrng pt+1(qk) — 22:1 exp(—£(Qk,F))

8:

until reach the maximum number of iterations

instanced, AdaRank increases the weights of difﬁcult queries (e.g. those that have small
NDCG) and decreases the weights of easy queries (e.g. those that have large NDCG) at
Step 7.

As it is Obvious from the Steps of AdaRank algorithm, it gives the same weights to the
documents of each query, leading to a suboptimal performance. However, since a pointwise
weak learner (multi-class classiﬁer) is often utilized in a boosting algorithm to maximize
NDCG, it is advantageous to allow every document to contribute differently to the ﬁnal
NDCG value. Moreover, although NDCG works in query level, not all documents have
Similar contribution in improving the NDCG value at each stage of the algorithm. These
Observations motivated us to develop NDCG_Boost algorithm that considers the contri-

bution of every single document in the iterations of the boosting algorithm to maximize

NDCG.

3-3.3 A Probabilistic Framework

C)ne of the main challenges faced by optimizing the NDCG metric deﬁned in Equation
(3- 1) is that the dependence of document ranks (i.e., jf) on the ranking function F(d, q)
is not explicitly expressed, which makes it computationally challenging. To address this

Problem, we consider the expectation of £( Q, F) over all the possible rankings induced by

46

the ranking function F(d, q), i e

£(Q F) — <1L-1_> (3 2)
’ k=1 k log(1+j,-'°) F
2"”IC -1

:2 E P” 'F")log(1+vr'~(>)

"1
22‘

all-4

MS I'M:

 

ﬁll-d

where Smk stands for the group of permutations of m k documents, and 7rk is an instance of
permutation (or ranking). Notation 7rk(z') stands for the rank position of the ith document
by Wk - To this end, we ﬁrst utilize the result in the following lemma to approximate the

expectation of 1/ log(1 + 7rk(z')) by the expectation of 7r,“(i).

Lemma 3. For any distribution Pr(7r|F , q), the inequality C(Q,F) _>_ 7:1(Q,F) holds

where

 

2 i — k1
’H(Q, F): i:— HZ]: 1}: 1(110g ’°(i))p ) (3.3)

ProOﬁ The proof follows from the fact that (a) 1 / :1: is a convex function when :1: > 0 and
therefore (1 / log(1 + 2)) 2 1/(log(1 + 2)); (b) log(1 + :c) is a concave function, and
therefore (log(1 + x)) S log(1 + (27)). Combining these two factors together, we have the

reSUIt stated in the lemma. [:1

Given H(Q, F) provides a lower bound for [3(Q, F), in order to maximize [2(6), F),
we Could alternatively maximize 77(6), F), which is substantially simpler than £(Q, F). In

the next step of simpliﬁcation, we rewrite 7rk(i) as

74(2) = 1 + 2 104(2) > «kg» (3.4)

47

where I (x) outputs 1 when x is true and zero otherwise. Hence, (nk(i)) is written as

mk mic
Me» = 1+ Zuwka) > «km» -—— 1+ 2 12mm) > «W» (3.5)
j=1 j=1

As a result, to optimize ’FMQ, F), we only need to deﬁne Pr(7rk(i) > wk(j)), i.e., the
marginal distribution for document d? to be ranked before document dz? . In the next section,

we will discuss how to deﬁne a probability model for Pr(7rk|F, qk), and derive pairwise

ranking probability Pr(7rk(z') > 7rk(j)) from distribution Pr(7rk|F, qk).

3.3.4 Objective Function

We model Pr(7rk|F, qk) as follows

 

 

mk
1
131-(«km f) = k exp 2 Z (F(d§,qk) — F(d§,q"))
Z(F’q ) i=1j-«k(j)>vrk(z')
mic
k - k k
= Z(F, qk) exp (;(mk — 27r (Z)+1)F(dz-,q )) (3.6)

where Z (F, qk ) is the partition function that ensures the sum of probability is one. Equa-
tion (3.6) models each pair (df, (if) of the ranking list 7r’c by the factor exp(F(df, qk) —
F(d§ , qk)) if dz? is ranked before d? (i.e., nk(d£°) < 7rk(d$-°)) and vice versa. This mod-
cling choice is consistent with the idea of ranking the documents with largest scores ﬁrst;
intuitively, the more documents in a permutation are in the decreasing order of score, the
bigger the probability of the permutation is. Using Equation (3.6) for Pr(1rk|F, qk), we
have 'H(Q, F) expressed in terms of ranking function F. By maximizing 72(Q, F) over F .
we Could ﬁnd the optimal solution for ranking function F.

AS indicated by Equation (3.5), we only need to compute the marginal distribution
Pr(7“k(i) > nk(j)). To approximate Pr(1rk(i) > 7rk(j)), we divide the group of permu-

tation Smk into two sets: 050,1.) = {Wklﬂkm > “ICU” and G’b“(i,j) = {Wklwkm <

48

 

7rk (j ) }. Notice that there is a one-to-one mapping between these two sets; namely for any
ranking wk 6 05(22, j), we could create a corresponding ranking 7rk 6 G§(z', j) by switch-
ing the rankings of document df and d? and vice versa. The following lemma allows us to
bound the marginal distribution Pr(7rk(i) > 7rk(j)). The proof of this lemma is provided

in Appendix A.5.

Lemma 4. IfF(d’-“, qk) > F(dg-c, qk), we have

2

1

1+ exp [2(F<d:-°, qk> — F(dga gm]

Prokm > m» s

 

(3.7)

This lemma indicates that we could approximate Pr(7rk(i) > nk( j )) by a simple logis-
tic model. The idea of using logistic model for Pr(7rk (2') > wk(j)) is not new in learning
to rank [20, 22]; however it has been taken for granted and no justiﬁcation has been pro-
vided in using it for Ieaming to rank. Using the logistic model approximation introduced

in Lemma 4, we now have (14%)) written as
m
k 1

#1 1+ exp [2(F(d§,qk> — F<d§,qk>>]

1+

 

(3.8)

22

We»

To simplify our notation, we deﬁne Fik = 2F (dz-c, qk), and rewrite the above expression as

 

mic mic
k' - r7rkz' nk' z 1
<vr<z>>—1+§P< ()> (3)) 1+j2=311+exp(Fik_FJk)

Using the above approximation for (wk(i)), we have ”R in Equation (3.3) written as

n 7.21: _
W2, F) m if, 71- 2 ——2—1—. (3.9)

49

 

where

 

mk
10'7”)
3.10
AE;1+eXp(Fk—Ff) ( )

We deﬁne the following proposition to further simplify the objective function:

Proposition 1.

1 > 1 __ Af
log(2 + A?) _ 108(2) 2[log(2)]2

 

The proof is due to the Taylor expansion of convex function 1/log(2 + x), a: > —1
around a: = 0 noting that A? > 0 (the proof of convexity of 1/log(1 + x) is given in
Lemma 3) and is provided in Appendix A.6. By plugging the result of this proposition

to the objective function in Equation (3.9), the new objective is to minimize the following

quantity:

M(Q, F) 2:712—1162Q’k 2' —1)A (3.11)

The Objective function in Equation (3.11) is explicitly related to F via term Af. In the
next section, we aim to derive an algorithm that learns an effective ranking function by
efﬁCiently minimizing M. It is also important to note that although M is no longer a
Iigol‘ous lower bound for the original objective function 5, our empirical study shows that
this approximation is very effective in identifying the appropriate ranking function from

the training data.

3‘3-5 Algorithm

To minimize M(Q, F) in Equation (3.11), we employ the boosting strategy [38] that iter-

ativel y updates the solution for F. Let Fik denote the value obtained so far for document

50

(if. T 0 improve NDCG, following the idea of Adaboost, we restrict the new ranking value

for document (if, denoted by if, is updated as to the following form:

~

I: k k

2

where a > O is the combination weight and fz-k = f (dz-ﬁqk) E {0,1} is a binary value.
Note that in the above, we assume the ranking function F (d, q) is updated iteratively with
an addition of binary classiﬁcation function f (d, g), which leads to efﬁcient computation
as well as effective exploitation of the existing algorithms for data classiﬁcation. . To
construct a lower bound for M (Q, F), we ﬁrst handle the expression [1+exp(Fz-k —ij)] —1,

summarized by the following proposition.

Proposition 2.

 

1 1 k k k
~ ~ 3 + . - ex a - — . _1 3.13
1 + exp(Fik — Fjl“) 1 + eXP(F2-k _ Ff) 72.][ p( (f; fz )) j ( )

 

where

ex PIC—F’-It
p( ' 9) (3.14)

(1+ exp(Fz-k — F;‘))2

k _
7232' —

 

The proof of this proposition can be found in Appendix A.4. This proposition separates
the term related to Pi"c from that related to off in Equation (3.11), and shows how the
new Weak ranker (i.e., the binary classiﬁcation function f (d, q)) will affect the current
ranki ng function F (d, q). Using the above proposition, we can derive the upper bound for

M (Theorem 5) as well as a closed form solution for a given the solution for F (Theorem 6).

Theorem 5. Given the solution for binary classiﬁer fz-d, the optimal a that minimizes the

51

objective function in Equation (3.1!) is

rk
m _
1 Zk=1Z.-,jk=1—z—2i9£Cj191(f"<f-k)
a=§log 1: (3-15)

m _
22:1 22' ,jk= 1 2_sz—102kj1(ff> fik)

 

where 0:9]- 2723-10 79 2).

Remark: Notice that in order to have this boosting algorithm continue the iterations, the
weak learner needs to produce models better than random guessing in the following sense.

Writing a in the following form

 

a = llog(1—6) (3.16)

where

k
27:- 123112—3510: I(ff>fk)

 

5:
ZZ=1E$"=1%ll-9,’f(1(f">>f,-’“)+I(ff<f,-k))

, we mean 6 S 0.5 by random guessing.

Theorem 6.

M(Q,F)SM(Q.F)+7(a)+ e———"p(3§)‘lzsz :SZ—‘i’ag.

k=1 i=1j=—1Zk
where 7(a) is only afunction of a with 7(0) = 0.

The proofs of these theorems are provided in Appendix A6 and Appendix A.7 respec-
tively. Note that the bound provided by Theorem 6 is tight because by setting a = O, the
inequality reduces to equality resulting M(Q, F) = M(Q, F). The importance of this
theorem is that the Optimal solution for ff can be found without knowing the solution for

a.

52

 

Algorithm 5 NDCG_Boost: A Boosting Algorithm for Maxinrizing NDCG
1; Initialize F(df) = 0 for all documents

 

2: repeat
3: Compute 6’1- j — —'yz ij(j 75 i) for all document pairs of each query. 7:” is given in
,3
Eq (3 14)-
4: Compute the weight for each document as
k
mk 2erk _ 27‘ jg
k=Z———— (3.17)

5: Assign each document the following class label yf = sign(wz’-°).
Train a classiﬁer f (x) : Rd —> {0,1} that maximizes the following quantity

nmk

Z Z lwticlﬂdbyf (3.18)

1:21 i=1

7: Predict f,- for all documents in {Dk,i = 1, . . . , n}
8: Compute the combination weight a as provided in Equation (3.15).
9: Update the ranking function as Ff +— Ff + eff.

10: until reach the maximum number of iterations

 

Algorithm 5 summarizes the boosting algorithm in minimizing the objective function
in Equation (3.11). In each iteration, it computes 62- for every pair of documents of query
1:. 19k can be considered a measure of how close the rank position of documents (1" and

k

k . . .
dj are when they are sorted by function F. The algorithms computes wi , a weight for

each document, which is the summary information of document dz? when its position and

relevancy score compared to all other documents of the same query. wf

can be positive
or negative. A positive wf indicates that the ranking position of df induced by the current
ranking function F is less than its true rank position, and a negative weight wz’? shows that
ranking position of (if induced by the current F is greater than its true rank position. The
magnitude of wf shows how much the corresponding document is misplaced in the ranking.

In other words, it shows the importance of correct ranking position of document (If in terms

of the value of NDCG. Using these information, the algorithm ﬁnds out the most difﬁcult

 

Notice that we use F (df) instead of F ((1? , qk ) to simplify the notation in the algorithm.

53

documents and the relevancy direction of their importance at the current iteration. Using
these information, NDCG_Boost maximizes ﬁt as given by Equation (3.18) which can be
considered as some sort of classiﬁcation accuracy. It uses sampling strategy in order to
maximize m because most binary classiﬁers do not support the weighted training set; that
is, it ﬁrst samples the documents according to wa | and then constructs a binary classiﬁer
with the sampled documents. After learning the new binary model at Step 6, the algorithm
evaluates its success in improving the value of NDCG in Step 7 and 8 and adds it to the
current set of binary models (the mixed strategy over binary models) at Step 9.

The following theorem shows that the proposed boosting algorithm reduces the objec-

tive function M exponentially.

Theorem 7. The objective ﬁmction after T iterations, denoted by MT, is bounded as

follows:

 

NEW f)?

Mt— 1
where (11 and a2 are deﬁned as follows.

251' r]?

n mk _ n ml; 2
=222 ’f,I(f,’-°<f-’“), ”=2:
k=1,3= -1 k=1i,j=1

 

 

1 k 1(fJ’.c > 5.13.19)

The proof is provided in Appendix A.8.

3.4 Experiments

To study the performance of NDCG_Boost we use the latest version (version 3.0) of
LETOR package provided by Microsoft Research Asia [56], which has been described
in Chapter 1. Besides a number of benchmark data data, LETOR package also includes

multiple state-of-the-art baselines and evaluation tools for research on Ieaming to rank.

54

 

3.4.1 Experimental setup

A number of state-of-the-art Ieaming to rank algorithms are provided in the LETOR pack-
age, including some of the most well-known leamin g to rank algorithms from each category
(pointwise, pairwise and listwise). These baselines will be used to study the performance
of NDCG_Boost. Here is the list of these baselines (the details can be found in the LETOR
web page):

Regression: This is a pointwise approach that applies a linear regression to a ranking
problem. It is used as a reference point.

RankSVM: RankSVM is a pairwise approach that applies Support Vector Machine [18]
to the ranking problem.

FRank: FRank is a pairwise approach. It uses a probability model similar to RankNet [20]
for the relative rank position of two documents, with a novel loss function called
Fidelity loss function [22]. TSai et al. [22] showed that FRank performs signiﬁcantly
better than RankNet.

ListNet: ListNet is a listwise learning to rank algorithm [26]. It uses cross-entropy loss as
its listwise loss function.

AdaRank_NDCG: This is a listwise boosting algorithm that incorporates NDCG in com-
puting the weights for both queries and the combination of weak ranking hypothe-
ses [42].

SVM_MAP: SVM_MAP is a support vector machine with MAP measure as the target
objective function. It is a listwise approach [13].

While the validation set is used in ﬁnding the best set of parameters in the baselines in

LETOR, it is not used for NDCG_Boost in our experiments. For NDCG_Boost, we set the

maximum number of iteration to 100 and use decision stump as the weak ranker.

55

3.4.2 Results

Figure 3.1 provides the the average results of ﬁve folds for different Ieaming to rank al-
gorithms in terms of NDCG @ each of the ﬁrst 10 truncation level on the LETOR data
sets. Notice that the performance of algorithms in comparison varies from one data set to
another; however NDCG_Boost performs almost always the best. We would like to point
out a few statistics; On OHSUMED data set, NDCG_Boost performs 0.50 at N DCG@3, a
4% increase in performance, compared to FRANK, the second best algorithm. On TD2003
data set, this value for NDCG_Boost is 0.375 that shows a 10% increase, compared with
RankSVM (0.34), the second best method. On HP2004 data set, NDCG_Boost performs
0.80 at N DCG@3, compared to 0.75 of SVM_MAP, the second best method, which in-
dicates a 6% increase. Moreover, among all the methods in comparison, NDCG_Boost
appears to be the most stable method across all the data sets. For example, FRank, which
performs well in OHSUMED and TD2004 data sets, yields a poor performance on TD2003,
HP2003 and HP 2004. Similarly, AdaRank_NDCG achieves a decent performance on
OHSUMED data set, but fails to deliver accurate ranking results on TD2003, HP2003 and
NP2003. In fact, both AdaRank_NDCG and FRank perform even worse than the sim-
ple Regression approach on TD2003, which further indicates their instability. As another
example, ListNet and RankSVM, which perform well on TD2003 are not competitive to
NDCG_boost on OHSUMED and TD2004 data sets.

 

NDCG is commonly measured at the ﬁrst few retrieved documents to emphasize their importance.

56

OHSUMED

T

 

1

     
 
  
 
   
 
 
  
 

 

-+— Regression
----- FRank

-e— ListNet
RankSVM
AdaRankNDCG

 

 

 

 

—~— SVM_MAP
+ Noce_\eoosr

 

 

 

 

TDZOO3

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

0'4 a it é a 10 0'2 e 1i 6
@n

NP2004

 

 

 

 

 

 

 

 

 

 

 

i Al 6 e 10 z 4 e a 10
Figure 3.1: The experimental results in terms of NDCG for Letor 3.0 data sets

57

Chapter 4

Ranking Reﬁnement by Boosting

In this chapter, we consider the problem of improving the accuracy of an existing ranking
function with a small set of labeled instances. We are particularly interested in learning
a better ranking function using two complementary sources of information, ranking in-
formation given by the existing ranking function (i.e., the base ranker) and that obtained
from user feedback. We call this problem ranking reﬁnement. Ranking reﬁnement is
very important in information retrieval where feedbacks are gradually collected. The key
challenge in combining the two sources of information arises from the fact that the ranking
information presented by the base ranker tends to be imperfect and the ranking information
obtained from users’ feedbacks tends to be noisy. We develop an objective function based
on the pairwise approach for this problem and utilize the boosting technique to optimize
it. Our empirical study shows that the proposed boosting algorithm is effective for rank-
ing reﬁnement, and furthermore it signiﬁcantly outperforms the baseline algorithms that

incorporate the outputs from the base ranker as an additional feature.

4.1 Introduction

Most research in learning to rank is conducted in the supervised fashion, in which a ranking

function is learned from a given set of training instances. The drawback with the supervised

58

approach is that they tend to fail when the number of training instances is small. In several
real-world applications, in addition to the labeled training instances, a base ranker is avail-
able that can be used to rank the documents. Then, the research question is how to exploit
the outputs from the base tanker when Ieaming a ranking function from a small number of
labeled instances. We refer to this problem as Ranking Reﬁnement to distinguish it from
supervised learning to rank. Below we show two examples for the application of ranking

reﬁnement:

Relevance feedback In information retrieval, documents are often ordered by a predeﬁned
relevance ranking function, such as BM25 [51] and Language Model for IR [73],
that assesses the relevancy of documents to a given query. Relevance feedback tech-
niques are proposed to improve the retrieval accuracy by allowing users to provide
relevance judgments for the ﬁrst a few retrieved documents. The research question
here is how to improve the accuracy of relevance feedback by combining the rank-
ing information from the user feedback as well as the ranking information from the
predeﬁned ranking function. We can cast the relevance feedback problem as a rank
reﬁnement problem by viewing the relevance ranking function as the base ranker and

the documents that are judged by the user as training instances.

Recommender system The goal of a recommender system is to rank the items according
to the interest of an active user (i.e., the test user). Usually, a few rated items are
provided to indicate the preference of the active user. However, on the other hand,
we can rank the items for the active user based on the rating information of the other
users using the collaborative ﬁltering techniques [74]. The research question here is
how to improve the ranking performance by leveraging the two types of information,
i.e. the items rated by the active user and the ranking list generated by the collabora-
tive ﬁltering technique. We cast this problem into theframework of rank reﬁnement
by viewing the collaborative ﬁltering algorithm as the base ranker and the rated items

as training instances.

59

.,‘-_l"

Furthermore, any online learning of ranking functions can be viewed as a ranking re-
ﬁnement problem in that the ranking function is updated iteratively with new training ex-
amples collected on the ﬂy.

A straightforward approach toward ranking reﬁnement is to view the scores of the base
ranker as an additional feature, and learn a ranking function from a limited number of
training examples over the augmented features. As will be shown in the experiments, this
is not the best approach for exploiting the information hidden in the base ranking function.
We believe that the most valuable information behind the base ranker is not its scores but
the ranked list of documents it produces. We therefore view the base ranker and the labeled
instances as two complementary sources of information, each produces a different loss to
evaluate the performance of the new ranking function. The key challenge in combining
these two sources of information is that the ranked list generated by the base ranker is
imperfect while the labeled instances tend to be noisy. There are two research questions in

this problem to address:

Balancing between two sources of relevancy information: The ﬁrst question is how to
balance between two sources of relevancy information; i.e. how to evaluate the effec-
tiveness of a given a ranking function that orders the documents for each query. This
question is directly related to the design of the loss function. The common approach
in machine learning to balance between two sources of losses is to linearly combine
them with a constant. Since the reliability of each source is unknown, ﬁnding a good
balance parameter is critical in this case. We propose the multiplication of the losses
related to two sources of information as an effective and parameter free approach to

combine them and show that it satisﬁes the Parieto Optimality condition [75].

Learning: Given the multiplicative approach for balancing between two different sources

(i.e., the base ranker and the training examples), the second research question is how

 

Notice the application of cross-validation is not possible here since no reliable source of information, i.e.
the correct ordering of documents, is available in this case

60

to learn a ranking function by effectively combining these sources. Our approach to

answer this question is based on the boosting framework.

Our empirical study with relevance feedback and recommender system show that the
boosting algorithm with multiplicative loss function is effective for ranking reﬁnement, and
signiﬁcantly outperforms the baseline algorithms that incorporate the outputs from the base

ranker as an additional feature for the documents.

4.2 Related Work

Most learning to rank algorithms are designed for the setting of supervised Ieaming, in
which a ranking function is learned from labeled instances. However, the problem of semi-
supervised ranking, the topic of this chapter, has not been addressed in the literature, to
the best of our knowledge. The algorithm developed in this chapter belongs to pairwise
approach to learning to rank and is closely related to relevance feedback. Therefore, we
describe a short bibliography of these two.

Three well-known pairwise approaches to Ieaming to rank are Ranking-SVM [11, 76],
RankBoost [19], and RankNet [20]. Ranking-SVM minimizes the number of incorrectly
ordered pairs within the maximum margin framework. Several variants [21, 77] are de-
veloped to further enhance the performance of Ranking-SVM. RankBoost learns a ranking
model based on the same consideration, but by means of Boosting. RankNet [20] is a neural
network based approach that uses cross entropy as its loss function.

The relevance feedback techniques [78] are developed to improve the accuracy of the
existing retrieval algorithms. There are two types of relevance feedback. The ﬁrst type,
termed user relevance feedback, enhances the retrieval accuracy by collecting the user rel-
evance judgments for the documents that are ranked on the top of the list. As pointed out

in the introduction section, the user relevance feedback problem can be treated as a prob-

 

For the list of different approaches to Ieaming to rank, refer to Chapter 3

61

lem of ranking reﬁnement. As we showed in the empirical study, the proposed algorithm
for ranking reﬁnement signiﬁcantly outperforms the standard relevance feedback algorithm
(i.e., the Rocchio algorithm) over several datasets. The second type of relevance feedback,
often termed pseudo relevance feedback, does not explicitly collect the user relevance judg-
ments. Instead, it treats the top ranked documents as relevant to the given query, and the
documents ranked at the bottom as irrelevant. These pseudo relevance judgments are used
to improve the existing ranking function. It is well known in information retrieval that
pseudo relevance feedback may result in degradation of retrieval performance given the
high probability of errors in pseudo relevance judgments [78]. This is similar to the noise

of training instances in ranking reﬁnement.

4.3 Ranking Reﬁnement

4.3.1 Problem Deﬁnition

Let D = (x1, x2, . . . ,xn) denote the set of instances to be ordered, where each instance
x, 6 Rd is a vector of d dimensions. Let G : Rd —+ IR denote the base ranking function
(base ranker), and g,- = G (x,) denote the ranking score assigned to x,- by the base ranking
function G. Instance x,- is ranked before xj if g,- > 93-. To make our problem general, we
assume the label information collected from user feedback is presented as a set of ordered
pairs, denoted by 0 = {(xik >- xjk)|k = 1,. . . ,m} where each pair x,- >- xj indicates
that instance x,- is ranked before xj. The goal of ranking reﬁnement is to learn a ranking
function F : Rd —> R by exploiting both the labeled pairs in 0 and the ranking information
given by G.

 

This is because any labeled instances can be converted into ordered pairs while the converse is not true.

62

4.3.2 Encoding Ranking Information

The ﬁrst important question for ranking reﬁnement is how to encode the ranking informa-
tion provided by the base ranking function G. A straightforward approach is to use the
ranking scores computed by G as an additional feature, and apply the existing algorithms,
such as RankBoost [19] and Ranking-SVM [76], to learn a ranking function from the la-

beled instances. The drawback of this approach is twofold:

0 First, this approach only utilizes the ranking scores of the labeled instances. The
ranking information generated by the base ranker for the unlabeled instances is com-
pletely ignored by this approach. However, base ranker is a rich source of infor—
mation for the unlabeled instances that can be exploited for a better ranking. This
is particularly important when the number of labeled instances collected from the

users’s feedback is considerably small.

0 Second, we believe that the ranking orders generated by the base ranking function
is substantially more reliable than the numerical values of the ranking scores. Sim-
ilar observation is found in the study of meta search whose goal is to combine the
retrieval results of multiple search engines to create a better ranking list [79]. Em-
pirical studies [79] showed that the meta search algorithms based on the document

ranks often outperform the algorithms that directly use the relevance scores.

To address the above problems, we encode the order information generated by the base
ranking function G with matrix W E [0, 1]"x". Each Wz’o’ in the matrix represents the

probability of ranking x,- before xj and is deﬁned as follows

8Xp()\9i)
eXth) + exp(/\9j)

 

Wz',j (4.1)

In the above, Wm- is deﬁned by a softmax function and the parameter A _>_ 0 represents the

conﬁdence of the base ranking function. To see the effect of A, we consider two extreme

63

C38681

0 A = 0. In this case, we have WM 2 0.5, which indicates that the ordering informa-

tion generated by the base ranker is completely ignored.

0 A = 00. In this case, we have

1 92' > gj
Wi,j = 0.5 92' = gj (4-2)
0 92' < gj

Thus, W is almost a binary matrix, implying that we completely trust the ranking list

generated by the base ranker.

In our experiment, we set A to be inverse to the standard deviation of the ranking scores for
the ﬁrst 10 retrieved documents.
Similarly, we encode the ordering information inside the set 0 with matrix T as follows:

1—n/2 (x,- >-x-) eo
T1,,- = J (4.3)

77/ 2 otherwise

where parameter 7] E [0, 1]. TM represents the probability of ranking ranking x,- before
xj in the training data. The parameter 1] reﬂects the error rate of training data, and is
particularly useful when the labeled instances are derived from implicit user feedback that

is usually noisy. In our experiment, we set 77 = 1 / 2.

4.3.3 Objective Function

The goal of ranking reﬁnement is to learn a ranking function F : Rd —> IR from matrix W
and T that produces a more accurate ranking list than the base ranking function G. In par-

ticular, the optimal ranking function F should be consistent with the ranking information

in W and T. To this end, we measure the ranking errors of F with respect to both W and

F, i.e.,
errw = Z Wi,jI(Fj _>_ Fi) (4-4)

errt = Z 7;,jI(Fj 25“,) (4.5)

In the above, we introduce F,- = F ()9) and the indicator function [(33) that outputs 1
when the input boolean variable x is true and zero otherwise. There are two problems with

directly using the ranking errors errw and errt as the objective function:

0 First, both error functions are non-smooth functions since the indicator function I (2:)
is non-smooth. It is well-known that optimizing a non-smooth function is computa-

tionally more challenging than optimizing a smooth one [80].

0 Second, with two objectives at hand, the problem is essentially a multi-objective
optinrization problem [75]. Thus, another important question is how to combine

multiple objectives into one single objective.

In what follows, we will address these two questions separately.

Relaxation with Exponential Functions. To address the problem with non-smooth ob-
jective functions, we follow the idea of boosting by replacing the indicator function
I (a: 2 y) With an exponential function exp(:c — y). The resulting new objective functions

are:

n
e/FTw = Z Wi,j exp(Fj — Fi) (4.6)
i3j=1
n
6m = Z Ti,jeXP(Fj—Fi) (4.7)
iij=1

65

Note that since exp(a: — y) 2 1(m 2 y), by minimizing the errors Eﬁ‘w and 67?}, we are
effective in reducing the original ranking errors errw and errt. Another advantage of using
e’ﬁw and e’Frt comes from the theoretic result of AdaBoost [81], i.e., by rrrinirrrizing the
exponential loss function, the resulting classiﬁer will not only reduce the training errors but
also maximize the classiﬁcation margin. The enlarged classiﬁcation margin is the key to
guarantee a low generalization error for testing instances [81].

Remark: It is interesting to examine the effect of the smoothing parameter 77 on the
ranking error e’Frt. By substituting the expression (4.3) for TiJ' in (4.7), we have e’Frt

expressed as follows:

e’Frt=(l—77) Z exp(Fj-Fi)
(xi>-xj)60

+3 2 [exp(Fi — Fj) + exp(Fj - FD]

z‘,j=1

+77 n

z (1"?) 2 +5 2120”“

(szc jexp)€0 i=1

17
= (1— 77) Z €Xp(Fj - Ft) + mllFllg (4-3)
(xi-Ht j)60
where “Fug. is a norm of vector F = (F1, . . . ,Fn) deﬁned as follows:

uan = FT(nl—ee)F

where 1 is the identity matrix and e is a vector of all ones. In the second step, the approxi-
mation follows the Taylor expansion of the exponential function. The second term in (4.8),
i.e., nllFllg / 2( 1 — 7)), plays a similar role as used by Support Vector Machines (SVM) [3].

In this sense, the parameter 7] essentially regularizes the ranking error err/7t.

66

Combining TWO Objectives. The problem of optimizing multiple objectives is usually
called multi-objective optimization problem [75]. In a multi-objective problem, there is
usually no single solution that can satisfy each objective to its fullest. In this problem, we
are looking for a solution at which no objective can be further reduced without increasing
the value of other objective functions, a condition known as Pareto optimality. The easiest
approach to combine several objective functions that results in a Pareto optimal solution is
to linearly combine them [75]. In our case, there are two error functions, each related to
a different source of relevancy information. A given ranking function can satisfy only one
source of relevancy information for each pair of documents in case of a conﬂict between
two sources; i.e. decreasing the error related to one source can increase the error related to

another. The linear combination leads to the following optimization problem, i.e.,

11

La = ”Ye/ﬁne + 6771: = Z (”twig + Tm) €XP(Fj — Ft) (4.9)

t',j=1

where parameter 7 is used to weight error e771”. We refer to the approach based on the
above objective function as “Linear Ranking Reﬁnement”, or ILRR, for short. The main
drawback with the linear combination approach is how to decide the value for 7. In our
experiments, we will show that different 7 could result in very different performance in
information retrieval. Since there is no easy way to ﬁnd the best tradeoff, we consider the

combination of the two errors by their products, i.e.,

n n
= Z Tin” eXP(Fj — Ft) I 2: W233“ eXP(Fj — F,) (4.10)

We refer to the approach as “Multiplicative Ranking Reﬁnement”, or MRR for short.
Now, the question is whether the resulting solution is Pareto efﬁcient [75]. More for-

mally, a solution F = (F1, . . . F") is Pareto optimal for the objectives e’Frw and 67:71; if

67

there does not exist any other solution F’ = (Ff , . . . , F,’,) that is either

1. C/ﬁ‘w(F’) < ﬁw(F) and 6’7“?”qu S ﬁdF), 01'

2. 5mm) 3 6mm) and 63mm) < new).

In other words, if F is Pareto efﬁcient, it guarantees that no solution is able to further
reduce the two objectives simultaneously than F. Regarding the Pareto efﬁciency when

rrrinirnizing Lp in (4.10), we have the following theorem:

Theorem 8. The optimal solution F = (F 1, . . . , Fn) found by minimizing the objective

ﬁmction Lp is Pareto efﬁcient.

The proof of this theorem can be found in Appendix A9. The main advantage of using
Lp rather than La is that it does not need a weight parameter. This will be revealed in our
empirical studies in that minimization of Lp usually signiﬁcantly outperforms minimization
of La even when the optimal combination weight 7 is used for La.

In order to compare the properties of the two different approaches for combination,
we examine their ﬁrst order derivatives. Let 5 denote the parameters used by the ranking
function F (x). Then, the ﬁrst order derivatives of La and Lp with respect to E are given as

follows:

n
VtLa = 2 (T2,) + Twig) exp<Fj — Ft><ng(xj) — ng(x.->)
2',J'=1

VtLp = L, Z(at,j+b.,j>exp<Fj—F.>(ng(xj>—vtth.-)>
i,j=1

where

am = n 2,) em 3 ') (4.11)
22,321 WM @1ij - F2")
Tm €XP(Fj - Ft“)

b. .
2’] 22j=1TiJeXP(Fj — Ft)

 

(4.12)

 

68

Note that both derivative shares similar structures. The key difference between V5 La and
Vng is that in Vng, am- and bz‘,j are used to weight the contribution from W and T
for instance pair (xi,xj) when computing the derivative. This is in contrast to VgLa
where the weights for instance pair (xi, x_,-) are 7W“ exp(Fj - F2) and TM exp(Fj - F2).
The main advantage of using ai,j and bi,j is that they are normalized, i.e., 23H “M =
223:1 bz',j = 1, and therefore the contributions from W and T are naturally balanced

when calculating the derivative.

4.3.4 Boosting Algorithm for Ranking Reﬁnement

In this section, we will consider algorithms for Ieaming the ranking function F (x) by
respectively minimizing the objective function Lp. The objective function La is similar
to the objective function used by Rank-Boost [19] except that a weight (Tid- + 7W“) is
used for each instance pair. We thus can simply modify the Rank-Boost algorithm to learn
the optimal ranking function F(x). Hence, in the sequel, we will focus on the boosting
algorithm for minimizing Lp.

To learn the optimal ranking function F (x) by minimizing LP, we follow the greedy
approach of boosting algorithms. Since the training examples are the labeled instance
pairs, a straightforward boosting approach is to iteratively update the weights of instance
pairs and train a new ranking function for the given weighted pairs. This is the strategy
employed in the RankBoost algorithm [19]. However, since the number of instance pairs is
0(n2), this approach could be computationally expensive when the number of instance 71
is large.

To address the above problem, we present a new boosting algorithm that converts the
weights of instance pairs into weights for individual instances. The key idea behind the
new boosting algorithm is to derive an upper bound for the target objective that decouples
functions for pairs of instances into functions for individual instances. It is this decoupling

that makes it possible to infer weights for individual instances from weights for instance

69

 

Algorithm 6 Boosting algorithm for minirrrizing Lp

 

1: Input: Wijj and Tio' as two encrypted sources of information

2: repeat

3: Compute 7233' for each instance pair as 'yz-J- = am- + bz',j where am- and bz',j are
deﬁned in (4.11) and (4.12).

Compute the weight for each instance as w,- = 231:1 7231' — 7]",-

Assign each instance the class label y,- = sign(w,-).

6: Train a classiﬁer f (x) : Rd —-) {0,1} that maximizes the following quantity

50.4?

m = Z lw’ilf(x’i)yi (4-13)
i=1

7: Predict fz- for all instances in D
27.1._1 72' '5(fi,1)5(f',0)
: . . . h = 1 1 Z,]—— 1.? J
8 Compute combination werg ts a 2 0g 23:1 7M 5(fj11)5(f¢,0)
f (x,) 6 (3:, y) outputs 1 if a: = y and zero otherwise.
9: Update the ranking function as F (x) <— F (x) + a f (x)
10: until reach the maximum number of iterations

 

where f,- =

 

pairs. In addition, the new boosting algorithm is able to derive an appropriate binary class
label for each instance using the computed weights. Using both the weights and the class
assignments of instances, we can train a binary classiﬁer f : Rd —> {0, +1} and update the
overall ranking function by F’ (x) = F (x) + a f (x) where a is the combination weight.
Note that by converting a ranking problem into a series of binary classiﬁcation problems,
the new boosting algorithm avoids the high computational cost arising from the large num-
ber of instance pairs.

Algorithm 6 summarizes the overall procedures for the proposed boosting algorithm
minimizing Lp. In each iteration, this algorithm computes '72-,3- for every pair of instances
that measures the uncertainty of ranking instance x,- ahead of xj. Then, it adds up the
uncertainties of comparing instance x.) to all other instances, which results the calculation
of the weight for instance x.) as w,- = 231:1 72-J- - 7332'— The bigger the magnitude of 21),,
the large the uncertainty of F is in ranking xi. So, the algorithm redistributes weights pro-
portional to the uncertainty to at each iteration. It is important to note that to,- can be both

positive and negative. In particular, to, > 0 indicates that the algorithm did not succeed

70

x10

 

_\
O

‘9

0?

 

Objective Function Value

 

 

 

ON

Iteration
Figure 4.1: Reduction of the objective function Lp using the OHSUMED Data Set

in ranking x,- on the top of the ranked list; and to, S 0 indicates the opposite. Hence, the
boosting algorithm derives the class label y, for x,- based on the sign of 10,-: a positive class
for placing instances on the top of the ranked list, and a negative class for placing instances
on the bottom of the list. To summarize the above steps, using the magnitude and sign of
211,-, the algorithm chooses a weighting and a labeling direction for instances. Given a set
of weighted binary class examples, the weak learner trains a classiﬁer that maximizes m
in (4.13), which can be interpreted as a sort of classiﬁcation accuracy. Since most binary
classiﬁers are unable to take weights into consideration, the boosting algorithm divides
the training procedure into two steps: in the ﬁrst step, it samples 3 instances according
to the distribution that is proportional to the weights |w,-|; It then trains a binary classiﬁer
f : Rd -—> {0, +1} using the sampled instances . After Ieaming the new binary classiﬁer
in Step 6, the algorithm evaluates its success in reducing the loss (the value of the objec-
tive function) in Step 7 and 8 and adds it with a proportional weight to the current list of

classiﬁers at Step 9.

 

We manually set .9 = max(20,n/5) in our empirical study. A similar strategy is employed in the
AdaBoost algorithm [19] and its effectiveness has been veriﬁed in empirical studies.

71

Before providing the justiﬁcation for Algorithm 6, notice that in order to have this
algorithm continue iterating, the weak learner need to do better than random guessing in

the following sense. Writing a in the following form

a = élog(1—€) (4.14)

6

 

where
22j=17i,j6(fja 1)6(fi:0)
Eider 7i,j(6(fja1)5(fi10) + 5(fi,1)5(fj,0))

applies that by better than random guessing we mean 6 < 0.5.

 

5:

In the remaining of this section, we will provide justiﬁcation to the prOposed boosting

iterations in Algorithm 6. The main result is summarized in Theorem 9.

Theorem 9. Let f k(x) denote the binary classiﬁcation function obtained in the kth itera-

tion, and 72",] denote '7,- j learned tn that iteration. The objective function after T iterations,

denoted by L5, is bounded as follows:

T

n n
Lg" g 2 Tm- 2 Wm- exp (,/—— ([12 (4.15)

i,j=1 z',j=l k=1

where

m. = Z)nzijétf’ttx.)name-)0)

iJ=1

n
Vt = Z 412,6(fk(xt).0)6(f’°(xj>, 1)
i,j=1
The above theorem essentially shows that by using the proposed algorithm, the objec-
tive function Lp will be reduced exponentially.
The key to proving Theorem 9 is to establish the relationship between the objective

function Lp of two consecutive iterations. This is because by upper bounding the log-ratio

72

between Lp of two consecutive iterations, i.e.,

r, 2 log Lg, — 10g Lg-l, (4.16)
we will have
T T Lt T
_ 0 P 0
LP _ LPH 7: g Lpexp Zr, (4.17)
t=1 LP t=1

For the convenience of presentation, in the following, we only consider two consecutive
iterations without specifying the index of iteration. Instead, we denote the quantities of the
current iteration by symbol"to differentiate the quantities of the previous iteration. In order

to establish an upper bound for the log ratio, we ﬁrst introduce the following lemma

Lemma 5. Assume F (x) = F (x) +01 f (x) where F (x) and F (x) are the ranking functions
of two consecutive iterations, respectively. f : Rd —> {0, 1} is a binary classiﬁer and oz is

the combination weight. We have the following inequality hold for any F, f, and (I:

~

L n
108 —p S -2 + Z (ai,j + bi,j) eXp(a(fj - ft)) (4.18)
L” 231:1

where am- and b233- are deﬁned (4.11) and (4.12), respectively.

The proof of Lemma A.10 can be found in Appendix A.10. Using Lerrrrna A.10, we
present the proof of Theorem 9 in Appendix A.11.

Finally, we can show the relationship between the objective function Lp and the quan-
tity m (in (4.13)) that is used to guide the training of binary classiﬁers in iterations. This

result is summarized in the following theorem:

Theorem 10. Let ”We denote the value of the quantity in Equation (4.13) that is maximized
by the binary classiﬁer f k(x) learned in the tth iteration. Assume that lit 2 0 for each

iteration. Then, the objective ﬁmction after T iterations, denoted by L5, is bounded as

73

follows:

12 n T
L; S 2 Tm 2 Wm 8X1) - Z 77: (4.19)

z',j=1 i,j=1 t=1
The proof of the above theorem can be found in Appendix A. 12. Theorem 10 provides
a theoretical justiﬁcation for Algorithm 6. In particular, by maximizing m, Algorithm 6
effectively reduces the objective function Lp. This is further conﬁrmed by our empirical
study. Figure 4.1 shows an example of reduction in the objective function Lp. We clearly
see that the objective function is reduced exponentially and receives the largest reduction

during the ﬁrst few iterations.

4.4 Experiments

In this section, we evaluate the proposed algorithm for ranking reﬁnement by two tasks,
i.e., user relevance feedback and recommender system. The objectives of our experiments
are: (1) to compare the proposed algorithm for ranking reﬁnement to the existing ranking
algorithms, (2) to examine the performance of the proposed algorithm for ranking reﬁne-
ment with different numbers of training instances, (3) to examine the effect of different
base rankers on the performance of the proposed algorithm, and (4) to examine the time
efﬁciency of the proposed algorithm for ranking reﬁnement. We use the Letor data set
for Relevance Feedback experiment and Movies data set for the Recommender System

experiment. The description of these data sets can be found in Section 1.6.2.

4.4.1 Experimental Setup

Algorithms. To examine the effectiveness of the proposed algorithm for ranking reﬁne-

ment, we compared the following ranking algorithms:

Base Ranker: It is the base ranker used in the ranking reﬁnement.

74

Rocchio: This algorithm extends the standard Rocchio algorithm [82] for user relevance
feedback that creates a new query vector by linearly combining the original query
vector and vectors of feedback documents. Given the initial query Q0, the relevant
documents (R1, R2, ..., Rnl) and non-relevant documents (81, 5'2, ..., Snz), the new

query according to Rocchio is computed as:

Q: Q0+a251 —e2S—;L (4.20)

Note, in our case, that each document is not represented by a vector of word fre-
quency, but a vector of features that are computed based on its match to the query.
Hence, we don’t have Q0, i.e., the representation vector for query itself. We therefore
set Q0 to be a vector of all zeros. We used the inner product between the new query
and documents as the scores to rank the documents. W‘e vary a and 5 from 1 to 10

and choose the best setting.

SVM: This implements the Ranking-SVM algorithm using the SVM light package. Note
that it is commonly believed that Rank-Boost performs equally well as Ranking
SVM. The experimental results provided in the LETOR collection also conﬁrm this.
Hence, we only compare the proposal algorithm with Ranking-SVM, but not Rank-

Boost.

MRR: This is the Multiplicative Ranking Reﬁnement algorithm that minimizes Lp
in (4.10).

LRR: This is the Linear Ranking Reﬁnement algorithm that minimizes La in (4.9). Since
the performance of LRR depends on the parameter 7, we run LRR with 100 different
values from 0.1 to +10 and choose the best and worst performance. We referred them

to as LRR-Worst and LRR-Best, respectively.

75

For a fair comparison, the output from the base ranker is used as an extra feature when
using SVM (i.e., Ranking-SVM) and Rocchio. Notice that we do not compare the perfor-
mance of the proposed method with different baselines provided in LETOR because the
experiments in LETOR are obtained under a different setting. We will discuss the experi-
mental setup used in this chapter in Section 4.4.1. Similar to Chapter 3, we used NDCG to

evaluate the performance of different methods. NDCG is described in Section 3.3.1.

Evaluation Protocol. For each LETOR data set, we choose the best ranking feature
compared to other features and use it as the base tanker. The best ranker for datasets
OHSUMED, TD2003, TD2004, HP2003, HP2004, NP2003, NP2004 are feature number
11, 46, 46, 46, 46, 6, and 6. We followed the common practice of user relevance feedback
by collecting the relevance judgments for the ﬁrst 20 retrieved documents; i.e. we sort all
documents of one query based on the base ranker and simulate the user feedbacks by using
the true relevancy of the ﬁrst 20 documents. These user relevance judgments served as la-
beled instances in ranking reﬁnement. Notice that it is well known that relevance feedback
depends on the quality of feedback documents. If the underlying base ranker does a poor
job in identifying the relevant documents, it is very likely that most of the feedback docu-
ments are irrelevant, leading to a poor performance of the proposed algorithm. We come
back to this problem in Section 4.4.3.

For the experiment with recommender system, the base ranker was created by apply-
ing a collaborative ﬁltering algorithm, more speciﬁcally, the Personality Diagnosis algo-
rithm [74], to the user rating data. In particular, 20 users were randomly selected as the
training users, and the remaining 923 users were used for testing. For each test user, 10
rated movies were randomly selected and were used by the collaborative ﬁltering algo-
rithm to identify the 20 training users who share the common interests with the test user.
Note that we did not compare the proposed algorithm to other information ﬁltering algo-

rithms because the focus of this study is to examine the effectiveness and the generality of

76

the proposed approach for ranking reﬁnement.

4.4.2 Results for Relevance Feedback

Figure 4.2 show the ranking results of different algorithms in terms of NDCG for the ﬁrst
25 ranked documents. First, by comparing the performance of the two variants of ranking
reﬁnement, we observed that the Multiplicative Ranking Reﬁnement (MRR) algorithm is
more effective than the Linear Ranking Reﬁnement (LLR) algorithm. Indeed, MRR per-
forms signiﬁcantly better than the best case of LRR (i.e., LRR-best) for OHSUMED and
TD2004 datasets. The key difference between MR and LR is that MRR minimizes the
product of the two error functions while LRR minimizes the weighted sum. We believe it is
the normalization scheme brought by MR (see equations in (4.11) and (4.12)) that makes
it performing better than LRR. The performance of MR is more appreciated given it does
not have a single parameter that needs to be adjusted manually.

Second, comparing to the other three baseline algorithms, i.e., the base ranker, Roc-
chio, Ranking-SVM, we observed that MRR signiﬁcantly outperforms the base ranker and
Rocchio algorithms in all the cases; it outperforms Ranking-SVM in the ﬁrst three data
sets and yields similar performance for the remaining four data sets. We also note that the
improvement made by the ranking reﬁnement is more signiﬁcant for the ﬁrst a few rank-
ing positions than the other ranking positions, a very desirable property for web search in
which users usually only pay attention to the ﬁrst a few retrieved results. We thus conclude
that Multiplicative Ranking Reﬁnement is more effective than the baseline algorithms for
user relevance feedback in information retrieval.

Finally, notice that the ranking algorithms show different trend of NDCG on different
data sets. Particularly, NDCG is decreasing for the ﬁrst three data sets and increasing for
the remaining data sets. The increasing or decreasing trend is directly dependent on the
number of relevant documents and the quality of ranking. If a ranking algorithm performs

a good job in retrieving the relevant documents on the top of the list, it is generally expected

77

to have a decreasing trend. This is because it is more likely to see irrelevant documents in
the list as we retrieve more documents. For the last four data sets, there is only one relevant
document for each query. Even a good ranker is not able to retrieve the only relevant
document on the top of the list and that is why you see NDCG increases until it retrieves

the relevant documents of all the queries and then remains constant.

4.4.3 Effect of Base Ranker

W examine how the proposed algorithm response to different base tankers, in particular
the base rankers with relatively poor retrieval performance. We tested MRR algorithm with
three different base rankers that are selected automatically based on their ranking perfor-
mance. These three base rankers are the worst, the best and a medium quality base ranker
selected from the list of features for each data set. Figure 4.3 shows how MRR algorithm
performs when the selected base rankers are used. In each sub-ﬁgure, different base rankers
are distinguished with a number in the legend that shows the feature number they use. The
result indicates that the quality of base rankers has a direct impact on the performance
of the MRR algorithm. However, the proposed algorithm is able to signiﬁcantly improve
the performance for a base ranker that can retrieve some relevant documents. When the
base ranker performs extremely poor (like in TD2003, HP2003, HP2004, NP2003, and
NP2004), all the retrieved documents are are judged as irrelevant by user and no infor-
mation is available from either sources. Therefore, no improvement can be made by the
proposed algorithm for extremely poor base rankers. It is also interesting to observe that for
data set OHSUMED, even with the worst base ranker, MMR algorithm is able to achieve
similar performance to the baseline methods when they use the best base tanker. This re-
sult further conﬁrms the effectiveness of the proposed algorithm for ranking reﬁnement.

We thus conclude that the MR algorithm is resilient to the imperfectness of base rankers.

78

4.4.4 Effect of Size of Feedback Data

To investigate the effect of the number of feedback documents on the performance, we ran

the MR algorithm by varying the number of feedback documents from 5 to 20. Figure 4.4
shows the result using varied number of feedback documents. We clearly observed that
the number of feedback documents have a direct effect on the performance of ranking
reﬁnement. However, even with a small amount of feedback, MRR is able to improve
the retrieval performance considerably, particularly for the accuracy of the ﬁrst few ranked
documents. We thus conclude that the proposed algorithm for ranking reﬁnement is robust
to the size of feedback data. Also notice that for data set NP2003, there is no changes in the
performance of MR with different relevance feedback. The reason is that the base ranker

in this case is not able to retrieve any relevant documents for most queries.

4.4.5 Results for Recommender System

We evaluated the generality of the proposed algorithm by applying it to recommender sys-

tem (movie recommendation). Figure 4.5(a) show the results of different algorithms when

applied on the MovieLens dataset. It is surprising to observe that the results of LRR, the lin-

ear ranking reﬁnement algorithm, even with the tuned parameter 7, is not comparable to the

the performance of the base ranker. In contrast, the MRR algorithm is able to signiﬁcantly

improve the accuracy of the base ranker and outperforms the other baseline algorithms con-

siderably. This result further indicates the importance of appropriately combining the two

information sources, i.e., the ranking information behind the base sranker and the feedback
information provided by users.

Figure 4.5(b) shows the sensitivity of MRR to the size of feedback data by varying the
number of movies rated by the test user from 5 to 25. Similar to the result for relevance
feedback, we observed that the size of feedback data affects the performance of MRR
considerably. However, even with 5 rated movies, the MR algorithm is able to make a

noticeable improvement in the ranking accuracy compared to the base ranker. This result

79

 

 

 

further conﬁrms the robustness of the proposed algorithm to the size of feedback data.

4.4.6 Time Efﬁciency of Ranking Reﬁnement

Figure 4.6 shows the efﬁciency of the MR algorithm in terms of the running time for
different numbers of rated movies for each test user. We chose movies data set for the
experiment because the number of rated movies varies signiﬁcantly from users to users,
making it easy for us to evaluate the computational efﬁciency of the proposed algorithm.
We partitioned the test users into groups where each group of users has a different number
of rated movies. The running time of MR for each group is calculated by averaging it
across all the users in the group. As pointed in Section 4.3.4 and seen in Figure 4.6, the
running time is linear in the number of instances. Note that the relatively long running time

is due to the MATLAB implementation.

80

 

—r- Base Ranker
+ Rocchio
—°— SVM

Mo Best_LRR
-~— Worst_LRR
—-— MRR

 

 

 

 

 

 

 

 

 

0.8

0.2

0 10 15 0 5

o 51b75 2o 25

Top Documents
HP2003

 

4 '-

_gangaaailaaaamaaaaaam‘

 

AAAAAAAAAA

A A A
............ vn"

 

 

 

o 1'0 15 2o 25

Top Documents
NP2003

1

 

 

 

 

Top Documents

 
 
 

OHSUMED

 

 

 

 

o 5 1‘0 15 21)
Top Documents

T02004

25

 

NDCG

 

 

 

o 5 1o 15 20
Top Documents
HP2004

 

 

   
 

 

 

 

 

o 5 1o 15 2o
Top Documents

 

 

 

 

 

o 5 1o 15 2o
Top Documents

Figure 4.2: NDCG of relevance feedback for different algorithms

81

25

OHSUMED

—0— Base Ranker-46
—+— MRR-46
+ Base Ranker-36
---o-- MRR-36

—-— Base Ranker-16

 

 

0.2 L

 

 

 

 

_._ _ o 5 1‘0 1‘5 20 2'5
MRR 16 Top Documents
TDZOOB TDZOO4

 

0.4’

NDCG

   

 

 

W W
00 5 10 15 20 25 00 5 10 15 20 25
Top Documents Top Documents

HP2003 HP2004

 

 

 

W W
00 5 10 15 20 25 00 5 10 15 20 25
Top Documents Top Documents

NP2003 NP2004

 

 

 

W W
00 5 10 15 20 25 0O 5 10 15 20 25
Top Documents Top Documents

Figure 4.3: NDCG of MR with different base rankers for relevance feedback

82

 

 

 

 

—1— Base Ranker
—o— MRR-5

—e— MRR-10
.-...9 MRR-15
-~— MRR-20
—-— MRR—25

 

 

 

TDZOO3

 

0.2 r

 

 

o 5 1b 15
Top Documents
HP2003

0.951 -_
0.9

2 0.75-
M 1
0.65

 

20

 

o 140 20
Top Documents
NP2003
0.8 '

0.7'

0.6 ’

NDCG

0.5’

 

0'40 5 1o 15
Top Documents

250

3‘0

25

0.9“

0.8* .

0.4

 

OHSUMED

 

 

 

 

5 1‘0 15 20 25
Top Documents

TD2004

 

5 10 15 2o 25
Top Documents
HP2004

 

 

 

0.2

5 1‘0 15 20 25
Top Documents
NP2004

 

 

5 1‘0 15 20 25
Top Documents

Figure 4.4: NDCG of MR with different numbers of feedback documents for relevance

feedback

83

rr—y .- W luv-al.:

 

 

 

 

  
 
  
  

    

 

 

 

 

 
  

 
  

 

 

 

 

Movie +3 8 R k .
1 . a e . an er MOV' “ -+-Base Ranker
'9' ROCChIO 0 9.
' -0- MRR-5
0 9 +SVM -e— MRR-1O
' r «a» Best_LRR 0.85 .t _a” MRR-15
8 0.8' « ‘ +Worst_LRR (9 0.3 +MRR-20
o 8 +MRR-25
ZQT zom-
0.61 0.7
0.50 5 1o 15 2‘0 25 0'650 10 20 so
Top Documents Top Documents
(a) NDCG chart (b) Sensitivity to the number of rated movies

Figure 4.5: The ranking result for rmommender system

3.5 I I I I I I I I

 

Time (Seconds)
... 5

A
01
I

L

 

 

 

0'5 1 1 J l L l I 1
o 50 100 150 200 250 300 350 400 450
Number of Movies

Figure 4.6: Running time of MMR for different numbers of movies rated by test users

84

1!

 

Chapter 5

Online Classiﬁcation with Bandit

Feedback

In this chapter, we consider the problem of online classiﬁcation with bandit feedback: in
each trial of online learning, instead of providing the true class label for a given instance,
the adversary will only reveal to the learner if the predicted class label is correct. Unlike
online learning with full feedback, learner here does not receive the loss value for all the
hypotheses in the hypothesis space after it chooses one, which demands a new approach
for an effective Ieaming. We present a general framework for online multi-class learning
with partial feedback based on the notion of potential [83]. The generality of the proposed
framework is veriﬁed by the fact that Banditron [5] is indeed its special case with the
squared L2 norm of the weight vector as the potential. Using the exponential potential,
we propose an exponential gradient algorithm for online multi-class Ieaming with partial
feedback that has the interesting property that its mistake bound is independent from the
dimension of data, making it suitable for classifying high dimensional data. Our empirical

study with the classiﬁcation data sets show that the proposed algorithm for online learning

with partial feedback is more reliable than Banditron.

85

 

 

 

 

5.1 Introduction

Online learning with partial feedback assumes that, in each trial of online learning, the
adversary only reveals to the learner if the predicted class label is correct and does not
provide the true class label for a given instance. Online learning with partial and full
feedback are equivalent when there are only two classes. Therefore, we assume it is clear
that the classiﬁcation problem is a multi-class one when we talk about online classiﬁcation
with bandit feedback.

Online learning with partial feedback is closely related to the problem of multi-armed
bandit which is the generalization of a traditional slot machine game, called one armed
bandit [84]. In multi-armed bandit, there are n arms to pull with unknown rewards. A
player aims to maximize its reward over the trials by Ieaming the best arm to pull. When
the player starts, he/she does not know which arm is more proﬁtable. It is only over the
trials that he/she learns the best arm to pull. In each stage of this game, the player needs
to decide if he/she is going to explore a new arm or exploit his/her knowledge by choosing
the best arm, a technique called exploration vs. exploitation tradeoff. This strategy helps
the player to constantly receive feedback for all arms.

The problem of online classiﬁcation with bandit feedback can be considered a multi-
armed bandit problem, with the feature vector of example available as a sort of side in-
formation; i.e., at each round, after observing an instance, the learner needs to decide a

class label (an arm). Although online multi-class Ieaming with full feedback has been ex-
tensively studied, the problem of online multi-class Ieaming with partial feedback is only
studied recently [5, 85]. The challenge in online Ieaming with bandit feedback is the fact
that after classifying a new instance, the learner only receives the loss value for the part
of the hypothesis space that have the same prediction as current hypothesis. To explore
different parts of the hypothesis space, the learner needs to sacriﬁce the chance of correctly
classifying the current instance in the hope that it ﬁnds the best model that minimizes the

long-term number of mistakes. We will give a detailed description of this strategy and its

86

characteristics in Chapter 6.

In this chapter, we propose a general framework to address the challenge of partial
feedback in the setup of online classiﬁcation. This general framework adapts the potential-
based gradient descent approaches for online Ieaming [83] to the scenario of partial feed-
back. The generality of the proposed framework is veriﬁed by the fact that banditron is
indeed a special case of our framework if the potential function is set to be the squared L2
norm of the weight vector. Besides the general framework, we further propose an expo-
nential gradient algorithm for online multi-class Ieaming with partial feedback. Compared
to the Banditron algorithm, the exponential gradient algorithm is advantageous in that its
mistake bound is independent from the dimension of data, making it suitable for classifying

high dimensional data. We verify the efﬁcacy of the proposed algorithm for online learning

with partial feedback by an extensive empirical study.

5.2 Related Work

Although introduced very recently and there is only a few work directly related, the prob-
lem of online multi-class Ieaming with bandit feedback can be traced back to online multi-
class classiﬁcation with full feedback and multi-armed bandit Ieaming. The former pro-
vides the required tools to handle the problem of partial feedback and the later offers a
starting point for the development of an online multi-class Ieaming with partial feedback.
Both these areas have been extensively studied and we only provide a brief review Several
additive and multiplicative online multi-class Ieaming algorithms have been introduced in

the literature [52]. Perceptron [43] and Winnow [86] are two such algorithms. Kivinen

and Warrnuth developed potential functions that can be used to analyze different online

algorithms [87]. Grove et al. [88] showed that polynomial potential can be considered as a

parameterized interpolation between additive and multiplicative algorithms.

Multi-armed bandit problem refers to the problem of choosing an action from a list

87

 

of actions to maximize reward given that the feedback is (bandit) partial [44, 89, 90]. The
algorithms developed for this problem usually utilize the exploitation vs. exploitation trade-
off strategy to handle the challenge with partial feedback [46, 47].

Multi-class learning with bandit feedback can be considered as a multi-armed bandit
problem with side information. Langford et al. in [85] extended the multi-armed setting to
the case where some side information is provided. Their setting has a high level of abstrac-
tion and its application to the multi-class bandit Ieaming is not straightforward. Banditron,
which can be considered as a special case of our framework, is a direct generalization of
Perceptron to the case of partial feedback and uses exploration vs. exploitation tradeoff
strategy to handle partial feedback [5]. Potential function and exploration vs. exploitation
tradeoff techniques are the main tools used to develop the framework in this paper.

Notice that the problem of bandit with side information has been also addressed in rein-
forcement learning under the name of Associative Bandit problems [91—94]; however those
work assume that the side information are i.i.d samples from an unknown distribution. This
is unlike our online approach that no assumption is made about the process that generates

data.

5.3 A Potential-based Framework for Classiﬁcation with

Partial Feedback

We ﬁrst present the problem of online classiﬁcation with partial feedback, followed by the

presentation of potential based framework and exponential gradient algorithm.

5.3.1 Problem Deﬁnition

We denote by K the number of classes, and by x1, x2, . . . ,xT the sequence of training
examples received over trials, where x,- 6 Rd and T is the number of received training

instances. In each trial, we denote by g, E {1, . . . , K} the predicted class label. Unlike

88

 

the classical setup of online learning where an oracle provides the true class label y,- E
{1, . . . , K} to the learner, in the case of partial feedback, the oracle only tells the learner if
the predicted class label is correct, i.e., [yt = 'y't]. This partial feedback makes it difﬁcult to
learn a multi-class classiﬁcation model.

In our study, we assume a linear classiﬁer for each class, denoted by W =
(W1, . . . , w K) E RdXK, although the extension to nonlinear classiﬁers using ker-
nel trick is straightforward. Given a training example (x, y), we measure its loss by
E (maxkﬂl ng — wa) where 13(2) = max(0, z + 1) is a hinge loss. We denote by
W1, . . . , WT a sequence of linear classiﬁers generated by an online learning algorithm
over the trials. Our objective is to bound the number of mistakes made by the online learn-
ing algorithm. Since the proposed framework is a stochastic algorithm, we will focus on

the expectation of the mistake bound. As will be shown later, the expectation of the mistake

bound is often written in the form

T
a<I>(U) + ﬁg! (gagixzuk — xguyt)
where U = (u1,. . . , u K) is the linear classiﬁer, <I>(W) : RdXK H R is a strictly convex
function that measures the complexity of the linear classiﬁers, and a and 5 are weight
constants for the complexity term and the classiﬁcation errors. Note that the Banditron
algorithm is a special case of the above framework where it measures the complexity of W
by its Frobenius norm, i.e., <I>(W) = %|W|%.. In this chapter, we design a general approach
for online learning with partial feedback that is adapted to any complexity measure <I>(W).

Finally, for the convenience of presentation, we deﬁne

K W =€(maxxth -xth > (5.1)
t( ) [696% t k t 31:

89

5.3.2 Banditron

Kakade et al. [5] developed Banditron for the problem of online classiﬁcation with bandit
feedback. Banditron, depicted in Algorithm 7, is basically Perceptron adapted to handle the
case of bandit feedback by utilizing the exploration vs. exploitation tradeoff technique. Af-
ter receiving a new instance xt, Banditron computes the primary class assignment if]; using
the weight matrix W“1 at Step 5, just like Perceptron. Using the exploration vs. exploita-
tion tradeoff parameter '7, the learner decides label 3}} at Step 6 and 7 which is either gift
(exploitation) or another random class label (exploration). After receiving a feedback, the
algorithm computes the update matrix xtrit which, on average, is equivalent to the update
matrix in Perceptron for the full feedback setting. Kakade et al. provided the following

mistake bound for Banditron in [5].

Bound for Banditron: Let K be the number of classes. After running over a sequence
of examples x1, . . . ,xT, with ||xt||2 g 1 for all t, the expected number of mistakes made

by Banditron, denoted by E[M], is bounded as follows

2mm} 2 [Morgan
EM<€U T 3 ,‘/U T + 5.2
[]_()+1+maX{7 Ilpr} 7 ()

where U is any arbitrary weight matrix (classiﬁer) and L.

 

 

 

5.3.3 Potential-based Online Classiﬁcation for Partial Feedback

Our framework, depicted in Algorithm 8, generalizes the Banditron algorithm [5] by con-
sidering any complexity measure <I>(W) that is strictly convex. In this algorithm, we intro-
duce 0 6 RdXK, the dual representation of the linear classiﬁers W. In each iteration, we
ﬁrst update at based on the partial feedback [3]; = ﬁt], and compute the linear classiﬁer Wt
via the mapping V<I>* (6), where <I>* (6) is the Lagendre conjugate of <I>(W). Similar to Ban-

ditron and most online Ieaming with partial feedback [83], a stochastic approach is used

90

 

Algorithm 7 The Banditron Algorithm

1: Parameters:

0 Step size: '7 > 0

2: Set wg = 0,11: =1,...,Kand90 = V<I>*(WO)

3: fort = 1,...,Tdo

4: Receive xi 6 Rd
Compute 37;} = arg maxls kg K ”(I‘VE—1
Seter = (1 -’7)[k = 17t1+7/K,k =1,---,K
Sample {it by distribution p = (p1, . . . , p K)-
Predict it and receive feedback [yt = 5,]

099.2939

Compute (it = 1.771: -— 12% 1311):}11 where 1 k stands for the vector with all its elements
31

being zero except its kth element is 1.
10: Compute Wt = Wt‘l — xtcitT
11: end for

to predict class assignment, in which parameter 7 > 0 is introduced to ensure sufﬁcient

exploration [44].

In the following, we show the mistake bound for the proposed algorithm. For the con-

venience of discussion, we deﬁne vector rt 6 Rd as

T, = 1% _ 1,, (5.5)

Proposition 3. For €t(W) 2 1, we have

ve,(W)=<Wt-1,X,T,T), and E,[5,]=T, (5.6)

where Et[] is the expectation over fjt and (St is deﬁned in (5.4).

We denote by D¢(A, B) the Bregman distance function for a given convex function (I),

which is deﬁned as follows

Dq,(A, B) = <I>(A) — 5(3) — (A — B, V<I>(B)) (5.7)

The following classical result in convex analysis summarizes useful properties of Bregman

91

 

Algorithm 8 Online Learning Algorithm for Multi-class Bandit Problem

 

1: Parameters:

o Smoothing parameter: 7 E (O, 0.5)
0 Step size: 17 > 0
0 Potential function: <I> : RdXK v—> R and its Legendre conjugate <I>* : RdXK I—+ 1R

2; Setw2=o,k= 1,...,Kand00=vq>*(W0)
3: fort=1,...,Tdo

 

 

4: Receive xt 6 Rd
5: Compute
{it = arg maac)c;rwt_1 (5.3)
1_<_lcSK
6: Setpk = (1 —7)[k=37t] +7/K,k= 1,...,K
7: Randomly sample ﬁt according to the distribution p = (p1, . . . ,pK).
8: Predict {it and receive feedback [yt = gt]
9: Compute
(it = 137t — 1§t[yt = gt] (5.4)
pA
yt
where 11: stands for the vector with all its elements being zero except its kth element
is 1.
10: Compute 6t = 6t"1 — nxtdg—
11: Compute Wt = V<I>(6t) where 6t = (6t,...,6§()
12: end for
distance.

Lemma 6. Let <I>(W) be a strictly convex function with constant p with respect to norm

H ‘ H i.e.,for any W and W’ we have

(W — W', V<I>(W) — WW» 2 mm — W’llz.

We have the following inequality for any 9 and 0’

<6 — 6', v<1>*<0> — V<I> * (6')) s ﬁne — 6’11:

where <I>*(0) is the Legendre conjugate of <I>(W) and H - H... is dual of norm || - H. Further-

92

more, we have the following equality for any W and W'
Dq,(W, W’) = 19.1,. (0, 0’),

where 9 = V<I>(W) and 6’ = V<I>(W’).

Proposition 4. For any linear classiﬁer U 6 Rd" K, We have the following inequality hold

for two consecutive classifier Wt_l and Wt generated by Algorithm 8

17.1,. (U, WH) — 0.1,. (U, Wt) + 13.1,.(Wt—1, Wt)

= —(U — Wt_1,nxt6;r) (5.8)

Proof. Using the property of Bregman distance function (see for example Chapter 11.2

in [83]), we have

0.1,.(U, WH) — 0.1,.(U, Wt) + 13.1,.(Wt-1, Wt) = (U — Wt_1, V<I>*(Wt) — V<I>*(Wt_1))
= (U _ Wt_1,6t _ gt—l)

= — <U -— Wt—1,nxt5;r>

The second step follows the property 9t = V<I>*(Wt), and the last step uses the updating
rule of Algorithm 8. C]

Now, we can bound E[|6t l2} as follows, with the proof provided in Appendix A.13.

Proposition 5. For any 3 > 0, we have
K 3 2/3
E 6 2 <_.L A _ l _
[Itlsl—1_7+[yt7éyt]{1 7+K(1+[7]) }

We use |W|p,3 to measure the norm of matrix W E RdXK with p 2 1 and s 2 1. It is

93

deﬁned as

W = max 11, Wv 5.9

I has lulpsws £1< > ( >
where u e Rd, v E RK, and |u|q and Mt are L, and Lt norm of vector u and v, respec-
tively. Evidently, the dual norm of I - Ip,s is l - I”, with p"1 + q“1 = land s_1+ t”1 = l.
The theorem below shows the regret bound for Algorithm 8. The proof of this theorem is
provided in Appendix A.14 ‘

Theorem 11. Assume that for the sequence of examples, (x1, 311),. . . , (XT, yT), we have,
for all t, xt 6 Rd, ||x||p 3 land the number ofclasses is K. Let U = (ul, . . . ,uK) E
RdXK be any matrix, and <I>* : RdXK r—> 1R be a strictly convex ﬁmction with constant

p with respect to norm I - lp,s- The expectation of the number of mistakes made by by

Algorithm 8, denoted by E[M], is bounded as follows

T
1 1 177T
EMS—D*U+— €U+——+T
[] m <1>() REA) 2pn(1_7) 7

where

_ 77 ’7 K3 2/3
“—1‘z{1‘7+k‘(”[?]) }

Notice that the Banditron algorithm is a special case of the general framework with
<I>*(W) = %|W|%. and | - Ip,3 = | - I22 = I - |F~ The Banditron bound is speciﬁcally
obtained through approximations 7/ (1 — 7) S 27 and 1+ Iii/’7 S 2k/7 in summarizing the

terms in n.

94

5.3.4 Exponential Gradient for Online Classiﬁcation with Partial
Feedback

In this section, we extend the exponent gradient algorithm to online multi-class Ieaming
with partial feedback. A straightforward approach is to use the result in Theorem 11 by

setting

K (1
«5(9) = [Zane-,1.) (5.10)

77‘
ll
l—l
a:
ll
H

’9;
1%

u
M»
Ma.

Wi,k(1an',k — 1) (5.11)
' 1

Fr
ll
H
s
II

where each wk is a probability distribution. Following the general framework presented in
Algorithm 8, Algorithm 9 summarizes the exponential gradient algorithm for online multi-
class Ieaming with partial feedback. Since <I>*(W) is strictly convex with constant 1 with

respect to | - | F, we have following mistake bound for the exponential gradient algorithm.

Theorem 12. Assume that for the sequence of examples, (x1, yl), . . . , (xT, yT), we have,
for all t, xt 6 Rd, ||x||2 g 1 and the number ofclasses is K. Let U = (111, . . . , uK) E
RdXK where each uk is a distribution. The expectation of the number of mistakes made by

by Algorithm 9 is bounded as follows

T

Kan 1 777T
EMS +— E U+————+ T

 

wherenzl—zzlp(l—7+%+§-).

By minimizing the mistake bound in the above theorem, we choose step size n as fol-

lows

 

 

_ K(1-'7)1nK
_\/ T7 (5.12)

95

 

Algorithm 9 Exponential Gradient Algorithm for Online Multi-class Learning with Partial
Feedback

1: Parameters:
o Smoothing parameter: 7 E (0, 0.5)
0 Step size: n > 0
2: Set (90 = llT/d
3: fort = 1,...,Tdo
4: Compute W153,c = exp(6§,k)/Z,tc where Zfc = 2L1 exp(6f,k).
5: Receive X): 6 Rd
6: Compute

i1} = arg maxxgrwt—1 (5.13)

lngK

$6th = (1 -7)[k = 37t1+7/K.k = 1,---.K

Randomly sample ﬁt according to the distribution p = (p1, . . . ,pK).
Predict 5t and receive feedback [yt = at]

10: Compute

999:1

[yt = 9t]

(St = 1A — 1~ —— (5.14)
y y
t t pilt
where 11: stands for the vector of all elements being zero except that its kth element
is 1.
11: Compute at = Ot‘l — nxtdg
12: end for

 

For the high dimensional data, we can improve the result in Theorem 12 by using the

following lemma. The proof of this lemma is provided in A.15.

Lemma 7. <I>(W) and <I>* (W) deﬁned in (5.10) and (5.11) satisﬁes the following properties

K
(W — w’, v<1>*(W) — V<I>*(W’)) 2 Z Iwk — wm
k=1
K
<6 — 0’. We) — V<I>*(6’)> 5 Z l6.,k — 61,..IE.

gr
II
b-l

where 0*): = (61,,“ . . . ,HdJc).

Using the above lemma, we have the following theorem that updates the result in The-

orem 12

96

Theorem 13. Same as the setup of Theorem 12 except that |x1|oo S 1. The expectation of

the number of mistakes made by by Algorithm 9 is bounded as follows

 

T
Kan+ 777T
EM < __
[ 1 +-Zlft(U) +2p.1_,)+rT

wherenzl—Qn—p(2—2’y—4ﬁ).

Proof The proof is the same as the proof of Theorem 11 except that we have

K 722
E[D¢(6‘ 165.61 12723 Zlénuxtlio s; Euatm
lc=1

A simple computation shows that E[|6t|1] = 2 — 27 — 47 / K . By combining these results,

we have the theorem. [:1

The major difference between Theorem 12 and 13 is the constraint on x: L2 is used
in Theorem 12 and Loo is used in Theorem 13. Therefore, Theorem 13 shows that the
exponential gradient algorithm is essentially independent from dimensionality d, making it

suitable for handling high dimensional data.

5.4 Experiments

To study the performance of the proposed framework, we applied the exponential poten-
tial algorithm introduced in 5.3.4 on the multi-class classiﬁcation data sets introduced in
Section 1.6.1.

We compared the classiﬁcation performance of the proposed exponential gradient algo-
rithm, Exp, to the Banditron algorithm. Since the exponential gradient algorithm assumes
all the combination weights to be non-negative, in order to make fair comparison between
the proposed approach and the Banditron algorithm, we run two sets of experiments for

Banditron, one which is the original Banditron and one that projects the learned weights

97

MNIST NURSERY

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

  

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

  
 
 
 

 

 

 
 

 

 

 

1 _
& Perceptron 0 6513‘ — Perceptron
0.8“; G Banditron_pos ' . 4}Banditron_pos
o 3‘. "a" Banditron 0-5 ’ \‘1 «n- , Banditron
i‘: --O-'Exp % 0.55" “€1.13 -o--Exp
s a-§§gg .3333
o 0 0 one o o--o ,0
0 - - 4 . . L
0 2 , . 4 6 0 5000 10000 15000
“8|an rounds x 10“ Training rounds
PROTEIN LETTER
0.8 1%.
—Peroeptron “a... a w
4} Banditron_pos 09' ' ‘-"béc‘—§.;3 -f= - ~ g. .n. _
, , m3-.. _. _ __n
0-7 tan-Banditron 08 8 0 9" '9
g on. ..0- Exp .3 - —Perceptron
5 0 '5 0.7- G Banditron_pos
LE ' ,5 06 ail-Banditron
0.5- G'ﬁ‘ygri‘éwsé . ( "0" Exp
0.5 \—
0.4 . r . l . ‘ ‘
0 0-5 , . 1 1-5 2 0.40 5000 10000 15000
Training rounds x 104 Training rounds
PENDIGITS OPTDIGITS
1 —-Peroeptron 1[
--D'Banditron_pos 5,,
0 8 “i ‘G'Banditron 0-8' #3; _. _
0: (83:3: cr- 0 Exp 0 g~§§? '8: ‘1 ~{: a»
a .‘ .. _ - " “ ﬂ ._ n- .- «... r» _ "‘E} __a
E :0.,'B"g.: G T - “ E— a S 0'6\ 0'”-3. STE“ 3- El
'5 0 6 Glue ‘ BlBT'ﬂ-‘O-«g h " -0- .0
I: °"-o---o..., 2 0,4- —Perceptron
m 0 0~ 0 LU
0 4 JCFBanditron_pos
' 0.2’ --D--Banditron M
0'20 2000 4000 6000 8000 00 1000 2000 3000 4000
Training rounds Training rounds
ISOLET
1a
.~a.=§.:.é.z-82-~32é8£ £8 ' '- p.
0.8 ‘ ”rial-0
.9 —Perceptron
E 0 6 0 Banditron_pos
g ~D~Banditron
0.4
0.2

0 2000 4000 6000 8000
Training rounds

Figure 5.1: The ﬁgure shows the error rates of different methods over trials with the best
setting of 7.

98

.1" '.-_ nun. .l.w.n_
_I

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

MNIST NURSERY
1 . . . . . -
—Perceptron —Perceptron
0_ ‘ --D--Banditron . «Gr-Banditron
2 h's Exp 0 Exp .
S 0.6' K," _ 'gﬁ ,G‘B‘IQ
a :-.,..ﬂ_.g-_,g:~..-err8~--5“‘ .‘ ,.a—-cr_,,.o---o '
0.2 Omo‘"
00 0:1 0:2 0:3 014 0.5 0 0:1 0:2 0:3 0:4
Gamma Gamma
PROTEIN LETTER
0.8 . . , 1 - .
—Perceptron 1%,;va
iii-Banditron 0.9 "---..o"'E'--a-..g.._3__ .-.a---u---m
07’ --o--Exp ""'o-~-o--~o.._., ...-3---o-~~o~---<>
g g 0.8: :
a 2. 9
“ 05f ‘ " 0-7 —Perce tron
g b, _ 11317.8 g . . -p
LU 1.5.3. __ _, .o-8'f_;87'---'8—“'0 LIJ 0.6: ﬂ- Banditron
0.5 9.....5.,,:g....8-~ ‘ o-Exp
0.5-
. . r r . .4 l . . .
040 0.1 0.2 0.3 0.4 0.5 0 O 0.1 0.2 0.3 0.4 0.5
Gamma Gamma
PENDIGITS OPTDIGITS
1 . g 4 1 . . ,
& ——Pereeptron é,‘ -— Perceptron
.: .,\ Kit-Banditron 0.80 'x' ~0- Banditron
o 0.829 K‘s ..0.. Exp 0 “a" ..0.. Exp V
E ‘-. "n-5," ,P'ua" 160.6» "0,“a’n'» x” “OMB"..-O
‘- 0.6' 21' ‘3'":13' ..0...,o."‘<> ._ 00300
E "o. ...-0: -o---o“' g 04-
LIJ '--.,o,..-0’ uJ '
0.4- 0.2
0.2 1 r 1 1 G . r . .
0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5
Gamma Gamma
ISOLET
1 . . . .
&. ---..- -- Perceptron
Emma‘s.- ... ‘G'Banditl'on
0.8- 8 ~-o--Exp
O
E
h 0.6-
E
w
0.4-
0'20 0:1 012 04.3 0:4 0.5
Gamma

Figure 5.2: Figure shows the ﬁnal error rates of different methods with varied 7.

99

 

into the positive orthants, which is equivalent to setting all the negative weights to be zero.
It is easy to verify that the projection step does not change the theoretic properties of Ban-
ditron, in particular the mistake bound (of course only with respect to linear classiﬁers U
in the positive orthants. We call this projected Banditron, Banditron_Pos.

For each tested algorithm and for each data set, we conduct 10 independent runs with
different seeds for randomization. We evaluate the performance of online Ieaming by the
accumulate error rate, which is computed as the ratio of the number of misclassiﬁed sam-
ples and the number of samples received so far during the online learning process.

Since these online algorithms rely on the parameter 7 to control the tradeoff between
exploration and exploitation, we examine the classiﬁcation results of all the algorithms in
comparison by varying 7. The step size 17 of online learning often play an important role
in the ﬁnal performance. For the proposed algorithm, we set the step size according to
Eq. 5.12. Because the exponential function may exceed the upper bound of a real number
with double precision type in a 64-bit computer, we further multiple the step size with a

small factor (typically 10—5) to avoid this issue.

5.4.1 Experimental results

Figure 5.2 compares the average error rates of the online algorithms with varied 7 values,
and Figure 5.1 shows the average error rates of the three online methods over the entire
online process. For the proposed algorithm and both version of Banditron, we choose the
optimal 7 that results the lowest classiﬁcation error rate.

First, by examining the classiﬁcation performance with varied 7, we clearly see that the
exponential gradient algorithm shows comparable performance compared with the original
Banditron algorithm for online multi-class learning with limited feedback. In particular,
we observe that the proposed algorithm performs signiﬁcantly better than the Banditron
algorithm for three data sets ’OptDigits’, ’Pendgitis’, and ’Nursery’. The result indicates

that the proposed algorithm is overall more reliable. Notice that for all data sets except for

100

 

’Nursery’ data set, we observe a signiﬁcant gap between online Ieaming with full feedback
and online learning with partial feedback, which is due to the limited feedback from the
adversary.

Second, we compare the learning rate of all three algorithms. We observe that the pro-
posed algorithm overall exhibits a signiﬁcantly better learning rate than the Banditron_Pos
algorithm (i.e. Banditron with positive weights), for most data sets and most part of the
online Ieaming process. This result indicates that the proposed online Ieaming algorithm
with partial feedback is generally effective in reducing the error rate.

Finally, notice that these algorithms are sensitive to the choice of parameter 7. In
Chapter 6, we provide more details on the exploration vs. exploitation tradeoff parameter

7 and provide effective algorithm to automatically tune it.

101

”L‘s. .... ...—m,

 

Chapter 6

Robust Online Classiﬁcation With

Bandit Feedback

As we have already seen in Chapter 5, exploration vs. exploitation tradeoff strategy is the
main tool to develop online classiﬁcation algorithms with bandit feedback. The major prob-
lem with utilizing this strategy is the sensitivity of the resulting algorithm to the exploration
vs. exploitation tradeoff parameter. In this chapter, we propose three learning strategies to
automatically adjust the tradeoff parameter for Banidtron. Our extensive empirical study
with multiple real-world data sets veriﬁes the efﬁcacy of the proposed approach in learning

the exploration vs. exploitation tradeoff parameter.

6.1 Introduction

Exploitation vs. exploration tradeoff strategy has been widely applied to develop online
learning techniques when the feedback provided to learner is bandit, i.e. the learner only
receives the cost of its action but not the cost of other possible actions. Exploration refers to
the choice of an action not recommended as the best action by the current model (classiﬁer).
It allows the learner to explore the game and receive the feedback for different strategies

and gain new knowledge from the adversary. Exploitation refers to choice of the best action

102

 

according to the current knowledge in order to maximize the gain. These two objectives are
complementary, but opposite: exploration leads to maximization of the gain in the long run
at the risk of losing short term reward; exploitation maximizes the short term gain at the
price of losing the gain over the long run. A careful tradeoff between these two objectives
is important to the success of any online learner utilizing the combined strategy.

The challenge of online classiﬁcation with bandit feedback is that after classifying an
instance, the learner only receives the loss value for those hypotheses that have the same
prediction as the current hypothesis. This means that the learner is not able to explore
the whole hypothesis space if it only classiﬁes according to the current hypothesis. As
described in Chapter 5, Banditron [5] utilizes the exploration vs. exploitation tradeoff tech-
niques to handle this challenge. This tradeoff is explicitly captured by a single parameter
7 6 (0, 0.5) in Banditron: with probability 1 - 7, the learner will predict the most likely
class label based on the current classiﬁcation model (exploitation), and with probability 7,
the learner will randomly choose one of the remaining class labels for prediction (explo-
ration).

Figure 6.1 shows the performance of Banditron for different data sets by varying the
value of 7. The best 7 values for data sets ’Protein’, ’Pendigits’, ’Isolet’, ’Nursery’, ’Opt-
digits’, ’Letter’, and ’Mnist’ are respectively 0.1, 0.25, 0.2, 0.15, 0.25, 0.35, and 0.15. It
is clear that the performance of Banditron strongly depends on the value of 7 and and it is
therefore very helpful to develop strategies to automatically tune this parameter. Intuitively,
at the beginning of the learning stage, due to the fact that classiﬁcation model is trained by
a limited number of examples, it is likely that the classiﬁcation model will perform poorly.
As a result, it may be more desirable to have a large value for 7. As the Ieaming procedure
proceeds, the classiﬁcation model is updated with sufﬁciently large number of examples,
and therefore is likely to yield accurate classiﬁcation performance. Therefore, it is desir-

able to reduce the value of 7 and the amount of exploration with increasing number of

 

Notice these plots are extracted from Figure 5.2.

103

 

 

 

 

 

MNIST NURSERY
0.3 . . - 0.6 . a
0.7 0.55.
m
E 0.6» ‘5
III 0.5 El
04 0.45:
I . J r 0.4 r . A r
0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4
gamma gamma
PROTEIN LETTER I
0.6 . - 0.96 . . .1
'1
0.58 I g
‘3 0.56- L
E
m 0.54-

 

    

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

0'50 0.1 0.2 0:3 0:4 0'840 0:1 0:2 0:3 0:4
gamma gamma
PENDIGITS OPTDIGITS
1 ~ - 0.9 . .
0.9-
§0.8- f3
3 §
LEI 0.7 m
0.6-
0'50 0:1 0:2 0:3 0:4 0.5 0'50 0:1 0:2 0:3 04 0.5
gamma gamma
ISOLET
0.95 . .

 

Error rate

 

 

 

L

 

0 0:1 0:2

03

0.4

 

gamma

Figure 6.1: The error rates of Banditron with different choice of 7 for different data sets

104

training examples. Theoretically, as suggested by [5], the choice of 7 = 0(tTI/3) pro-
duces the optimum result in the agnostic case. This is because by minimizing the mistake
bound provided in Inequality 5.2 with regard to 7, we obtain 7 = O(t‘1/3). However, as
we show in this chapter, the adaptive choice of parameter 7 should be not only dependent
on time t but also dependent on the number of correctly/incorrectly classiﬁed instances to

control the speed in which we reduce the amount of exploration.

6.2 Related Work

To the best of our knowledge, this is the ﬁrst study that aims to learn the exploration vs
exploitation tradeoff parameter for online classiﬁcation with bandit feedback. Since explo-
ration vs. exploitation tradeoff parameter is widely utilized for the problem of multi-armed
bandit, here we brieﬂy describe the tuning techniques for this parameter in multi-armed
bandit [6]. However, none of these methods utilizes the classiﬁcation speciﬁc information,
e.g. the number of mistakes.

In the simplest form, called 7-ﬁrst strategy [6], a pure exploration phase is followed by
a pure exploitation phase [46, 95]. Evan-Dar et al. [46] showed that to obtain an a-optimal
arm with probability 1 — 6, 0(EKZ log(éf—D rounds of exploration is needed. The problem
with this approach is that it cannot produce arbitrary small regrets. A second approach,
called 7-decreasing strategy [6], is similar to the approach proposed by Kakade et al. [5]
for the problem of online classiﬁcation with bandit feedback. In this approach, 7 is a
decreasing function of time t. Several decreasing function have been proposed including
7t = 0G) [6], 7t = 06—0—ng) [96]. and 7t = Otis) [5].

Another approach, the Boltzmann exploration, chooses each arm with a probability
proportional of their obtained reward [96]. A temperature parameter can be utilized to
smoothly switch from pure exploration to a pure exploitation. Notice that except 7-

decreasing strategy proposed in [5], no theoretical results are known for the other methods

105

Algorithm 10 The Banditron Algorithm
1: Set wg = O,k=1,...,Kand00 = V<I>*(W0)
2: fort = 1, . ..,Tdo
3: Receive xt 6 Rd
Compute Q} = arg maxls ks K xtwac—l
Choose sampling probability ’Yt
Setpk = (1 -7t)[k =17t1+7t/K.k = 1..-..K
Sample 31;} by distribution p = (121, . . . , p K)~
Predict ﬁt and receive feedback [yt = ﬁt]
Compute 6; = lilt — lgtiﬁtiytl

V where 1k stands for the vector with all its elements
yt
being zero except its kth element is 1.

10: Compute Wt = W“1 - xtdtT
1 1: end for

4:
S:
6'.
'7:
8
9

introduced here for the problem of online classiﬁcation with bandit setting.

6.3 Balancing between Exploration and Exploitation

6.3.1 Preliminary

Algorithm 10 shows the Banditron algorithm [5], which is exactly the same algorithm as
given in Algorithm 7 however it uses an adaptive 7t to emphasize that the exploration vs.
exploitation tradeoff parameter changes over time. Theorem 14 provides a new form for

the mistake bound of Banditron. The proof of Theorem 14 is provided in Appendix A.16.

Theorem 14. Let K be the number of classes. After running over a sequence of examples

x1, . . . ,xT, with ”Xt ”2 _<_ 1 for all t, the expected number of mistakes made by Banditron,
denoted by E[M], is bounded as follows

T T T T
A K A
EiMlSWli‘i'E :(t(U)+E 2% +13 E:’7t[yt=ytl+ E :Tﬂlyti‘éyt] (6-1)
t=1 t=1 t=l t—l

where U is any arbitrary weight matrix (classiﬁer) and {7t}?=1 are exploration vs. ex-
ploitation tradeoﬁr parameters of trials.

106

 

Remark: It is important to realize that by a proper re-scaling of the complexity of U and

margin as suggested in [97], the bound provided in Theorem 14 can be rewritten as:

 

 

T T T T
ElM] S EMU) + E :72: + IUIF 23 ZTtigt = 11:] + 2%[1’11 75 yr] (6-2)
t=1 t=1 t=1 t=1
which is the inequality used in [5] to obtain the bound of Bandtiron given in Equation 5.2.
More speciﬁcally, the bound in Equation 5.2 is obtained by two relaxations in Inequal-
ity 6.2: [ﬁt = gt] 3 1 and W S x/E + x/b. Moreover, notice that €t(U) is the hinge
loss with margin equal to 1 in both inequalities 6.1 and 6.2.

Given the bound provided in Theorem 14, the optimal set of sampling probabilities
{7t};F=1 will be evidently obtained by minimizing the mistake bound stated in Theorem 14,
i.e.

T K T
L 2 Z _[gt 7g yt] + Z7t(1+[17t = ytI)
t=1 7‘ t=1

However [31} ¢ yt] and [ﬁt = yt] are not provided as the feedback in the bandit setting and

we need to approximate them with an expectation in terms of 17. We consider the following

approximations in Section 6.3.2 :

 

' [lit = yt] = Et [Efﬁglgtzyd] S Et [5? =§_3t=y 1] S 2Et [[17t = 01111;} = yd] =
yt

m

' [lit 3'5 yt] = 1 — [371: = tn] 5 1 - Et [Ii/‘1: = @101: yd] = Tt

To understand the merit of the above approximations, we analyze the following two ap-

proximations in Sections 6.3.3 and 6.3.4 respectively and use them as the competitors in

the experiments.
' [and S 1and[@‘t=yt1 Sﬂt

0 [ﬁt 2 yt] g 1 (which is the relaxation used in [5]) and [1h 9'5 9t] S ”Ft

107

 

 

6.3.2 Finding Optimal 7 using [3’], 7E gt] S 7', and [3, = 3),] 3 pt

In order to bound the quantity L = 2:1 753} + 221:1 7t(u + 1), we consider a general
family of 7 that is deﬁned based on a concave function. We introduce the concept of good

support ﬁmction.

Deﬁnition 15. A function 02(2) deﬁned in the domain of 2 2 0 is called a good support
function if it satisﬁes the following conditions: (a) 02(2) is concave for z 2 0 and 02(0) 2 0,
(b) 02(2) is monotonically increasing, i.e., 02’(2) > 0, for 2 2 0, (c) 02(2) is Lipschitz
continuous with Lipschitz constant L, i.e., 02’ (2) S L, for 2 2 0, and (d) there exists a

constant p _>_ 1 such that for any t 2 0 and 2 Z 0, we have w’(2) S ptw' (2 + t).

Proposition 6. (1) 02(2) = (a + 2))‘, with A E (0, 1] and a > 0, is a good support ﬁmction,
with Lipschitz constant L = Aux-1, and p = 6(1—A)/ 0, and (2) 02(2) = ln(a + 2) with

a > 0 is a good support function, with L = l/a, and p = el/a.

The proof of this Proposition is provided in Appendix A.17.

In order to bound quantity L, we introduce two good support functions 021(2) and 022 ( 2) ,

with 021(2) 2 05(2) for any 2 2 0. We deﬁne

WE (213 1 + 6:)
7t : I t—l
2W1 (22:1 Ti)

 

(6.3)

It is straightforward to verify that ’Yt E (0, 1 / 2]. In addition, since 021(2) and 022(2) are two
concave functions, 021(2) and 02'2 (2) are non-increasing functions of 2, leading to a decreas-
ing function of M and t and increasing function of 'rt. The following proposition shows a

key property for the construction of 7t in 6.3. The proof is provided in Appendix A.18.

108

 

Proposition 7. Given the construction of 711 in (6. 3), we have the following inequalities:

 

 

 

T 022 (T) T 2 “’2 (Zirzl Ht)
2’72: S 102 2 T , 27%: S 102 I T
t=1 2011 (thl Tt) t=1 20)] (Zt=1 Tt)
T K 021 (2:31:10)
2: _Mt S 2P1K 2 T
t=1 7t “2 (21:1 1 + Mt)

where p1 and p2 are the constants deﬁned in Deﬁnition 15 respectively for 021 and 022.

Theorem 16. Let 021(2) and 002(2) be two good support ﬁtnctions with 021(2) Z 02% (2) for
any 2 Z 0. By running Algorithm 10 with ’Yt set as in Eq. (6.3), we have the following

bound for the expected number of misclassiﬁed examples

 

 

 

T
p2w2 (T) 2P1 K021 (3T) 022(2T)
EIMI S #21001) + 202’1(T) + IUIF (\/ 02’2(3T) +P2 202((2T)

where p1 and p2 are the constants of two good support ﬁmctions.

Proof. The proof is straightforward by using the result in Remark 1, Proposition 7, in-
equality Va + b S ([6 + x/b, and considering the fact that for a good support function 02:
02 (233;, 7,) _<_ am 5 02(2T) and 02' (23;, p.) 3 02’(2T). III

Now, using the above theorem and Proposition 6, we have:

Corollary 17. Suppose 7t is in Eq. (6.3) with 021(2) 2 (1 + 2))‘1 and022(2) = (1 + 2))‘2
where A1, A2 6 (0, 1] and A1 = A2 + 1 / 3. By running Algorithm 10, we have the following

bound for the expected number of misclassiﬁed examples

 

1
3K
E[M]<:lt()U +:—2—( (1+T)3+|U|p ”2:2 (1+3T)3+\/p——§_T(1+2T)3

where p2 = (Bl—AZ. This bound is of 0(T2/3) and similar to the bound of the original

Banditron.

109

 

Proof. It is a simple plug-in of the two support functions in Theorem 16. CI

6.3.3 Finding Optimal 7 using [39} 7é yt] _<_ 1 and [’y} = gt] 3 at

In this section, we use the the upper bound approximation L = 2&1 % + 23:1 ’Yt(1+ltt)-

Given a good support function 02(2), we deﬁne ”it as

t—l
1 I
7, = 51:02 (21+ 11,-) (6.4)

It is straightforward to see that ’71: is valid since ’7t 6 [0, 1 / 2]. In addition, since 02(2) is
a concave function, 02’ (2) is a non-increasing function of 2, leading to a decreasing value
for 7,3 as more and more training examples have been classiﬁed correctly. The following
proposition shows a key property for the construction of 7t in (6.4), with the proof provided
in Appendix A.19.

Proposition 8. Given the construction of ’Yt in (6.4), we have the following inequalities

l/\

 

T
_p— 5 2KLT
an< _ 21101:” : Z713 2Lw(T)’ 1,27: 0)’(thr=11+ﬂt)

Using the above proposition, we have the following theorem for the mistake bound of

dynamic 7 introduced in 6.4.

Theorem 18. Let 02(2) be a good support function. By running Algorithm 10 with ’Yt set as
in Eq. (6.4), we have the following bound for the expected number of mistakes made by the

 

 

algorithm
EIM] < {i +p02(T) + IUI 02(2T) + 2KLT
_ t—l F p 2L 02’ (3T)
Proof. Similar to Theorem 16. E]

110

 

The following corollary directly follows from the result of Proposition 6 and Theo-
rem 18.

Corollary 19. By running Algorithm 10 with 7), as in Eq. (6.4) and 02(2) = (1 + 2)“) , where

)1 E (0, 1], we have the following bound for the expected number of classiﬁcation mistakes

T 1-I\ 1-I\ A l-A 1
E[MISE €t(U)+E——(1+T)’\+|U|p t-3——(1+2T)?+\/2K(1+T) 2 T2
t=1 2’\ V20

When A = 2/ 3, we have E[M] = 0(T2/3) which is the same convergence rate as

Banditron.

6.3.4 Finding Optimal 7 using [fit 7é 3),] S T, and [3’], = yt] _<_ 1

Similar to the approach presented in the previous sections, we set ”It as

I
t
n = , “221 (6.5)
2021 (22-21 Ti)

where 021(2) and 022(2) are two good support functions and 021(2) _>_ 025(2). It is easy to

verify that 7t 6 (0, 1/ 2] due to the properties of a good support function. The proposition

below allows us to bound 2:le ’Yt and 2&1 K / ”Yt-

Proposition 9. Given the construction of 7t in (6.5), we have the following inequalities

 

T K ., (21:1 6) T 626)
— < 2K & <
Z; ’Yt Tt — p 02$(T) g,” — 2011 (El; Tt)

Proof. Similar to Proposition 7. III

Theorem 20. Let 021(2) and 022 (2) be two good support ﬁmctions. By running Algorithm 10
with 7t set as in Eq. ( 6.5), we have the following bound for the number of misclassiﬁed

111

 

examples M = ELIE: 75 ytl

T
022(T) w2(T) M
51M] 5 EMU) + 20', (23:1 n) + 'U'F (V 202; (T) + \/2K”1 025(1))

Proof. The proof directly follows Theorem 14 and Proposition 9. CI

 

 

The following corollary directly follows from the result of Proposition 6 and Theo-

rem 20.

Corollary 21. By running Algorithm 10 with ’11: in Eq. (6.5) and 021 = (1 + 2)’\1 and
022 = (1 -I- 2))‘2 with A1, 2\2 E (0, 1], we have the following bound for the expected number

of misclassiﬁed examples

T
1 A _
E[M] _ t§=1:8t(U) + —2/\1(1+T)

1 ,\ +1—,\ 2k l-Al A +1-,\
+ IUIF (l2—/\1(1+T)_2T'l+(l :2 (marl—24

with A1 = A2 + :1; we have E [M] = 0(T2/3) which is of the same rate as Banditron.

 

6.4 Experiments

In this section, we conduct experiments on the classiﬁcation data sets, introduced in 1.6.1,
to validate the proposed strategies for balancing the tradeoff between exploration and ex-

ploitation.

6.4.1 Experimental Settings

We refer to the algorithms developed in Sections 6.3.2, 6.3.3, and 6.3.4 as banditron_ag3,

banditron_agl and banditron_ag2. To evaluate the classiﬁcation performance of the

112

 

three proposed Ieaming strategies for exploitation vs. exploration tradeoff parameter 7,
we compare them with three different version of Banditron, namely, Banditron_worst,
Banditron_Best, and Banditron_ag0. Banditron_worst and Banditron_Best are Ban-
ditron algorithm when 7 is set to the worst and best value for a given data. Banditron_ag0
is the Banditron with the adaptive ’Yt = %t-1/ 3 as suggested in [5] for the general ag-
nostic case. We repeat each experiment 50 times by generating random sequences of in-
stances and report the average accumulate error rates, which are computed as the ratio
of the number of misclassiﬁed samples to the number of samples received so far. For
all three proposed methods in all the experiments, we use similar good support functions
02(2) = 021(2) = 022(2) = (1 + 2)’\ with A = 0.1 for a fair comparison. Also notice that

the result is pretty stable for most of these data sets with different values of A.

6.4.2 Experimental results

To study the behavior of different Ieaming algorithms over trials, we show the average error
rates of all the methods over the entire online process in Figure 6.2. First notice that there
is big gap between Banditron_worst and Banditron_Best in all data sets that emphasizes
that the Banditron algorithm can perform very poorly if 7 is not set appropriately. We
observe that overall the proposed algorithms exhibit similar or better learning rates as the
Banditron algorithm with the optimal 7. In particular, banditron_ag2 and banditron_ag3
yields the best performance among the algorithms in comparison. In almost all the data sets,
banditron_ag2 and banditron_ag3 perform signiﬁcantly better than banditron_agO which
suggests that 7t 2 %t—1/3 is not a good adaptive choice. As a few examples, notice
that the ﬁnal error rate of banditron_agO is 45% versus 38% error rate of banditron_ag2,
banditron_ag3 and banditron_best for MNIST data set. For Pendigits data set, the ﬁnal error
rate of banditron_ag2 and banditron_ag3 is 56% which is signiﬁcantly low compared to
60% error rate of banditron_best and 62% error rate of banditron_agO. The latter example

also suggests that our adaptive strategy is better than the Banditron with a single best 7.

113

 

Error rate

Error rate

Error rate

 

---B-- Banditron_Best
- o-~ Banditron_worst
- 0 - Banditron_ago

‘ Banditron_ag1
- + - Banditron_agZ
- 4+ — Banditron_Ag3

NURSERY

o
0000000

:éat}; \
0.45» :8 83233 g, 8

 

 

 

 

 

 

 

0'40 5000 10000 15000
Training rounds
LETTER
1r
0953, ...,
\\ °~$ '8'”- «jj, - 1’ ~ ‘
‘ - ~0‘e6- "mm:
0.9' ‘: \ '0- -
a‘ﬂzﬁ. e-$ 0-0
"‘3‘: ._
0.85- 1.13:4;
0'80 5000 10000 15000
Training rounds
OPTDIGITS
I ‘28: ‘1‘). -
0 7 “WK: ~ ‘-" . (1.:
#1: : :#‘°~
06 28:33; ~70~°
8213532
0.5

0 1000 2000 3000 4000
Training rounds

MNIST

 

 

  
 

 

 

‘I
o ~ _ ,
E ‘ "if” . _
h- » ‘3- ’: 1 ‘ 5, .
Lg0.5 ~u~n~~2~~22 '2. 2
0 - . .
0 12- . g 6
raining roun s 4
x 10
PROTEIN
0.8-
£2
9
go 6 . Mg....0...0...o...0..,0...o .0
‘1' -‘ . I '1 51-2.
LU ahﬁﬂiﬁ$_#_$
0.4 - .
0 T . . 1 d 2
raining roun s 4
x 10
PENDIGITS

 

 

 

i{'0-“0- -o-~--o 2.0..
.-,, o.
0-8'\:‘:.. 0 0111050
0) ~' ‘
*5 it,
§0.7’ 1k at" :9 .
m “in...“ ‘9?
0.6 ‘15» 51— g 8
‘ “-1
0'50 2000 40800 6000 8000
Training rounds
ISOLET
1.
0.95 ' o
E x" if; 2.3%“.
12- 0.9- 9:0. ~; «5-,..
‘ ‘0- -
LU ﬂ‘mﬁ. .6 ONO
0.85» 3;. . _

 

 

0'80 2000 4000 6000 8000

Training rounds

 

Figure 6.2: The error rates of different methods over trials. Each point on a curve is the
average results of 50 randomly generated sequences of data.

114

Although better than banditron_worst, the performance of banditron_agl is not com-
parable to that of the other methods. This can be explained by the inherited difference
between banditron_agl and the other two proposed approaches. Unlike banditron_ag2 and
banditron_ag3 where two good support functions are introduced to determine 7t, the ”It
deﬁned in banditron_agl is determined by a single good support function. As a result, we
have a better control of the value for 7 over time in banditron_ag2 and banditron_ag3 by a
tradeoff between two functions: one which is the decreasing function of time and the other

which is the increasing function of the number of misclassiﬁed examples.

115

 

Chapter 7

Conclusion and Future Work

In this chapter, we summarized the main contributions of this thesis and draw some direc-

tions for future work.

7 .1 Summary and Conclusions

We developed several online and batch learning algorithms in this thesis. The batch Ieam-
ing algorithms that we covered have the common property that they all utilize boosting for
optimizing an objective function in a function space. Utilizing boosting is particularly ben-
eﬁcial because it allows any existing supervised learning algorithms be applied for a new
learning task. For the online Ieaming, our focus has been on the classiﬁcation with bandit
feedback. In the following subsections, we brieﬂy review our main contributions in two

separate sections, one for boosting and one for online Ieaming with bandit feedback.

7.1.1 Boosting

We developed boosting algorithms for several classiﬁcation and ranking problems, as sum-

marized below.

0 Semi-supervised classiﬁcation: Unlike existing semi-supervised learning algo-

116

 

 

rithms that focus on binary classiﬁcation problems, we addressed the problem of
multi-class semi-supervised learning directly. We proposed a new framework, termed
multi-class semi-supervised boosting (MCSSB), that is able to improve the classiﬁ-
cation accuracy of any given base multi-class classiﬁer. MCSSB utilizes both the
cluster and manifold assumptions in the design of objective function and exploits
boosting techniques to optimize the objective function. We showed that our proposed
framework is able to improve the performance of a given classiﬁer much better than
Assemble, a well-known semi-supervised boosting algorithm, on several real world
data sets. We also showed that MCSSB is very robust to the choice of base classiﬁers,

the number of labeled examples, and the value of parameter C.

Learning to rank by maximizing NDCG: Listwise approach is a relatively new
approach to Ieaming to rank that aims to optimize listwise loss functions; i.e. loss
functions that measure the performance of a ranking model in the query-level. The
difﬁculty in optimizing such losses lies in the inherited sort function used for comput-
ing them. We address this challenge by a probabilistic framework for the problem of
maximizing NDCG that optimizes the expectation of NDCG over all the possible per-
mutations of documents. We present a relaxation strategy to effectively approximate
the expectation of NDCG, and a bound optimization strategy for efﬁcient optimiza-
tion. Our experiments on benchmark data sets shows that our method is superior to

the state-of-the-art learning to rank algorithms in terms of performance and stability.

Ranking Reﬁnement: We considered the problem of ranking reﬁnement whose goal
is to improve a given ranking function by a small number of labeled instances. The
key challenge in combining the ranking information from the base ranker and the
labeled instances arises from the fact that the information in the base ranker tends
to be inaccurate and the information from the training data tends to be noisy. We

presented a multiplicative objective function to combine these sources of information

117

and proposed a boosting algorithm for learning. Empirical studies with relevance
feedback and recommender system show promising performance of the proposed

algorithm.

7.1.2 Online Learning

0 General framework: We presented a general framework for online multi-class
learning with partial feedback using the potential-based gradient descent approach
of which Banditron is a special case. In addition, we proposed an exponential gra-
dient algorithm for online multi-class Ieaming with partial feedback. Compared to

the Banditron algorithm, the exponential gradient algorithm is advantageous in that

 

its mistake bound is independent from the dimension of data, making it suitable for
classifying high dimensional data. We veriﬁed the efﬁcacy of the proposed algo-
rithm by empirical studies with several real-world data sets. Our experiments show
the exponential gradient approach for online learning with partial feedback is more
effective than Banditron in terms of the Ieaming rate, which makes it more suitable

for the scenario when the number of training examples is relatively small.

0 Automatic tuning of trade-off parameter : We studied the problem of optimizing
the exploration—exploitation tradeoff in the context of online classiﬁcation with bandit
feedback. We proposed three different strategies to automatically tune the tradeoff
parameter used by the Banditron algorithm. We showed through extensive experi-
mental study that the proposed approaches are effective in adjusting the exploration-
exploitation tradeoff. In particular, we found that two of the proposed algorithms

achieve similar or better performance compared to Banditron with the best value for

7.

118

7 .2 Future Work

In this section, we summarize future research directions that are directly related to the
theme of this thesis, in two separate subsections, one for boosting and one for online Ieam-

ing.

7 .2.1 Boosting

There has recently been increasing interests in understanding the relation between game
theory and machine learning and furthermore examining how each ﬁeld contributes to the
other [98, 99]. Particularly, boosting can be considered a ﬁctitious zero-sum game [39]
between two agents: a data generator as a row player that chooses a mixed strategy over
the space of training examples and a learner as a column player that chooses strategies
over the hypothesis space. The followings are some interesting game theory questions for

boosting:

e Representability of a given hypothesis for an speciﬁc task: Using Minimax the-
orem, Freund et al. [39] showed that there is a mixed strategy over the space of
hypotheses H that produces zero classiﬁcation error over the training set if for any
mixed strategy over the training examples, there is one hypothesis in H able to per-
form better than random guessing. Similar results may be extended to other tasks
that also utilize boosting. For example, we utilized the space of binary classiﬁers to
learn a ranking algorithm that maximizes NDCG in Chapter 3. It is interesting to
study the ability and limitation of binary hypotheses in maximizing NDCG; i.e. to
analyze the maximum value of NDCG obtained by a mixed strategy over the binary

hypotheses given.

9 New methods to ﬁnd mixed strategies: Boosting (and other ensemble methods)
can be considered methods to ﬁnd the mixed strategy over the hypothesis space.

However, the designer of these methods did not have the notion of equilibrium in

119

mind while developing them. Designing new algorithms that directly consider the
data generator as the row player and the learner and the column player and aim to
ﬁnd a equilibrium solution is potentially advantageous and interesting. One possible
option is to learn a ﬁnite set of weak models sequentially (similar to boosting) and

then playing a game to ﬁnd the best weighted majority votes (mixed strategy).

0 Batch learning with partial feedback: In this problem, the feedback (i.e. labeling)
is similar to online learning with partial feedback except that training instances are
provided in batch mode. For instance, consider the multi-class learning problem
where each instance is given a class label and a ﬂag that indicates whether or not
the given class label is correct. Similar to online Ieaming with partial feedback,
contextual advertisement and recommender systems are some example applications
of this problem. For these problems, training examples can be collected and utilized
for learning in batch mode similar to the click-through ranking feedback that is being
used in learning to rank. Designing a boosting algorithm that utilizes a supervised
classiﬁer for this problem is one direction of research work. From the game theory
point of view, this problem can be considered a game between two players with

partially known payoff matrix.

7 .2.2 Online Ieaming

Online learning with bandit feedback is a new research area for which there are several
open research questions, as summarized below:

0 Tighter bounds: Kakade et. al [5] proved that there exists algorithms for online
classiﬁcation with bandit feedback with bounds of order 0(T1/2), however the algo-
rithms that are introduced so far are of order 0(T2/3). Developing algorithms that
have better regret bounds than existing ones is one of the future research directions.

0 Online Ieaming to rank and multi-label classiﬁcation with partial feedback:

Contextual advertising and recommender systems are originally ranking problems

120

that were simpliﬁed as multi-class problems when dealing with online partial feed-
back. An intermediate setting between online ranking and online classiﬁcation with
bandit setting is online multi-label classiﬁcation in which more than one class (adver-
tisement) are relevant. Developing algorithms for online Ieaming to rank and online
multi-label classiﬁcation with partial feedback is another research direction that will

be explored in the future.

121

 

APPENDICES

122

Appendix A

APPENDIX

A.l Proof of Lemma 1, Chapter 2

Proof Bound in Equation (2.8) can be derived as follows:

 

 

 

1 1 (22::b,b’exp(abf’)><z$__1b,-exp<abk»
T = —‘:m 1." 12:1: 1: k+ k
Zm- k’— 1bi b]. z:k:=1bz'bjemewi‘l-hj»
7" m .k. -— hk +hk))
kl k, kl k, (213:1 szexp( a( ' j
3 (Z bi exp(ahz- )) (Z bi exp(ahj )) x Em bkbl‘?
klzl k’zl k: 1 2 J
[:1 k2 Tk3-
m b b
k2 k
= Z ———ij ’j exp (amt-k1 + h].2 _hz'3 — $73))
k1,k2,k3=1 2,]
1 + exp(6cr) + exp(—6a) exp(6c1) — 1 m k k k
irj 2!] =1

The inequality in (A.l) follows the convexity of reciprocal function, i.e.,

1 1 1
m k k k k = k k m k k k
Zk=1bibj exp(a(hi +hj)) Z:Ic-lbibjEmitt=1Tz'j(”Cpmwi +hj))

 

 

——-—1——bkbk 27'2": jexp(—a(hk +hk)
23k: 1 2 J k=1

123

 

The inequality in (A.1) follows the convexity of exponential function, i.e.,

k1 k2 k3 k3

 

 

 

 

k k k k 60 +
hfl + hi2 — h2g3 — h’73 + 2 1 —hf1 — hi2 + hk3 + h’73 + 2
S J 6 ‘7 exp(6a) + 5 exp(6a) + '7 6 ‘7

Bound in Equation 2.9 can be derived as follows

I
-:- Z yz- kexp(Hk —Hk+a(hJ k —hk))

 

 

 

l
22' m’ k’ ,=k 1
1 m bk k
+ exp(6a) + exp(-— 6a) +e_x______p( (6a) k’ z y,
s 1 >3 >3b, —--,
32w _ bk’ b.
k- k’ =1 2 2
The inequality used by the above derivation follows the convexity of exponential function, i.e.,
I I
, h’.c —h’?+2 —h’.c +h"?+2 1
exp(a(hl-c — hk)) < exp ﬁa—z———J——-— + 0 x Z J + 60—
% .7 _ 6 6 3
I I
hf —h’?+2 1 415° +h’?+2
S ——6—J——— exp(6a) + 5 exp(6a) + 6 3

Using the deﬁnition of dbfj, we have the result in Equation 2.9.

A.2 Proof of Lemma 2, Chapter 2

Proof. Following the result in (A.1), we have

1 m bklbkz k3 k k k k
37 Z —-'7y—’Jexp(a(hil +11]? —hz.3 —hj3))
751.7 k1,k2,k3=1 3:]

1 exp_(___2cr)- kk kkexp(201)—1 k+ k
5 75+— 2th]. (Zh'bi'b thbj 22gb]. Zak?” +3“

 

124

The inequality in (A.l) follows the convexity of exponential function, i.e.,

hkl

'i
k1 ’62 ’63 k3 _ 6"
0x 2 J " 3

k2 k3 k3
+11]. hi h]. +2

 

+

1
+ 603

 

E3+2 —h

hfl + W hb°3 — h 1
exp(6a) + 3 exp(6a) +

< j z
_ 6

 

 

Bound in Equation 2.9 can be derived as follows

m
_ k k’_ k k’_ k
..H 2 yiexp(Hj Hj+a(hJ hj))
7'7] kl,k=1

1 + exp(60:) + exp(—6a) exp(6a) - 1 E: hk m k’ bf yzlc
z + 6 i _ _ —
32' ' k’=1

 

|/\

 

y .
J k
1,] k=1 bf, bi

The inequality used by the above derivation follows the convexity of exponential function, i.e.,

k’ k

I
41%“ +hf+2
6 +0X 6 +60§

IA

 

2 exp 60

exp<a(b’-°' — 1.59))

I I
hk—h’?+2 41’? +h§7+2

2 J 1 z
-—————-6 exp(6a) + 3 exp(60:) + 6

|/\

 

Using the deﬁnition of qbfj, we have the result in Equation 2.9. D

A.2 Proof of Lemma 2, Chapter 2

Proof. Following the result in (A.1), we have

’61 k2 1:3
1 2 b J m k1 k2 k3 k3
a] k1,k2,k3—l 1.7

1 exp(2a) — 1 m k k k k exp(—2cr)— 1 m k hk hk
5 T+—'T._ 2"be ”ij +‘7f‘3 201,31 2' + j)
Z 2Z 2
2,] 1,] k=1 1:] 11:21

124

The inequality in (A.l) follows the convexity of exponential function, i.e.,

hf1+h§2 —hf3 —h§3+2
6a +
—h. —h- +h- +h- +2
2 _7 7. J

 

 

 

 

 

 

1
hkl + hk2 41:63 — h’73 + 2 1 41:71 — h’72 + hk3 + hi3 + 2
g '7 exp(60:) + — exp(60z) + J J
6 3 6
Bound in Equation 2.9 can be derived as follows
i...
,E
1 _ k k’ k 'L
if — Z: y’i €Xp(H - — Hj + GUI-7' _hj )) .'
7'1] k’,k= 1 71, i
m m b’? k
1 + exp(6a) + exp(—ﬁa) exp(601) — 1 k k, z 312
l + 23b.- 2 b- ———
32 6 J bk’ bk
zrj k=1 ”=1 2 2

The inequality used by the above derivation follows the convexity of exponential function, i.e.,

 

 

I
k’ k hf —h§+2 —hf’+h§+2
exp(a(hi — hj» S Exp 60——-6—_ + 0 X 6 + 605
k’ k k’ 1:
hi —hj+2 1 —hz- +hj+2
< _ _
_ 6 exp(6a) + 3 exp(6a) + 6
Using the deﬁnition of (bi-‘7 j , we have the result in Equation 2.9. CI
A.2 Proof of Lemma 2, Chapter 2
Proof Following the result in (A.1), we have
’91 k2 k3
m b2- b
1 , k2 k k
27. Z —-J—lexp(a(hzk1+hj2—h.3 —hj3))
2,] k1,k2,k3=1 'i,J
1 e___xp(2cr)— k k k k +e_____xp(- 20) - 1 k+ k
5 75+—— 22,3 1(thbi+ ++hjbj ___-{__er Zafj (11 +111.)

124

The inequality in the above derivation follows the convexity of exponential function (similar to the proof of

Lemma 1). For Z‘j ,we have

I I I
kbk h’? h‘? 2—h’? —h’?
21 yk —exp Za—L-Za—2—+0——l———’Z-
21].: kk’— 2 bj 2 2 2
I I
kbk hk hk m kbk 2— hk— hk
3 Z yz b—-:k Texp(2a)+—2-exp(— 2a) +21 y;c BF——%—
+k,=k’ bj

k,k’=1

Replacing 1 / Z 1‘ , and l/Zzl j in (2.8) and (2.9) with the above bounds, we have the result in Lemma 2. Cl

A.3 Proof of Theorem 4, Chapter 2

Using Lemma 2 and Theorem 3, we have

_ < __ ___— _
F F _ l/Au+CA1 (Au+CAl)+ Bu+CBl(B“+CB’) (Au+CAl+Bu+CBl)

—(,/Au+CA,—,/BU+CB,)2,

which is equivalent to
2
1_ (VZU‘l'CZ —‘/Bu+CBlj
F
2
(_(ﬁZu't'Cz —‘/Bu+CEl> ) (A2)
F .

"ul "m
/\

|/\

exp

The above inequality follows from exp(:2:) 2 1 + x. We rewrite FT as

T
t=1

By substituting Ft / Ft"1 with the bound in Equation A.6, we have the result in the theorem.

125

 

 

A.4 Proof of Proposition 2, Chapter 3

 

 

 

 

 

1 1
-k ~k = k k k k
1+exp(Fz. —Fj) 1+exp(Fi _Fj +a(fi "'fj ))
k k
= ( + 1 J eXP(a(fik - ff»

1 + expwzk _ F119) 1 + exp(Fz.k — FJ’F) 1 + exp(Fik — FJ’F)
exp(Fz-k — Ff) exp(Fz-k — FJ’F)

1
1 — +
1 + exp(Fik — F3153) ( 1 + exp(Fik — F?) 1 + exp(Fik — F319)

|/\

 

 

 

exp<a<ff - do)

 

1 k k k
= . . +7~exrba(f--f-)-1
1+ exp(sz - F341”) 2’3[ < J z 1

The ﬁrst step is a simple manipulations of the terms and the second step is due to the convexity of inverse

function on 12+.

A.5 Proof of Lemma 4, Chapter 3

1 = Z Pr(7rk|F,qk) + Z Pr(7rk[F,qk)
«keG§(i,j) wkerb'J)
= Z Pr(7rk|Feqk) (1 + exp [who — xk(j))<F(d§eqk> - F(dibqk))] )
IrkEGIﬂiJ)
2 Z (Prekinqk) (1 + exp [Z(waeq’“) - F(dibqkﬂm
IrkGGgﬁJ)

(1 + exp [2(F(df,qk) — F(df, qk))]) Pr (aka) > #0))

We used the deﬁnition of Pr(7rk IF, qk) in Equation (3.6) to ﬁnd G§(i, j) as the dual of G50, j) in the ﬁrst
step of the proof. The inequality in the proof is because wk(z‘) — Irk( j ) _>_ 1 and the last step is because

Pr(7r’c IF, qk) is the only term dependent on 7r.

126

A.6 Proof of Theorem 5, Chapter 3

In order to obtain the result of :1}: Theorem 5, we ﬁrst plug Equation (3.13) in Equation (3.11). This leads to
minimizing 22:1 2173!“: 1 2%,?ij [exp(a(f;-° — fz-k))] , the term related to a . Since fz-k takes binary

values 0 and 1, we have the following:

 

 

 

Getting the partial derivative of this term respect to a and having it equal to zero results the theorem.

 

A.7 Proof of Theorem 6, Chapter 3

First, we provide the following proposition to handle exp(a( f f — fz-k)).

Proposition 10. Ifas, y E [0, 1], we have

exp(3;1) - 1(x _ y) + exp(3a) + e:p(—30r) + 1 (A3)

 

exp(a($ - 31)) 5

Proof Due to the convexity of exp function, we have:

z—y+1 1—x+y 1

exp(oz(a: — y)) = exp(3a 3 + O x —3—— + 3 x —3a)
— 1 1 — l
3 32—3; exp(3oz) + ——::£ + 5 exp(——3cr)

Using the result in the above proposition, we can bound the last term in Equation (3.13) as follows:

91“,,a[exp( (ff—If>—1]<_0§j(———exp(3a) (If— If)+ exp<3a)+e:p(’3a)’2) (AA)

127

Using the result in Equation (AA) and (3.13), we have M(Q, 17‘) in Equation (3.] l) bounded as

 

M(Q,F)sM(Q,F)+v<a>+el——3——p(3a) ZZZ“ 26,-‘1, (If—ft)

mk 7.119

k
1'.
__ - exp(3a)— k 22 —2.7 k
—M(Q,F) +7(a)+ —— 2 E3132 Tam
k=12=1 3:1

A.8 Proof of Theorem 7, Chapter 3

Proof By plugging Equation 13 into Equation 11, we have

rk
ijzb) —M(Q,F) s 21:42:}: 1 Z—gﬁbﬁj [exp<e<f,’-c —f.-’°)> — 1]

Since fik takes binary values 0 and 1, we have the following

 

 

n mk {bk-10k k k

2:1 :1 Z [0,,jexp<a(f,- —I,-))—1]

= 2"]:

n 77113211»?—

=2 Zk ——0{f (exp<a>I<I,’°>I,-k>+exp(- e1) <ff<I,-’“))
k=1i ,j=1

n mk 2rz:_1

—k§jm Z b,’f,-(I<f,’-“>I§>+I<Ij<ff+1<ff=ff>)

128

" "_I'.:|"I\?V
I
"sl

 

 

So,

 

n "U; 27'2-
M<Q.F,~’°)—M<Q,F) s 2 :— b,’i,(exp<a)I<I,-’°>If)+exp<—a>1<ff<ff))
k=1i z,j=1
n mk
— Z 23:— ‘ (1(fk>fk)+1(fk<fk+1(fk do)
k=12,j=1
11 ml: 2r- _
S. eXP(a)E:1 23212 ,-I(f,k>f,k)
= l
+ exp(— 0):“: f— jI(fk<f‘k)‘01-O‘2 i
k=1i,j=1

= exp(a)012 + exp(—a)a1k — 01 — 02

= exp(-— 2log(—-)ag + exp(--—log(: ;)al — a1 — 02
= WE‘LLE—ar -02 - -(\/_-\/a_2)2

which is equivalent to

M(Q,F,-") < 1_(,/cT—,/cx—2)2
M(Q,F) _ M(Q.F)

spec-we?)

|/\

M(Q,F)

The above inequality follows from exp(x) 2 1 + 2:. We rewrite MT as

   

T
Millet—b

‘ t
By substituting I}??? with the bound in Equation A.6, we have the result in the theorem.

129

mf‘

 

(A5)

(A.6)

 

A.9 Proof of Theorem 8, Chapter 4

Proof First, note that the objective function Lp is convex in terms of F. This is because Lp can be expanded

as follows:

12
Lp = Z Tm-Wz-J exp(Fj — F,- + Fk - Fl)
2', j,k,l =1

Since exp(Fj — Fz- + Fk — Fl) is a convex function, Lp is convex. Since Lp is a convex function, the solution
found by minimizing Lp will always be global optimal, instead of local optimal.

Second, to show that the optimal solution found by minimizing Lp is Pareto efﬁcient, we prove by contra-
diction. Let F* denote the global minimizer of function Lp. By assuming that Theorem 8 is not correct,
there will exist a solution F aé F* that either (1) WWW) < e’ﬁ'w(F*) and (2773 (F) g ﬁﬂF“), or (2)
e’ﬁ'w(F) S e’ﬁ'w(F*) and e’ﬁ't(F) < €r7t(F*). We can easily infer Lp(F) < Lp(F*) since (1) both
e’ﬁ'w and e’Frt are non-negative for any solution F, and (2) Lp = (27710 x (277}. Clearly, this conclusion

contracts the fact that F* is a global minimizer of Lp. El

130

.ﬁi- j.

 

 

A.10 Proof of Lemma 5, Chapter 4

Proof Since F(x) 2 F(x) + a f (x), we have

i n n
L—: ( 2: WM eXp(Fj —F,- +O(fj _fz'») >< ( 2 TM exp(Fj -F'i +0‘(fj -fi)))

( Z 0233' 6XP(0(fj -fz')) ( Z: bi,j exp(an —fi))

i,j=1 2',j=1

where ai , j and bi , j are deﬁned in (4.11) and (4.12). Thus, we have an upper bound of the log ratio as follows

i n 11
long log ( Z aiaj exp(a(fj — fi») +10g ( Z bi,j exp(a(fj - ft»)
iJ=1 iJ=1

n
—2 + 2: (am- + bi,j)exp(a(fj — fi»
z',j=1

|/\

The second inequality follows the concaveness of the logarithm function, i.e., logx _<_ x — 1 for any a: >

0. C]

A.11 Proof of Theorem 9, Chapter 4

Proof Using the upper bound expressed in Lemma 5, we have

~

L n
mfg-+2 5 Z 7,,jexp<a(fj—f,->)
z',j=1

= ( Z 1i,j6(fj,1)6(f,-,O)) exp(a) + ( Z 7,,j6(fj,0)6(f,~,1)) exp(—a)
z',j=1 2',j=1

Using the deﬁnition of a in (8), we have

 

 

I: ’n. n
logfg 5 -2+2 (2 7i,j5(fj,1)5(fi,0)) (Z 7i,j5(fja0)6(fia1))
z',j=1 i,j=1

= —2 + Law

131

{L- v.1

 

In the above, we use the deﬁnitions of p and V in Theorem 8 to simplify the expression. Since 2 =

22j=1 72.1.7. 2 p + V, we have

~

L
log?)2 5 —2+2,/;u/ S —p—V+2,/;w= — (ﬁ— WV
P

 

We thus have
Li 2
log Lt—l S rt : " (W'- M)
P

Substituting the above expression for rt into (4.17), and further using the fact

11.
L0 = Z T,,,-+W,-,,-,
231:1

we obtain the result in Theorem 9. E]

A.12 Proof of Theorem 10, Chapter 4

Proof We rewrite the quantity 17 as follows:

72 n n 'n
'7 = ZfiinWi|=Zfi Eng-71¢ = Z 7i,j(fi“fj)=”‘”
i=1 i=1 '=1 gj=1

Since

u—v=(¢ﬁ—W)(¢ﬁ+ﬁ)2(¢ﬁ—m2,

we have n 2 (ﬂ — W)? Substituting this result into the expression of Theorem 9, we have Theorem 10.

C]

132

A.13 Proof of Proposition 5, Chapter 5

Proof

yt yt pm
8

2-

Etnatlﬁi = Et [

 

[ =12
1,. _1~_U_

1_ [ﬁt = ﬁt]
p371:

 

 

 

= {3’1}: ytlEt [

1A __ 1~ [yt = yt]

+ [gt 7’; yt ]Et yt yt ~
yt

 

 

 

2
3% #w

_ A_ 7(K-1)/K
_ [yt‘ydm

[Milli—m”err/3}

fir—7+[iit #yt]{1_7+% (1+ [g]s)2/s}

|/\

A.14 Proof of Theorem 11, Chapter 5

We take the expectation of both sides of the equality in (5.8) with respect to ﬁt, denoted by Et [-,] and have

E, [04,...(U, Wt_1) — 0,1,... (U, Wt) + D¢*(Wt—1,Wt)]

(Wt—1 — U, nxt'rtT )

We deﬁne Mt = [fit 76 yt]. Since 37, 79 yt implies V€t(Wt—1) = xtTtT, using the convexity of the loss

function, we have

(€t(Wt_1)—€t(U))Mt S (Wt—l—UaWt(Wt—1))

= (Wt—1 — U, xtrtT)

133

We thus have

1 _ _
6E1; [D¢*(U,Wt 1)—D¢*(U,Wt)+Dq,*(Wt 1,Wt)]

<W“1 — Umxnf) 2 (et<W‘-1)— et(U))Mt

By adding the above inequalities of all trials, we have

T 1 T
E [Z (“Wt—1)] - {;D¢*(U) + :Q(U)}
t=1 t=1
1 T t 1 t T 1 t 1 t
g —ZE[D¢*(W- ,w )] =Z-E[Dq,(9— ,0 )]
"t=1 t=177

The last step uses the property of Bregman distance in Lemma 6. Since <I>* is a strictly convex function with

constant p with respect to H - ||p,3, according to Lemma 6, we have
D (A B) < 1 A B 2
q) , _ 2p” II,,,
where p"1 + q—1 = 1 and 5‘1 + t"1 = 1. Hence,

E[Dq,(6t—1,0t)]

l/\

2
77 T 2
ZEllxt‘st lq,t]

2 2
77 2 2 77 2
EEH‘Stlslxtlp] S 2—pEH5tls]

|/\

where the second inequality is due to Holder’s inequality. Using the result in Proposition 5 and the fact

27le €t(Wt-1)Mt 2 £3le Mt, and that E[M] g E[M] + 77‘ we have the result in the theorem.

134

A.15 Proof of Lemma 7, Chapter 5

(W — W’, V<I>*(W) — V<I>*(W'))

||
IM
(5:1

[)2

Kd(w WW“;
=22 ’

 

 

Maul

The second equality is due to mean value theorem and uses the Taylor expansion of log function where

W = AW + (1 — A)W’ with A 6 [0,1]. Since

(2?: 1 Wak—
22?:

 

__ r ,
21“]ch Z M231: ‘Wz'Jc'
and 2:121 W“: =1, we have
K
(W — W', V<I>*(W)— VcI>*( W’)) >kz| |w,c — chﬁ

Using the property of Bregman distance in Lemma 6 and the fact the dual norm of L1 is Loo, we have the

result for <I>(0).

A.16 Proof of Theorem 14, Chapter 6

Proof Considering that Banditron uses a second-order potential function, we have the following bound when

37 is used as the predictor:

1 1
2 MW) — Z W) s min + §|Wt—1 — th%~ s Ivlfp + 5E [WW]

|/\

|U|F+E{Z’Yt[yt =ytl+ Z —lyt #11111}

t- l

135

where we used Theorem 11.1 of [83] and Lemma 6 in the ﬁrst inequality, let ”2 5 1 in the second inequality,
and Lemma 5 of [5] in the third inequality. Using E[M] S 2:le £t(W) concludes the theorem when we

add E [2221 7,] to get the bound for 37 [5]. [:1

A.17 Proof of Proposition 6, Chapter 6

Proof. We only show the result for w(z) = (a+ z)>‘. A similar derivation can be applied to w(z) = ln(a+ 2:).

We have

L = max———)‘——=

A—l
Aa
220 (a + z)1’)‘

To derive p, we have

(a + z + t)1_)‘

( + )1_’\ S(1+t/a)1—)‘Set(1—’\)/a

 

Hence p = e(l_’\)/a. E]

A.18 Proof of Proposition 7, Chapter 6

Proof By deﬁning At = 2:21 Ti, we have

Kwi(At_1)(At — At..1) < 2K 2;"; w'1(At—1)(At - At—l)
t—l —
wé (22:1 1 + #2) wé (2:1 1 + M)

 

 

T T
2 K 2 2
—Tt :-
t=1

where the inequality is because w'2 is a non-increasing function. We also have:

T T
Z w'1(At-1)(At — At—l) S pl 2: c0'1(At)(2‘1t - At—l)
t=1 t=1
T
S p1 2601040 - w2(At—1) S p1W1(AT) (A7)
t=1

where the ﬁrst step is due to the deﬁnition of good support functions, the second step is due to the concavity

of “’1 and the last step is due to the telescope property and the fact that w1(0) 2 O. Combining the above

136

 

results produces the ﬁrst inequality in the proposition. The proof of the second inequality in the proposition

is similar and follows. By deﬁning Bt = 2::1 pi, we have

 

T T I T I

w (Bi—1 +t-1) w (Bt—l)

2:7tl‘t = E, 2 ___—2 (Bt-Bt—l)
t=1

B—B_ S
H 2w’1(At-1) (t ‘1) 22404-)

2&1 w§(Bt—1)(Bt — Bt—1)<Zt__1 pgw’2(Bt)(Bt - Bt— 1) < P%W2(BT)
_ 2w'1(AT) — 2w1(AT) _ 2w’1(AT)

 

Notice the third equality is because wé (Bt—l) S pgt w’2(Bt) and #t S 2. Similarly, we have:

T T w; (B T I _
_ ”(2 t—1+t 1) _ _ “’2“ 1) _ _
tan — ; 2‘0th 1) [t (t 1)]sZ———2w,(At_)[t (t 1)]

 

5 20421411 )Zw2(t— 1)[t— (t—1)]_2—————L‘J,1‘E2 ﬂzw'ﬂtﬂt- -(t-1)]
< P2w2(T)
— I
20.21 (AT)
D
A.19 Proof of Proposition 8, Chapter 6
Proof Similar to the argument in Section 6.3.2, we have
T T T 1 I p2
2 WM: 22—11] 2 ”2‘ =2 R” (At— 1) (At At—l) < EZwMt)
t=1 t=12 t=1
A similar argument can be applied to prove the second and third inequality in the proposition. El

137

BIBLIOGRAPHY

138

Bibliography

[1] N. Cesa-Bianchi and G. Lugosi. Prediction, Ieaming, and games. Cambridge Univ Pr, 2006.

[2] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning.
Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.

[3] Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data
Min. Knowl. Discov., 2(2):]21—167, 1998.

 

[4] Robert E. Schapire. The strength of weak leamability. Journal of Machine Learning, 5:197—
227, 1990.

[5] Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Efﬁcient bandit algorithms for

online multiclass prediction. In International Conference on Machine Learning ’08, pages
440—447, 2008.

[6] J oannes Vermorel and Mehryar Mohri. Multi-anned bandit algorithms and empirical evalua-
tion. In In European Conference on Machine Learning, pages 437-448. Springer, 2005.

[7] Alon Altman and Moshe Tennenholtz. Ranking systems: the pagerank axioms. In EC ’05:
Proceedings of the 6th ACM conference on Electronic commerce, pages 1—8, New York, NY,
USA, 2005. ACM.

[8] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun.
ACM, 18(11):613—620, 1975.

[9] Jay M. Ponte and W. Bruce Croft. A language modeling approach to information retrieval. In
SIGIR ’98: Proceedings of the 21st annual international ACM SIGIR conference on Research
and development in information retrieval, pages 275—281, New York, NY, USA, 1998. ACM.

 

[10] Hamed Valizadegan, Rong Jin, Ruofei Zhang, and Jianchang Mao. Learning to rank by op-
timizing ndcg measure. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. 1. Williams, and
A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1883—1891 ,
2009.

 

[1 1] T. Joachims. Optimizing search engines using clickthrough data. In SIGKDD, pages 133—142,
2002.

[12] Tie-Yan Liu. Learning to Rank for Information Retrieval. Now Publishers Inc, 2009.

[13] Yison g Yue, Thomas Finley, Filip Radlinski, and Thorsten J oachims. A support vector method
for optimizing average precision. In SIGIR 2007, pages 271—278, 2007.

139

[14] Kalervo Jiirvelin and Jana Kekiil'ainen. Ir evaluation methods for retrieving highly relevant
documents. In SIGIR 2000, pages 41—48, 2000.

[15] Koby Crammer and Yoram Singer. Pranking with ranking. In Advances in Neural Information
Processing Systems I 4, pages 641—647. MIT Press, 2001.

[16] Ping Li, Christopher Burges, and Qiang Wu. Mcrank: Learning to rank using multiple classi-
ﬁcation and gradient boosting. In Neural Information Processing Systems 2007, Cambridge,
MA, 2008.

[17] Ramesh Nallapati. Discriminative models for information retrieval. In SIGIR 2004, pages
64—71, 2004.

[18] Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Support vector Ieaming for ordinal
regression. In ICANN 1999, pages 97—102, 1999.

[19] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efﬁcient boosting algorithm
for combining preferences. Journal of Machine Learning Research, 4:933—969, 2003.

[20] C. J. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. N. Hul-
lender. Learning to rank using gradient descent. In International Conference on Machine
Learning, pages 89—96, 2005.

[21] Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, and Hsiao—Wuen Hon. Adapting
ranking svm to document retrieval. In SIGIR 2006, pages 186—193, 2006.

[22] Ming Feng Tsai, Tie yan Liu, Tao Qin, Hsin hsi Chen, and Wei ying Ma. Frank: A ranking
method with ﬁdelity loss. In SIGIR 2007, 2007.

[23] Rong Jin, Harned Valizadegan, and Hang Li. Ranking reﬁnement and its application to infor-
mation retrieval. In W 2008, pages 397—406, 2008.

[24] Tao Qin, Tie yan Liu, Ming feng Tsai, Xu dong Zhang, and Hang Li. Learning to search web
pages with query-level loss functions. Technical report, 2006.

[25] Christopher J. C. Burges, Robert Ragno, and Quoc V. Le. Learning to rank with nonsmooth
cost functions. In Neural Information Processing Systems 2006, 2006.

[26] Zhe Cao and Tie yan Liu. Learning to rank: From pairwise approach to listwise approach. In
International Conference on Machine Learning 2007, pages 129—136, 2007.

[27] Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to leam-
ing to rank: theory and algorithm. In International Conference on Machine Learning 2008,
pages 1192—1199, 2008.

[28] Michael Taylor, John Guiver, Stephen Robertson, and Tom Minka. Softrank: optimizing non-
smooth rank metrics. In WSDM 2008, pages 77—86, 2008.

[29] Maksims Volkovs and Richard S. Zemel. Boltzrank: Ieaming to maximize expected ranking
gain. In International Conference on Machine Learning, page 137, 2009.

[30] K. P. Bennett and Ayhan Demiriz. Semi-supervised support vector machine. In Neural Infor-
mation Processing Systems, 1999.

140

[31] O. Chapelle and A. Zien. Semi-supervised classiﬁcation by low density separation. In 10th
Int. Workshop on Al and Stat, 2005.

[32] Avrim Blum and Shuchi Chawla. Learning from labeled and unlabeled data using graph
mincuts. In International Conference on Machine Learning, 2001.

[33] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using gaussian ﬁelds and
harmonic functions. In International Conference on Machine Learning, 2003.

[34] D. Zhou, O. Bousquet, T. La], J. Weston, and B. Scholkopf. Learning with local and global
consistency. In Neural Information Processing Systems, 2003.

[35] M. Belkin, 1. Matveeva, and P. Niyogi. Regularization and semisupervised learning on large
graphs. In Internation Conferent on Learning Theory, 2004.

[36] Yoav Freund. Boosting a weak learning algorithm by majority. In COLT ’90: Proceedings of
the third annual workshop on Computational Ieaming theory, pages 202—216, San Francisco,
CA, USA, 1990. Morgan Kaufmann Publishers Inc.

[37] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. Annals
of Statistics, 29:1189-1232, 1999.

[38] Ruslan Salakhutdinov, Sam Roweis, and Zoubin Ghahramani. On the convergence of bound
optimization algorithms. In Uncertainty in Artiﬁcial Intelligent, pages 509—516, 2003.

[39] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Inter-
national Conference on Machine Learning, 1996.

[40] Pavan Kumar Mallapragada, Rong Jin, Anil K Jain, and Yi Liu. Semiboost: Boosting for
semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence,
31:2000—2014, 2009.

[41] Yr Liu, Rong Jin, and Anil K. Jain. Boostcluster: boosting clustering by pairwise constraints.
In KDD ’07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 450—459, New York, NY, USA, 2007 . ACM.

[42] Jun Xu and Hang Li. Adarank: a boosting algorithm for information retrieval. In SIGIR 2007,
pages 391-398, 2007.

[43] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization
in the brain. Psychological review, 65 2386—408, 1958.

[44] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic
multiarrned bandit problem. SIAM Journal of Computing, 32(1):48—77, 2003.

[45] C. Watkins. Learning from delayed Rewards. PhD thesis, Cambridge, 1989.

[46] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Pac bounds for multi-armed bandit and
markov decision processes. In Internation Conferent on Learning Theory ’02, pages 255—270,
2002.

[47] Shie Mannor and John N. Tsitsiklis. The sample complexity of exploration in the multi-armed
bandit problem. Journal of Machine Learning Research, 5:623-648, 2004.

141

[48] K. P. Bennett, A. Demiriz, and R. Maclin. Exploiting unlabeled data in ensemble methods. In
KDD, 2002.

[49] K. Chen and S. Wang. Regularized boost for semi-supervised learning. In Neural Information
Processing Systems, 2007.

[50] Hamed Valizadegan, Rong Jin, and Anil K. Jain. Semi-supervised boosting for multi-class
classiﬁcation. In ECML PKDD '08: Proceedings of the European conference on Machine
Learning and Knowledge Discovery in Databases - Part 11, pages 522—537, Berlin, Heidel-
berg, 2008. Springer-Verlag.

[51] S. Robertson and D. A. Hull. The tree-9 ﬁltering track ﬁnal report. In TREC9, pages 25-40,
2000.

[52] Koby Crammer, Yoram Singer, and K. Warmuth. Ultraconservative online algorithms for
multiclass problems. 3:2003, 2003.

[53] Shijun Wang, Rong Jin, and Hamed Valizadegan. A potential-based framework for online
multi-class Ieaming with partial feedback. In International Conference on Artiﬁcial Intelli-
gence and Statistics ’10, 2010.

[54] A. Frank and A. Asuncion. UCI machine learning repository, 2010.
[55] Chih-Chung Chang and Chih-Jen Lin. Libsvm : a library for support vector machines, 2001.

[56] Tre-Yan Liu, Tao Qin, Jun Xu, Wenying Xiong, and Hang Li. Letor: Benchmark dataset for
research on learning to rank for information retrieval. In LR4IR 2007, 2007.

[57] GroupLens. MovieLens Data sets. http://www. grouplens.org/node/ 12, 2006.

[58] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-
correcting output codes. J. AI Res., 2:263—286, 1995.

[59] Rong Jin and Jian Zhang. Multi-class Ieaming by smoothed boosting. Mach. Leam.,
67(3):207—227, 2007.

[60] Bernhard Scholkopf and Alexander J. Smola. Learning with Kemels: Support Vector Ma-
chines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[61] B. Zadrozny and C. Elkan. Transforming classiﬁer scores into accurate multiclass probability
estimates. In KDD, 2002.

[62] Giinther Eibl and Karl-Peter Pfeiffer. Multiclass boosting for weak classiﬁers. J. Mach. Learn.
Res., 6: 189-210, 2005.

[63] Ling Li. Multiclass boosting with repartitioning. In International Conference on Machine
Learning, 2006.

[64] Xiaojin Zhu. Semi-supervised Ieaming literature survey. Technical Report 1530, Computer
Science, University of “Wisconsin-Madison, 2005.

[65] Thorsten Joachims. Transductive inference for text classiﬁcation using support vector ma-
chines. In International Conference on Machine Learning, 1999.

142

 

[66] T. De Bie and N. Cristianini. Convex methods for transduction. In Neural Information Pro-
cessing Systems, 2004.

[67] L. Xu and D. Schuurmans. Unsupervised and semi-supervised multi-class support vector
machines. InAAAI, 2005.

[68] F. d’Alche Buc, Y. Grandvalet, and C. Ambroise. Semi-supervised marginboost. In Neural
Information Processing Systems, 2002.

[69] Nicholas J. Higham. Matrix neamess problems and applications. In M. J. C. Gover and
S. Barnett, editors, Applications of Matrix Theory, pages 1—27. Oxford University Press, 1989.

[70] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 22(8):888—905, 2000.

[71] Jen-Yuan Yeh, Yung-Yi Lin, Hao-Ren Ke, and Wei-Pang Yang. Learning to rank for informa-
tion retrieval using genetic programming. In LR4IR 2007, 2007.

[72] Zhengya Sun, Tao Qin, Qing Tao, and J ue Wang. Robust sparse rank learning for non-smooth
ranking measures. In SIGIR, pages 259-266, 2009.

[73] J. Lafferty and C. Zhai. Document language models, query models, and risk minimization for
information retrieval. In SIGIR, pages 111—119, 2001.

[74] D. M. Pennock, E. Horvitz, S. Lawrence, and C. L. Giles. Collaborative ﬁltering by personality
diagnosis. In Uncertainty in Artiﬁcial Intelligent, 2000.

[75] R. E. Steuer. Multiple Criteria Optimization: Theory, Computation and Application. John
Wiley, 546 pp. 1986.

[76] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regres-
sion. In Advances in Large Margin Classiﬁers, pages 115-132, 2000.

[77] J. Gao, H. Qi, X. Xia, and J.-Y. Nie. Discriminant model for information retrieval. In SIGIR,
pages 290—297, 2005.

[78] D. Harman. Relevance feedback revisited. In SIGIR, 1992.

[79] Mark Montague and Javed A. Aslam. Condorcet fusion for improved retrieval. In CIKM ’02:
Proceedings of the eleventh international conference on Information and knowledge manage-
ment, pages 538—548. ACM, 2002.

[80] RT. Rockafellar. Convex analysis. Princeton University Press, Princeton, NJ, 1970.

[81] Robert E. Schapire. Theoretical views of boosting and applications. In Algorithmic Learning
Theory, I 0th International Conference, ALT ’99, volume 1720, pages 13—25. Springer, 1999.

[82] J. J. Rocchio. Relevance feedback in information retrieval. 1971.

[83] Nicolo Cesa—Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge Uni-
versity Press, 2006.

143

[84] Donald A. Berry and Bert Fristedt. Bandit problems: Sequential allocation of experiments.
Chapman and Hall, 1985.

[85] John Langford and Zhang Tong. The epoch-greedy algorithm for contextual multi-armed
bandits. In Neural Information Processing Systems ’07, 2007.

[86] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold
algorithm. Journal of Machine Learning, 2(4):285-318, April 1988.

[87] Jyrki Kivinen and Manfred K. Warmuth. Additive versus exponentiated gradient updates for
linear prediction. In ACM Symposium on Theory of Computing ’95 , pages 209—218, 1995.

[88] Adam J. Grove, Nick Littlestone, and Dale Schuurmans. General convergence results for
linear discriminant updates. Journal of Machine Learning, 43(3): 173—210, 2001.

[89] Herbert Robbins. some aspects of the sequential design of experiments. Bulletin of the Amer-
ican Mathematical Society, 58:527—535. 1952.

[90] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping condi-
tions for the multi-armed bandit and reinforcement Ieaming problems. Journal of Machine
Learning Research, 7:1079—1105, 2006.

[91] Chih chun Wang, Student Member, Sanjeev R. Kulkami, and H. Vincent Poor. Bandit prob-
lems with side observations. IEEE Transactions on Automatic Control, 50:338—355, 2005.

[92] V1 jaykumar Gullapalli Computer and V1 j aykumar Gullapalli. Associative reinforcement leam-
ing of real-valued functions. In Proceedings of the IEEE Conference on Systems, Man, and
Cybernetics, 1991.

[93] Alexander L. Strehl, Chris Mesterhartn, Michael L. Littman, and Haym Hirsh. Experience-
efﬁeient Ieaming in associative bandit problems. In ICML ’06: Proceedings of the 23rd inter-
national conference on Machine Ieaming, pages 889—896, New York, NY, USA, 2006. ACM.

[94] Chih chun Wang, Sanjeev R. Kulkami, and H. Vincent Poor. Arbitrary side observations in
bandit problems. Adv. Applied Math, 34:903-936. 2005.

[95] Nicolas Meuleau and Paul Bourgine. Exploration of multi-state environments: Local measures
and back-propagation of uncertainty. Mach. Leam., 35(2):117—154, 1999.

[96] Nicole Cesa-Bianchi and Paul Fischer. Finite-time regret bounds for the multiarmed bandit

problem. In In 5th International Conference on Machine Learning, pages 100—108. Morgan
Kaufmann, 1998.

[97] Kolby Crammer. Online learning of real-world problems. In International Conference on
Machine Learning ’07, 2007 .

[98] lead Rezek, David S. Leslie, Steven Reece, Stephen J. Roberts, Alex Rogers, Rajdeep K. Dash,
and Nicholas R. Jennings. On similarities between inference in game theory and machine
learning. J. Artif Intell. Res. (JAIR), 33:259—283, 2008.

[99] Amy R. Greenwald and Michael L. Littrnan. Introduction to the special issue on learning and
computational game theory. Machine Learning, 67(1-2):3—6, 2007.

144