This is to certify that the dissertation entitled

ONLINE HANDWRITING RECOGNITION USING MULTIPLE PATTERN CLASS MODELS

presented by Scott D. Connell has been accepted towards fulfillment of the requirements for the Doctoral degree in Computer Science & Engineering.

Major professor
Date: 5/11/00

ONLINE HANDWRITING RECOGNITION USING MULTIPLE PATTERN CLASS MODELS

By

Scott D. Connell

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

2000

ABSTRACT

ONLINE HANDWRITING RECOGNITION USING MULTIPLE PATTERN CLASS MODELS

By Scott D. Connell

The field of personal computing has begun to make a transition from the desktop to handheld devices, thereby requiring input paradigms that are better suited for single-hand entry than a keyboard; recent developments in online handwriting recognition allow for such input modalities. Data entry using a pen forms a natural, convenient interface.
The large number of writing styles and the variability between them make the problem of writer-independent unconstrained handwriting recognition a very challenging pattern recognition problem. The state-of-the-art in online handwriting recognition is such that it has found practical success in very constrained problems. In this thesis, a method of identifying different writing styles, referred to as lexemes, is described. Approaches for constructing both non-parametric and parametric classifiers are described that take advantage of the identified lexemes to form a more compact representation of the data, while maintaining good recognition accuracies. Experimental results are presented on different sets of unconstrained online handwritten characters and words. In addition, a method of combining information from lexeme models built on different feature sets is described, and results are presented on both English characters and Devanagari characters. Finally, a method of writer-adaptation is described which makes use of the lexemes identified from a large group of writers to define lexemes within a small amount of data from a single writer.

© Copyright 2000 by Scott D. Connell
All Rights Reserved

ACKNOWLEDGMENTS

The author would like to sincerely thank the following people for providing assistance, support, encouragement, and inspiration during the writing of this dissertation. First, I'd like to thank my advisor Dr. Anil Jain. He has provided guidance, encouragement, opportunities, and knowledge at a level that few advisors are capable of.
His dedication to his students and his field is a true inspiration. Second, I'd like to thank the other members of my committee: George Stockman, Eric Torng, Habib Salehi, and Jayashree Subrahmonia. I'd also like to thank Dr. Subrahmonia, along with the other members of the IBM Pen Technologies group at Watson Research Center, Michael Perrone, Gene Ratzlaff, and John Pirelli, for their support and assistance. They have provided data, financial support, internships, and advice that have been extremely helpful in the completion of this dissertation. No student can survive graduate school without the help of their fellow students to discuss ideas, share opinions, and to make time spent in the lab an all-around enjoyable experience. In particular, I'd like to thank Aditya Vailaya, Salil Prabhakar, Arun Ross, Nicolae Duta, Lin Hong, Yonghong Li, Paul Albee, Friederike Griess, Anoop Namboodiri, and the other members of the Pattern Recognition and Image Processing lab. In addition, I'd like to thank the Computer Science Department staff for their valuable assistance, in particular Linda Moore and Cathy Davison. I'd also like to give special thanks to my parents, sister, and the rest of my family. They have always been there to support me, and I am very grateful. Finally, I'd like to thank all the friends I have made during my time at Michigan State University.
TABLE OF CONTENTS

LIST OF TABLES ........................ x
LIST OF FIGURES ....................... xii

1 Introduction 1
1.1 Online vs. Offline Handwriting Recognition ................ 3
1.2 Handwriting Segmentation .......................... 5
1.3 Writer-dependent vs. Writer-independent Recognition ........... 8
1.4 Pattern Class Models ............................. 11
1.5 State-of-the-Art in Handwriting Recognition ................ 13
1.6 Research Problem Statement ........................ 14
1.7 Thesis Outline ................................ 18

2 Literature Review 19
2.1 Introduction .................................. 19
2.2 Data Collection ................................ 20
2.3 Preprocessing ................................. 20
2.4 Segmentation ................................. 23
2.5 Modeling Methods .............................. 24
2.5.1 Primitive Decomposition .......................... 25
2.5.2 Motor Models ................................ 26
2.5.3 Deformation Models ............................ 26
2.5.4 Hidden Markov Models .......................... 28
2.5.5 Neural Networks .............................. 29
2.5.6 Local vs. Global Features ......................... 30
2.5.7 Writing Style-Based Models ........................ 31
2.6 Postprocessing ................................ 33
2.7 Summary ................................... 34

3 Writing Style Identification - Lexemes 35
3.1 Introduction .................................. 35
3.2 Preprocessing ................................. 36
3.3 String Matching Measure ........................... 39
3.4 Squared-Error Clustering ........................... 43
3.4.1 Selecting An Appropriate Number of Clusters .............. 44
3.5 HMM-based Clustering ............................
46
3.6 Results ..................................... 47
3.7 Summary ................................... 52

4 Non-parametric Models 53
4.1 Introduction .................................. 53
4.2 Cluster Centers ................................ 54
4.3 Training Set Editing ............................. 55
4.4 Classification ................................. 57
4.4.1 Nearest Neighbor Classifier ........................ 57
4.4.2 Decision Tree Classifier .......................... 57
4.4.3 Global Measurements ........................... 60
4.5 Results ..................................... 63
4.5.1 Datasets .................................. 63
4.5.2 Digit Recognition ............................. 63
4.5.3 Alphanumeric Character Recognition .................. 69
4.6 Summary ................................... 71

5 Parametric Models 74
5.1 Introduction .................................. 74
5.2 Hidden Markov Model Classifier ....................... 75
5.2.1 Feature Extraction ............................. 75
5.2.2 Hidden Markov Models .......................... 76
5.3 HMM-based Clustering ............................ 78
5.4 Results ..................................... 80
5.4.1 Lexeme Identification by HMM-based Clustering ............ 82
5.5 Summary ................................... 85

6 Information Fusion 86
6.1 Introduction .................................. 86
6.2 Component Classifiers ............................ 87
6.2.1 Local Features ............................... 88
6.2.2 Critical Point Features .......................... 88
6.2.3 Spatial/Directional Features ....................... 90
6.2.4 Stroke End-point Features ........................ 94
6.2.5 Center of Gravity Features ........................ 94
6.3 Classifier Combination ............................ 94
6.4 Results ..................................... 97
6.5 Summary ................................... 102

7 Word Recognition 104
7.1 Introduction .................................. 104
7.2 System Overview ............................... 105
7.2.1 Preprocessing ............................... 105
7.2.2 Modeling .................................. 110
7.2.3 Feature Set ................................. 113
7.2.4 Delayed Stroke Handling ......................... 114
7.3 Test Data Set ................................. 116
7.4 Results ..................................... 120
7.5 Summary ................................... 126

8 Devanagari Script Recognition 128
8.1 Introduction .................................. 129
8.2 System Overview ............................... 130
8.3 Results ..................................... 137
8.4 Summary ...................................
139

9 Writer-Dependent Systems 141
9.1 Introduction .................................. 141
9.2 Writer-Dependent Lexeme Identification .................. 142
9.3 Writer-Adaptation .............................. 143
9.4 Results ..................................... 145
9.4.1 Word Recognition ............................. 147
9.5 Summary ................................... 152

10 Conclusions 153
10.1 Contributions ................................. 154
10.2 Future Work ................................. 157
10.2.1 Classifier Combination for Word Recognition .............. 157
10.2.2 Devanagari Script Recognition ...................... 158
10.2.3 The HMM-based Clustering Algorithm .................. 158
10.2.4 Writer-Adaptation ............................. 159

BIBLIOGRAPHY 161

LIST OF TABLES

1.1 Some applications of handwriting recognition. ............... 3
1.2 Recent results on presegmented character recognition ........... 13
1.3 Recent results on handwritten word recognition .............. 15
2.1 Approaches to online handwriting recognition ............... 25
3.1 Number of clusters, K, chosen for 36 alphanumeric classes ........ 47
4.1 Number of clusters, K, chosen for 10 digit classes by manual selection, and by automatic selection. ....................... 64
4.2 Digit recognition rates using the nearest neighbor classifier ........ 64
4.3 Digit recognition rates using the decision tree. ...............
66
4.4 Digit recognition rates using the decision tree when global traits are treated as separate features. ....................... 66
4.5 Digit recognition using boosted decision trees. The column "No. Templates" indicates the number of reference templates available to the classifier, and the average number of templates used (number of distances that had to be calculated) per classification (shown in parentheses). 68
4.6 Alphanumeric recognition rates (36 classes) using nearest neighbor classifier. 70
4.7 Performance of nearest neighbor vs. decision tree classifiers for 36-class alphanumeric character classification. The column "No. Templates" indicates the number of reference templates available to the classifier, and the average number of templates used (number of distances that had to be calculated) per classification (shown in parentheses). 71
5.1 Overall writer-independent results for three different classification problems. 81
6.1 Characteristics of the different classifiers used ............... 88
6.2 Top 10 classification results on 57,541 alphanumeric characters. Case-dependent (62 classes) results are reported ................ 97
6.3 Top 10 classification results on 57,541 alphanumeric characters. Case-independent (36 classes) results are reported ............... 98
6.4 Classifier combination accuracies on 57,541 alphanumeric characters. Case-dependent (62 classes) results are reported ............. 100
6.5 Accuracies within top 10 choices for the 5 best classifier combinations. . 101
6.6 Top 10 classification accuracies for classifier combinations with M=2. . . 102
7.1 Words in the word classifier lexicon ..................... 118
7.2 The number of examples available and number of lexemes identified for each character class. ........................... 122
7.3 Word recognition accuracies using curvature/orientation features and a single model per character class ..................... 123
7.4 Word recognition accuracies using curvature/orientation features and multiple lexeme models per character class .................
7.5 Percentage of cases in which the correct word is within the top 10 choices for words from different writers. (The total number of words in our lexicon is 483). ..............................
8.1 Poorly discriminated character pairs. ....................
8.2 Feature sets used in each classifier. .....................
8.3 Five different classifiers used in classifier combination ...........
8.4 Percentage of test characters for which the correct class falls within the top 10 choices ................................
9.1 Writer-dependent digit recognition accuracies using different writer-adaptation methods. ...........................
9.2 Lowercase letter recognition accuracies using different writer-adaptation methods ...................................
9.3 Number of character samples available from different writers of both handprinted and cursive styles. ........................
9.4 Case-dependent word recognition accuracies after writer-adaptation. . . .
9.5 Case-independent word recognition accuracies after writer-adaptation. . .
9.6 Writer-adaptation for a single writer with a larger set of training examples. Case-independent accuracies are reported. ...............
LIST OF FIGURES

1.1 An example of delayed strokes (shown as dashed lines). In cursive handwriting, delayed strokes are particularly troublesome because it is unclear to which character they belong. ..................
1.2 Levels of difficulties in handwriting segmentation [103]. .........
1.3 The Graffiti character set. Reproduced from [32]. .............
1.4 The CrossPad. ................................
1.5 Recognition results using the CrossPad in writer-independent mode. The left column contains examples of handwritten sentences, while the right column contains the recognition results for those sentences. Recognition errors are marked in bold .......................
1.6 Variations within three different writing styles for the lowercase character "r". Each box, (a), (b), and (c), shows three examples of the same writing style, with a different style (writer) represented in each box. The beginning of each stroke is shown by a dot. .............
1.7 Online word recognition system. The flow of data during training is shown by the dashed line arrows, while the data flow during recognition is shown by the solid line arrows ......................
2.1 Differences in sample point spacings for handwritten characters. Each dot indicates the location of a sample point. ................
2.2 The three regions used for size normalization. ...............
2.3 The amount of slant in handwriting for two different writers ........
2.4 Metastroke categories and an example decomposition of a word into metastrokes [34] .................................
2.5 Feature extraction. (a) Point-by-point feature extraction (b) sliding window feature extraction.
A sequence of features may be generated from a single sample point, or from a group of points that fall within a window which slides along the point sequence. .............................
3.1 Preprocessing steps for handwritten characters. ..............
3.2 Preprocessing steps for character 'a' with sample points marked as dots: (a) original, (b) after size normalization and centering, (c) after first resampling, (d) after Gaussian filtering, and (e) after the final resampling. ..................................
3.3 Examples of characters (a) before and (b) after preprocessing with critical points marked as dots. ..........................
3.4 The alignment between two digits, and the resulting matching scores for (a) a '2' compared with a '3', (b) comparing two '3's. Dotted lines are drawn to indicate corresponding aligned points between the two digits. 41
3.5 Mean squared error (MSE) for different values of K, the number of clusters, for the character 'm'. ........................... 45
3.6 Examples of characters in our database. .................. 48
3.7 Davies and Bouldin index, R, for the digit "0" over different values of K, the number of clusters. .......................... 49
3.8 Examples of the patterns closest to each cluster center found for each character class. .............................. 50
3.9 Two different lexemes of the digit '2' as identified by our method. The digit in the center of each cluster shown here is the training sample that lies closest to the cluster center. ..................
51
4.1 Online character recognition system. The flow of data during training is shown by the dashed line arrows, while the data flow during recognition is shown by the solid line arrows. .................... 55
4.2 An example of part of a decision tree using similarity features. Distances from a test character to a reference character are indicated in the tree by the name of the reference character. This is shown as dx.y, where x is the digit class, and y is the index of that reference digit from the training set. ................................ 59
4.3 An example of part of a decision tree using difference features. The reference pair used to make a decision at a node is labeled dx1.y1-dx2.y2, where xi is the digit class, and yi is the index of that reference digit from the training set ............................ 61
4.4 Similarity and difference features. (a) Similarity features: while training patterns a and b are grouped based on their similarity to template X, examples c and d, although very different from each other, will be grouped together; (b) Difference features: training patterns on each side of the decision threshold (difference between template X and template Y) are more similar to each other than to patterns on the other side. .................................... 62
4.5 Digit recognition by combining multiple tree classifiers using boosting. Trees are constructed using the similarity and difference features based on cluster center reference sets, and on the edited reference set, and similarity features based on the full training set ............. 67
4.6 Boosting for 36 alphanumeric classes. A comparison of decision trees built using similarity features and difference features, both based on the cluster center reference set, and a combination of the two feature sets. 73
5.1 The topology of our hidden Markov models. Note that only single or double state skipping transitions are allowed ............... 76
5.2 Performance of HMM-based clustering over 10 iterations. The "baseline" case shows the accuracies obtained at each iteration of the algorithm when the lexemes at iteration 0 are defined using string matching as described in Section 3.4. The results on this initial set of clusters is also shown as a horizontal line (labeled "String Match") for comparative purposes. Three additional cases in which the lexemes at iteration 0 were randomly assigned are shown as "Random 1", "Random 2", and "Random 3". ............................... 83
6.1 An example of Critical Point features: (a) calculation of structural features θa and θb, (b) y (in relation to baseline shown), and critical point (Δx, Δy), (c) calculation of local neighborhood features Δx, Δy, and ω. 91
6.2 An example of spatial/directional features extracted from the single stroke character d, a ".", and the two-stroke digit 4. Intensities indicate the magnitude of the features in each sector of the 5 x 5 grid in (a) horizontal direction, (b) vertical direction, (c) southwest-northeast diagonal, and (d) northwest-southeast diagonal. The original character is also shown (e) in relation to the baseline. .................. 93
7.1 Word recognition system. .......................... 106
7.2 An example of baseline/midline detection. ................. 108
7.3 Slant detection and correction (a) before slant correction and (b) after slant correction ............................... 109
7.4 An example of delayed stroke detection. Strokes that have been identified as potentially delayed are shown as dotted lines ............. 110
7.5 Example of word modeling by connecting character models. This grammar has the ability to recognize the words lets, less, goes, and give. Note that each of the models shown from a single class (e.g., all the "s" models) are actually only stored as a single model ............ 112
7.6 Variations between writers for handwritten words. (a) Writer 1, (b) Writer 2, (c) Writer 3, (d) Writer 4, and (e) Writer 5. ............. 119
7.7 Some examples of handwritten words containing "overwriting". ..... 120
7.8 Some word recognition examples: (a) correctly classified words and (b) misclassified words and the incorrect label assigned by the classifier. . 125
8.1 A sentence written in Devanagari ...................... 129
8.2 Forty major Devanagari characters with English pronunciations and category numbers ................................ 131
8.3 Some handwritten examples of the 40 different Devanagari characters. . . 132
8.4 Prototypical examples of each lexeme from each of the 40 Devanagari characters .................................. 134
8.5 Features for classifiers 1 and 3. The "+" marks the center of the character. Features are calculated between every pair of sample points. Classifier 3 quantizes θ and measures the total distance traveled in each direction for each box in the grid. ......................... 136
9.1 An overview of the model training process. Dashed lines show the flow of events for writer adaptation. The number of states has reduced from 16 to 14 ...................................
9.2 The parameters of a hidden Markov model trained for a lexeme of the character 'e' before and after writer adaptation. ............
9.3 Lexemes identified in digits written by "Writer #5", and the number of occurrences matching that lexeme in that writer's data. Note that for this writer, we have identified 15 lexemes (out of a total of 26 lexemes for the writer-independent data). ....................

Chapter 1

Introduction

The amount of information that can be stored and processed by desktop computers is increasing at a tremendous rate. Given this rate of increase, the ease with which information can be exchanged between a computer and a user is becoming a serious bottleneck. In order to be effective, user interfaces should be i) efficient and ii) natural to the user, thereby requiring little or no learning curve for the user. While there has been much progress on how data is presented to a user, such as data visualization tools [2], the primary mode of data input from a human to a computer is still the keyboard. Speech recognition and handwriting recognition enable other, more natural, forms of man-machine communication, which have recently been integrated in many consumer products (e.g., Apple's Newton MessagePad, the CrossPad by A.T. Cross and IBM, Interactive Voice Response Units (IVRUs) used by many telephone companies, etc.). However, for these input modalities to be economical and user-friendly, their recognition accuracy must be sufficiently high (where this is largely domain dependent) such that the user needs to make only a minimal number of corrections.
Table 1.1: Some applications of handwriting recognition.

Application | Characteristics | Typical performance
Personal Digital Assistant (PDA) | … recognition; unnatural character set (e.g., graffiti); real-time speed | 99% on many tasks
Handwritten Notes (online) Recognition | Segmentation into individual words; large vocabulary of word categories; can be performed in batch mode, therefore need not be real time | Approx. 85%-95% word accuracy if trained on the user (writer-dependent), given an adequate amount of data
Postal Address Recognition | High-volume; must be real-time; requires an extremely low error rate, therefore reject rate is often high | > 99% on ZIP code, city and state, with approx. 25%-35% reject
Online Signature Verification | Must have a low false rejection rate, while maintaining a low false accept rate | 2%-5% equal error rate

1.1 Online vs. Offline Handwriting Recognition

At the highest level, handwriting recognition can be broken into two categories: offline and online. Offline handwriting recognition focuses on documents that have been written on paper at some previous point in time. Information is presented to the system in the form of a scanned image of the paper document. In contrast, online handwriting recognition (by its original definition) focuses on tasks where recognition needs to be performed at the time of writing. This requires the use of special equipment, such as a digitizing tablet, to capture the strokes of the pen as they are being written. The trace of a writer's pen is stored as a sequence of points sampled at equally spaced time intervals. The information captured for each sample is the (x, y) coordinates of the pen on the digitizing tablet. While this pen trace could be used
Figure 1.1: An example of delayed strokes (shown as dashed lines). In cursive handwriting, delayed strokes are particularly troublesome because it is unclear to which character they belong.

to construct a static image of the writing, thus allowing offline character recognition techniques to be applied, it has been shown [63] that the information about the pen dynamics can be used to obtain better recognition accuracies than static data alone. Therefore, it is beneficial to capture the data in an online form, even if the real-time processing requirements can be relaxed.

Another advantage of online handwritten data over offline data is the availability of stroke segmentation and order of writing. Ink in static images must first be separated from the image background, creating a potential source of error. Pen devices used in online recognition include the ability to detect the states of pen-down (when the pen is touching the tablet) and pen-up (when the pen is not touching the tablet). A single stroke is defined as the sequence of sample points occurring between consecutive pen-down and pen-up transitions. However, a complication occurs when a stroke is added to a character in a word after the rest of the word has already been written, such as the cross of a 't' or an 'x', or the dot of an 'i' or a 'j'. These types of strokes are called delayed strokes (see Figure 1.1).
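The representation just described — equally time-spaced (x, y) samples partitioned into strokes at pen-down/pen-up transitions — can be sketched in a few lines. This is a minimal illustration only; the sample format with an explicit pen-state flag is an assumption for this sketch, not the actual data format used in the thesis.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sample format: (x, y, pen_down), captured by the digitizing
# tablet at equally spaced time intervals.
Sample = Tuple[float, float, bool]

@dataclass
class Stroke:
    # (x, y) pen positions recorded while the pen touches the tablet.
    points: List[Tuple[float, float]]

def split_into_strokes(samples: List[Sample]) -> List[Stroke]:
    """Group samples into strokes: each stroke is the maximal run of
    pen-down samples between consecutive pen-down and pen-up transitions.
    Pen-up samples (hover positions, if any) are discarded."""
    strokes: List[Stroke] = []
    current: List[Tuple[float, float]] = []
    for x, y, pen_down in samples:
        if pen_down:
            current.append((x, y))
        elif current:
            # Pen-up transition: the current stroke is complete.
            strokes.append(Stroke(current))
            current = []
    if current:
        # Trailing stroke with no final pen-up sample.
        strokes.append(Stroke(current))
    return strokes
```

With this split, detecting delayed strokes would be a separate step layered on top, since a delayed stroke is an ordinary stroke distinguished only by when and where it was written relative to the rest of the word.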
1.2 Handwriting Segmentation

A necessary step of handwriting recognition is the process of segmenting the input in such a way that we can present isolated words or characters to the recognition system. With offline data, segmentation must be addressed on a pixel-by-pixel basis, first separating writing from background, then separating words and characters. The process of segmentation is relatively simpler for online handwriting recognition since there is no "signal" when the user is not writing. Further, we have the additional dimension of time in which characters or words can be separated. Even with this additional information, the problem of reliably identifying online stroke "clusters" that form individual characters or words can be difficult. Figure 1.2 shows the five different categories of character separation. In particular, cursive handwriting is a style of writing in which a number of consecutive characters in a word may be written using a single stroke (last two rows in Figure 1.2), thus making the problem of character segmentation very difficult. The more difficult the character segmentation problem, the greater the opportunity for misclassifications within a word, and the more a recognition system must rely on higher level information about the words (and language) that can be formed to restrict the segmentation hypotheses.

The most difficult level of segmentation is when cursive and handprinted characters are mixed within the same word. In this case we cannot count on the beginnings and endings of strokes to define possible character segmentations, nor do they always provide reliable information for word segmentation. When the type of character and word separation is not known a priori, such writing is referred to as unconstrained.
Figure 1.2: Levels of difficulties in handwriting segmentation [103].

When we impose the restriction that the writer must write each character in a separate box, character segmentation becomes trivial and the problem of recognition is reduced to the much simpler character recognition problem. This has motivated some countries to require postal zip code digits to be written in boxes to increase the effectiveness of automatic mail sorting. It should be noted that for online handwritten data, an equivalent form of this level of character separation is when the characters are separated by some predefined minimum amount of time, forming a temporal box as opposed to a spatial box.

An even more restrictive form of handwriting exists in which the writer is not only expected to make character segmentation explicit, but the motion of the pen to form each character is highly confined such that class discrimination is very simple. In fact, the formation of some characters may have little similarity to their corresponding Roman alphabet characters. Examples of such alphabets are Graffiti [75] and Jot [51]. Currently, many commercial systems exist (Palm Pilot, Apple Newton,

Figure 1.3: The Graffiti character set.
Reproduced from [32].

Messagepad, Windows CE machines, etc.) which make use of these alphabets. The recognition accuracies reported using such alphabets tend to be very good (> 99%); however, they are not practical for applications in which writing speed is an issue (e.g., note-taking). Further, these alphabets tend to have a somewhat large learning curve. Other commercial systems make use of more natural alphabets, but they restrict writing to handprinted characters and single isolated words (smARTwriter by Advanced Recognition Technologies, Inc. [1]). Still other products exist (CrossPad, Figure 1.4, by IBM and A.T. Cross, Inc., and Calligrapher by ParaGraph, Intl. [76]) which attempt to recognize unconstrained handwriting. However, these systems do so at the cost of a lower recognition accuracy, and often include the option of more restricted modes of writing (such as cursive-only, handprint-only, and digit-only modes) for increased recognition accuracy. Figure 1.5 shows some examples of unconstrained handwritten sentences written on the CrossPad, and their recognition results using the CrossPad's writer-independent mode of recognition.

1.3 Writer-dependent vs. Writer-independent Recognition

A handwriting recognition system can further be broken down into the categories of writer-dependent or writer-independent. A writer-independent system is trained to recognize handwriting in a wide variety of writing styles, while a writer-dependent system is trained to recognize handwriting of a single individual. A writer-dependent

Figure 1.4: The CrossPad.
Figure 1.5: Recognition results using the CrossPad in writer-independent mode. The left column contains examples of handwritten sentences, while the right column contains the recognition results for those sentences. Recognition errors are marked in bold.

system works on data with a smaller variability, and therefore achieves a higher recognition rate. A writer-independent system sacrifices some of the recognition accuracy that could be obtained for a single user for the ability to handle a much greater variety of writing styles. On the other hand, when training a writer-independent system it is easy to obtain a large amount of training data, since many writers are included in the data collection process, while expecting a single writer to provide a similar amount of data to train a writer-dependent system is not feasible.

At first, it would seem that the required data collection for writer-dependent training should not be difficult if we start with a working writer-independent system and adapt it to the writer-dependent problem (known as writer-adaptation). Since the user will be constantly providing new data (writing) to the system as he or she uses it, why can't this data be used for incremental learning?
The difficulty arises in that the "ground truth" (true categories) of this data is unreliable, since it has been provided by the very writer-independent system that we wish to improve upon. Therefore, the data must undergo a verification process in which the user looks over the data and category labels and corrects them. This is especially challenging for cursive handwriting, in which the segmentation points between characters chosen by the writer-independent system might not be reliable either. The CrossPad's recognition engine allows for writer-adaptation by requesting that the writer provide a set of predefined characters, after which the user must go through a verification process to ensure that the predefined labels are correctly matched up with the handwritten characters before training begins. The amount of data that can be collected in this fashion is limited by the patience of the writer. The smARTwriter system [1] adapts by collecting those character examples that the user has explicitly corrected during normal use, and which therefore can be assumed to be correctly labeled. While this method does not require the writer to spend time explicitly providing training data, it further limits the amount of data that can be collected in a short period of time.

1.4 Pattern Class Models

To perform classification of handwritten characters, a recognition system must attempt to leverage between-class variations while accommodating potentially large within-class variations.
If a recognition system is to work well for a large number of different writers (a writer-independent system), this within-class variation can be very large. The variance within a particular character class can be thought of as made up of two components: the variations between writing styles, and the variations within a writing style. This is illustrated in Figure 1.6. We refer to these different ways of writing a character as lexemes [96], patterned after the corresponding speech recognition term, phoneme. Other work has appeared in the literature in which these character subclasses are referred to as allographs [82, 106, 115]. If these lexemes can be identified, then it is often possible to increase the recognition accuracy by modeling each lexeme separately, creating multiple models per character class. An automated method of reliably determining the lexemes that are appropriate for a certain data set is considered a difficult problem. Currently, the role of lexemes in writer-adaptation has remained largely unexplored.

Figure 1.6: Variations within three different writing styles for the lowercase character "r". Each box, (a), (b), and (c), shows three examples of the same writing style, with a different style (writer) represented in each box. The beginning of each stroke is shown by a dot.

Table 1.2: Recent results on presegmented character recognition.

Author | Method | Accuracy | Comments
Li & Yeung [60] | Nearest Neighbor using Elastic Matching | Digit classifier: 96.3% on 300 digits. Upper- and lowercase classifier: 87.1% on 780 lowercase, 92.9% on 780 uppercase. | 0.35 sec/char. 180 reference templates selected from 840 examples.
Scattolin & Krzyzak [91] | Nearest Neighbor using Weighted Elastic Matching | 88.67% on 1650 digits. | 33 writers used. 293 reference templates.
Yaeger et al. [114] | Combined Online and Offline Neural Network Classifier | 95-class classifier (52 chars, 10 digits, 33 symbols): 86.1%. | 45 writers used.
Chan & Yeung [18] | Manually Designed Structural Models | 97.4% on 9300 upper- and lowercase chars. | 150 writers used.
Li et al. [59] | Hidden Markov Model | 92% (after 3.9% reject) on 3126 digits. | Unipen data used.
Prevost & Milgram [82] | Combined Online and Offline Nearest Neighbor Classifiers | Digits: 98.70%; uppercase: 97.81%; lowercase: 96.84%. | Total of 11,162 chars. from Unipen data used.

1.5 State-of-the-Art in Handwriting Recognition

Table 1.2 presents some recent results from the literature for the isolated character recognition problem, where segmentation is assumed to have been correctly done. It should be stressed that these different studies have used different datasets, which makes a comparison of these results difficult. Recently, results have been reported using data from the Unipen dataset [37]. The Unipen project was formed to gather data from many different organizations to eventually form a publicly available dataset which would enable researchers to report results that are comparable to each other. At the time of this writing, however, this database was not publicly available.
Table 1.3 presents some recent results for the problem of handwritten word recognition. Again, the lack of common datasets makes it difficult to compare these results. In many of these approaches, the character models are chained together to form words, which makes the recognition of words from a very large vocabulary (over 20,000 words) possible without requiring explicit examples of all these words for training. However, it should be noted that the results reported are on test sets containing words from only a fraction of the total word lexicon that can be recognized by these handwriting recognition systems. Generally, the best accuracies reported in this table are for writer-dependent systems. One writer-independent system that achieves a very good accuracy is that of Manke et al. [64], which obtains 91.4% accuracy for a 20,000 word vocabulary for 40 writers. It is not clear, however, if the training data is obtained from an independent set of writers. Hu et al. [44] achieve a writer-independent accuracy of 94.5% on 3,823 unconstrained word examples, but this is achieved by limiting the lexicon size to only 32 words. In another study, Hu et al. [43] achieved an error rate of 9.0% for a lexicon of 1,905 lowercase words. Examples of how difficult the unconstrained word recognition task can become can be seen in the 64.4% [95] and 81.1% [72] accuracies achieved by the other recent writer-independent systems.

Table 1.3: Recent results on handwritten word recognition.

Author | Method | Accuracy | Comments
Nathan et al. [72] | HMM | 81.1% on unconstrained word examples. | 25 writers. Writer-independent. 21K word vocab.
Manke, Finke, and Waibel [64] | Multi-state TDNN | 91.4% on 2,500 examples. | Writer-independent. Lowercase words. 40 writers. 20K word vocab.
Rigoll et al. [56, 89] | HMM/NN hybrid | 95% on 200 examples. | Writer-dependent. 30K word vocab. 16 writers.
Hu, Rosenthal, and Brown [43] | HMM | 91% on 2,000 examples. | Writer-independent. Lowercase words. 1,905 word vocab.
Seni et al. [95] | TDNN | 62.4% writer-independent, 91.6% writer-dependent. | 9 writers. 466 cursive word examples. 21K word vocab.
Hu, Brown, and Turin [44] | Stroke-level HMM | 94.5% on 3,823 unconstrained word examples. | Writer-independent. 18 writers. 32 word lexicon.

1.6 Research Problem Statement

While much success has been obtained on simplified handwriting recognition problems, such as isolated word recognition, the goal of fully unconstrained, large vocabulary handwriting recognition still remains a challenge due to the amount of variation found in characters as they appear within words. The goal of this thesis is to define a methodology to model online handwritten character data through the identification and representation of disparate writing styles (lexemes), and thereby better model handwritten words. Due to the temporal nature of online data, this work has possible application to the domain of speech recognition as well. The benefits of this data segmentation to both non-parametric and parametric classifiers are explored. In the latter, we present a system for word-level recognition, which makes use of character models and a grammar to define word constructs, and investigate at what level of
[Block diagram: Online Word Input -> Preprocessing -> Feature Extraction -> Lexeme Identification (training) -> Lexeme Models -> Classifier (Nearest Neighbor, Decision Trees, Hidden Markov Models) -> Classification Results. The full training set feeds the training path.]

Figure 1.7: Online word recognition system. The flow of data during training is shown by the dashed line arrows, while the data flow during recognition is shown by the solid line arrows.

the training process lexemes are useful. A block diagram of this system is presented in Figure 1.7. In addition, we explore the identification and usefulness of lexemes in writer-independent and writer-dependent recognition, as well as writer-adaptation from the former to the latter. Modeling is done at the character level, and word recognition is accomplished by chaining together character models. Therefore, results will first be presented at the character level, and later at the word level. The scope of this thesis consists of the following problems:

1. Automatic identification of character writing styles from a pool of multiple-writer data, in which the number of styles present in the data is not known a priori. We also explore the dependence of the identification method on the type of classifier that will make use of these lexemes.

2. The use of lexemes in optimizing non-parametric classifiers, in which we wish to reduce classification time while maintaining recognition accuracy.

3. The use of lexemes in optimizing the recognition accuracy of parametric classifiers.

4. Adaptation of a writer-independent system to a writer-dependent system based on a lexeme labeling and subselection method.
This has the potential to improve the system's accuracy on an individual writer while requiring only a minimal sampling of the writer's handwriting as training data.

5. The identification of lexemes within a set of data as driven by the parametric models that will represent these lexemes.

6. The application of our handwriting recognition methodologies to the domain of Devanagari script recognition.

7. The fusion of different information sources to improve classification accuracies.

1.7 Thesis Outline

The organization of the thesis is as follows. Chapter 2 presents a literature review of online character and word recognition, including work related to subclass models and identifying writing styles. Our method of identifying lexemes, which makes use of a string matching measure and the K-means clustering algorithm, is presented in Chapter 3. Methods of non-parametric character modeling, in which prototypical examples are identified for each lexeme and used to reduce classification times, are presented in Chapter 4. Chapter 5 demonstrates how lexeme models can be used to increase recognition accuracies of a parametric classifier, and how the representation of these models can be used to drive the identification of the lexemes. In Chapter 6, the topic of information fusion is discussed as a means of combining classifiers to achieve better handwriting recognition accuracies. Our unconstrained word recognition system is presented in Chapter 7. In Chapter 8 we show the ability of our handwriting recognition system to generalize to other languages, in particular Devanagari script.
Issues involved in creating a writer-dependent handwriting recognition system, in which lexemes are involved, and how these lexemes can be used to leverage knowledge from a writer-independent system to perform writer-adaptation, are discussed in Chapter 9. Finally, some conclusions and future work are presented in Chapter 10.

Chapter 2

Literature Review

2.1 Introduction

This chapter describes some of the issues and techniques involved in online handwriting recognition. For a more complete survey of techniques applied to the recognition of this type of handwritten data, a number of survey papers exist [73, 79, 81, 105, 112]. Online handwriting recognition can be broken into the following steps: data collection, preprocessing, segmentation, feature extraction, modeling/recognition, and postprocessing. Section 2.2 describes data collection devices and characteristics. Common preprocessing techniques are presented in Section 2.3, and segmentation methods, at both the word and character level, are discussed in Section 2.4. Modeling techniques used for handwriting recognition, and their corresponding feature sets, are described in Section 2.5, including a discussion of defining multiple models per class. Finally, postprocessing techniques which make use of higher-level knowledge about words are presented in Section 2.6.

2.2 Data Collection

Online handwritten data must be collected using a special device.
Typically, a digitizing tablet is used that samples the location of a stylus on the tablet at the rate of approximately 80 - 200 times per second. This generates a sequence of (x, y) coordinates which define the trace of the pen over time. The stylus will typically have a switch to detect pen-down (when the pen is touching the tablet) and pen-up (the pen is not touching the tablet) status. Some devices make use of a pressure sensor on the tip of the pen and collect pressure information in addition to the location information. Other devices that can be used to collect data include touch screens, such as those found on many PDAs, and the CrossPad. The CrossPad is a device that receives an RF signal from a special pen. A receiving grid inside the pad allows the CrossPad to detect the location of the pen. The advantage of such a device is that the writing surface is not part of the sensor, and is actually the paper (e.g., a legal pad), making the writing experience more natural.

2.3 Preprocessing

The goal of preprocessing is to reduce or eliminate some of the variations in handwriting that are not useful for pattern class discrimination. For a good survey of preprocessing techniques, see [35]. Strokes captured by a digitizing tablet tend to contain a small amount of noise, most likely introduced by the digitizing device, causing the sampled curves to be somewhat jagged. In order to reduce this noise, some form of smoothing is often applied using a spline filter [43], or a Yulewalk

Figure 2.1: Differences in sample point spacings for handwritten characters. Each dot indicates the location of a sample point.
Often the beginning or ending of strokes may have small hooks that are created by a quick motion that occurs when the pen is lowered or raised. Dehookz'ng is the process of detecting and removing these hooks. Most preprocessing techniques resample the data so that points are equi-distant in Space rather than time. This provides a time normalization, without which slowly written strokes would contain a much larger number of sample points than quickly written strokes of the same shape (see Figure 2.1). The fewer the restrictions placed on a writer, the larger is the intra—class variations. In general, the :1: — y space that handwritten characters occupy can be divided into 3 regions (see Figure 2.2): 1. A middle region from the baseline of writing (the line which supports the writ- ing) to the midline (line that sits on top of writing which all ascenders extend above) . 2. An ascender region that extends from the midline to the top of the ascenders. 3. A descender region that extends from the baseline to the bottom of the 21 desrerttltrv" If a miter was re thes. regions. thi it would ensure t pthceeds from lei tariations in size. large peak in a i the middle regiu: that the middle Skew can be (1”, the horizontal di- peah is the great Another forth W Figure 2.3 it t" l .641 )e rather Rs: I“ ' ..th'rnttng the ax ascender region baseline I descender regi 0 Figure 2.2: The three regions used for size normalization. descenders. If a writer was required to write within a set of horizontally running lines which mark these regions, this would reduce the amount of variation in character size. In addition, it would ensure that the baseline is not skewed upward or downward as the writing proceeds from left to right. However, without putting such restrictions on the writer, variations in size and skew needs to be reduced by first detecting these regions. A large peak in a horizontal projection profile of the writing can be used to indicate the middle region of writing [7]. 
The height of the writing can then be adjusted such that the middle region has a prespecified standard height. The degree of baseline skew can be detected by creating projection profiles at a number of orientations off the horizontal direction, and searching for the profile in which the average slope of the peak is the greatest [84]. Other methods make use of the Hough transform [38]. Another form of variation in handwriting is the amount of slant in the characters (see Figure 2.3). While the variance of the slant may be very small for a single writer, it can be rather large for a writer-independent system. Slant can be detected by estimating the average slope of up and down strokes in the writing [14].

Figure 2.3: The amount of slant in handwriting for two different writers.

2.4 Segmentation

Depending on the form in which data is presented to a recognition system, some segmentation of the input may be required. Data which is presented in the form of a single word will require character segmentation if writing is modeled at the character level, while data which is presented as multiple words first requires a method of segmenting the data into individual words. The input may be a line of text, in which case the segmentation points can be defined along the x-axis, or it may be an entire page, requiring some type of page decomposition technique as is done in document analysis [69, 74, 113]. Most handwriting recognition systems focus on either individually segmented words or individual characters. While it may be possible to train full-word models for a small vocabulary of words, this is not feasible for large vocabularies (20,000 words or more) due to the amount of data that must be collected to train each model.
Since each word is constructed from a subset of the character alphabet, it is much more efficient to classify words using character models. The degree of difficulty in segmenting a word into characters is dependent on the amount of separation that can be expected between the characters (see Figure 1.2). For well-separated characters, simple spatial gap thresholds can be used to segment characters. For overlapping or connected characters, a more sophisticated method is required. One method begins with an oversegmentation in which candidate segmentation points are selected [11]. A graph of possible segmentation sequences is constructed by connecting consecutive segments to form sequences of potential whole characters. Each of these combined segments is then classified into a character class by the recognizer, and the "best" character sequence is identified during post-processing. A further extension of this approach is the method of segmentation during recognition, as is commonly used with hidden Markov models. In this method all sample points along the handwritten word are potential segmentation points. Character models are chained together in such a way that all possible words can be formed by some path through the graph. The path that generates the best recognition score also implicitly defines the sequence of characters in the word. This method of combining character models to form word models using a grammar will be described further in Section 2.6, and then demonstrated in Chapter 7.
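The simple gap-threshold case mentioned above can be sketched as follows. The stroke representation (lists of (x, y) points in writing order) and the threshold value are illustrative assumptions, not the thesis's parameters:

```python
def segment_by_gaps(strokes, gap=15.0):
    """Group strokes into characters: a stroke starts a new group when the
    horizontal gap to the rightmost point seen so far exceeds `gap` tablet
    units. Assumes left-to-right, well-separated writing, with strokes
    given in writing order (threshold is an illustrative assumption)."""
    groups = []
    right = None                           # rightmost x seen so far
    for stroke in strokes:
        xs = [x for x, _ in stroke]
        if right is None or min(xs) - right > gap:
            groups.append([stroke])        # gap large enough: new character
        else:
            groups[-1].append(stroke)      # close or overlapping: same character
        right = max(xs) if right is None else max(right, max(xs))
    return groups
```

As the text notes, such a rule breaks down for overlapping or connected characters, which is what motivates the oversegmentation and segmentation-during-recognition approaches.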
2.5 Modeling Methods

A number of recognition systems have been built using (i) primitive decomposition [10, 34], (ii) motor models [80, 93], (iii) deformation models such as local and global affine transformations [111] and elastic matching [18, 21, 103], (iv) hidden Markov models [8, 10, 44, 71, 89], and (v) neural network architectures such as time-delay neural networks [36, 65, 92, 95]. The advantages and disadvantages of each of these techniques are summarized in Table 2.1.

Table 2.1: Approaches to online handwriting recognition.

Technique | Advantages | Disadvantages
Primitive Decomposition [10, 34] | Powerful high-level features. | Not very robust to large variations in writing style.
Motor Models [41, 80, 93] | Takes advantage of pen dynamics. | May lack robustness when writing style variations are large.
Deformation Models [60, 77, 91, 103, 111] | Works very well for writer-dependent data. Does not require a relatively large amount of training data. | Does not generalize well for writer-independent tasks. Classification time grows linearly with the number of training examples.
Hidden Markov Models [8, 101] | Models temporal relationships well. | Requires a large amount of training data.
Neural Networks [65, 92, 114] | Classification time is fast. | Does not model temporal relationships very well.

2.5.1 Primitive Decomposition

Primitive decomposition [10, 15, 34, 61] identifies sub-strokes that form common building blocks for characters. Examples of such building blocks are loops, dots, crossovers, arcs, ascenders, and descenders. This method generally requires a pre-segmentation of the strokes into sub-stroke pieces. Guberman et al.
[34] perform word recognition by segmenting the sequence of strokes forming the word into sub-strokes referred to as metastrokes (see Figure 2.4). This sequence of metastrokes is compared to metastroke sequences in a dictionary of words, and a scoring mechanism is used to determine the word identity. Bercu and Lorette [10] have applied a hidden Markov model to the recognition of words represented as a sequence of primitives. Chan et al. [15] have identified primitives as line segments belonging to one of the following categories: line, counter-clockwise curve, clockwise curve, loop, or dot. The sequence of primitives is matched to a template using elastic matching.

2.5.2 Motor Models

Motor models [41, 80, 93] are a technique commonly used in what is known as Analysis by Synthesis, in which models of stroke segments are created along with rules for connecting them to form characters. Motor models represent these stroke segments as parameterized models of the motion of the pen tip, simulating the physical properties of human hand motion.

2.5.3 Deformation Models

Deformation models make use of a set of templates and a method of deforming these templates to represent the data. The amount of deformation required to match a template to a test example indicates the degree of match. Deformation model techniques have also been used in offline digit recognition [50]. Elastic matching [60, 91, 103] works on the sequence of sample points directly by searching for an alignment of data points between an unknown character and the stored templates.
The distance between an unknown character and a template is then taken as the sum of the distances between aligned points. The most similar template (that is, the one with the shortest distance from the unknown character) defines its identity. Tappert [103] uses the Euclidean distance between the (x, y) coordinates of two aligned points as their interpoint distance, and computes the best alignment using dynamic programming.

Figure 2.4: Metastroke categories and an example decomposition of a word into metastrokes [34].

Affine transformation is a technique which has been used for classification of Kanji characters [110, 111]. The difference between two characters is broken into two components: a stroke-based affine transformed component, and a residual component which is the point-to-point distance between the affine transformed test character and a template. The final distance is then calculated as a combination of the sum of inter-character point distances between the original test character (before affine transformation) and a template, and the residual component between the two after affine transformation. Pavlidis et al. [77] have used a physics-based approach to online word recognition.
This method first extracts critical points, which they define as points of high and low curvature and points of inflection, and represents each character as the vector of these points. The distance between two characters is then measured by morphing one vector to another, and measuring the deformation energy defined by the required amount of stretching and compression of the line segments between critical points.

2.5.4 Hidden Markov Models

Hidden Markov models [57, 87] are often used in a similar fashion as elastic matching in that they represent the data in terms of its temporal sequence. Rather than representing a character or word as a group of templates, however, a hidden Markov model is trained to represent the population of data for each class. Hidden Markov models were first applied to speech recognition [3, 4, 45, 58] with great success. Makhoul et al. [101] have trained HMMs on sequences of two-dimensional features, writing angle and change in writing angle, for each sample point.

Figure 2.5: Feature extraction. (a) Point-by-point feature extraction; (b) sliding window feature extraction. A sequence of features may be generated from a single sample point, or from a group of points that fall within a window which slides along the point sequence.

Many other approaches have made use of a sliding window: a window of sample points for which features are extracted, which slides along the sample point sequence, thereby producing a sequence of features. Bellegarda et al. [8] train HMMs on a sequence of features computed for each of a number of frames.
Frames are defined as a collection of 21 sample points centered at each x-axis or y-axis extremum along a written stroke.

2.5.5 Neural Networks

Time delay neural networks [36, 65, 92, 95] are often used to recognize characters or character segments as a sliding window passes over them. Features are extracted for the sample points in the sliding window and passed to the input layer of a feed-forward neural network. The array of output nodes of this network corresponds to the different possible class identities that may be assigned to a sequence of points within the input window. The activation level of each output node corresponds to the likelihood that the sequence of points in the sliding window matches the corresponding identity for that output node. This then produces a sequence of likelihood values, which can be used to obtain the optimal sequence of characters. Manke and Bodenhausen [65] use a non-linear dynamic time warping algorithm to produce an alignment of words to the sequence of output values from the network. Other methods involving neural networks make use of hidden Markov models to form hybrid systems [9]. Schenkel et al. [92] have trained hidden Markov models for each character to model the sequence of output values from a neural network. Similarly, Rigoll et al. [89] have trained hidden Markov models in which the continuous densities of the state observations are represented using a neural network.

2.5.6 Local vs. Global Features

The focus of the features used in each of these methods can be characterized by the type of information provided (local vs. global) with respect to a word to be recognized.
Elastic matching is a point-oriented method, and therefore the information used is considered very local; features used in HMMs and TDNNs are typically local at the level of a sliding window; and primitive decomposition focuses on higher-level sub-structures of a word. Global features are then characterizations of the word as a whole, such as its size or the presence or absence of descenders or ascenders, irrespective of their location in the word. Local features tend to be more robust than global features, since they can more easily handle shape variations, while global features tend to provide more information about the word's identity when the writing is well behaved. Hu et al. [43] combine low-level and high-level features by representing cusps, crossing points, and loops as point-oriented features which measure a point's distance from a high-level feature. Nalwa [70] proposes a technique that uses information at both local and global levels of a written word, and applies this to the problem of signature verification. Local information is a combination of five characteristic features that are computed for a sliding window. These features are center of mass, torque, and two moment of inertia measures. These are computed with respect to a coordinate frame that is a second sliding window which maintains a set distance along the curve from the first window. Nalwa defines global information as a combination of four functions: jitter (the difference between the raw stroke and smoothed stroke), aspect ratio, a measure of the amount of signature length warping applied, and the number of sample points in the signature. The local and global measures are combined using a biased harmonic mean.
Using this method, Nalwa achieves an equal error rate (the point at which the false accept and false reject rates are equal) of between 2% and 5% on datasets with a range of 43 to 102 different signers.

2.5.7 Writing Style-Based Models

Recently, there have been efforts to automatically identify different writing styles that exist in a set of writer-independent data, and to model each writing style separately in order to boost the performance of the recognizer. Clustering methods have been used to identify writing styles, and prototypical samples from each of these writing styles can be used to greatly reduce the time required to perform nearest neighbor classification [21, 83, 91, 115]. For parametric models, these clusters can be used to define multiple sub-class models that are more robust than a single class model due to reductions in the complexities of the resulting models. Bellegarda et al. [72] have identified multiple writing styles at the character level and modeled each of them using a hidden Markov model. Schomaker et al. [93, 109] have identified prototypical strokes, corresponding to writing styles, in a set of data taken from 30 writers using self-organizing maps. A hierarchical clustering of the data shows that certain writing styles appear to be based on nationality and gender. Schomaker et al. [94] make use of these prototypical strokes to perform writer-adaptation by retraining the transition net that connects these strokes to form words. These approaches currently impose one or more of the following restrictions:

1. The number of sub-classes must be pre-defined.

2. The clustering is done using a different representation space than that used in classification.

3.
The study is restricted to nearest neighbor classification.

The first restriction relies on a human expert to correctly choose the number of writing styles that exist for each character class. Not only can this prove to be a tedious task, it is difficult for a human to visualize the multidimensional temporal data in the same way as the highly complex models that will represent it. It should be noted that there may be significant differences in the temporal data between two characters, even when the static image representation of these characters appears very similar. In addition, if an automated method that is used for identification of the sub-classes makes use of a different representation space than will be used to model the data, as in the second restriction, there is no guarantee that the division of data is reasonable for the model space. Finally, the use of character-level nearest neighbor classifiers for word recognition requires a pre-segmentation of the words into characters. This makes the word classifier depend heavily on the ability of the segmentation algorithm to provide good segmentation alternatives. Other methods, such as hidden Markov models, have the ability to perform segmentation and classification at the same time, and are therefore less vulnerable to such problems.

In Chapter 3, we will present a method of sub-class identification in which the number of sub-classes is defined by an automatic method.
In addition, we will show the use of these sub-classes for nearest neighbor and decision tree classifiers (Chapter 4), in which the representation space is the same as was used to identify the sub-classes, and hidden Markov models (Chapter 5). In Section 5.3, a method of identifying sub-classes in the representation space of hidden Markov models will also be presented.

2.6 Postprocessing

Many handwriting recognition techniques model data at the character or the stroke level. In a complete word-level recognizer, these models can be concatenated to form word-level models. A dictionary can be used to restrict the possible character combinations, thus providing some higher-level knowledge. This may be implemented as a grammar which specifies all possible character model combinations, forming a graph of character models. Often, prior probabilities of words are used to discourage misclassifications into infrequent word categories. The best path through this graph, as produced by a Viterbi alignment of a test word, can then be used to specify the characters that make up that word's identity. The restriction of such a method is that the system has no chance of recognizing a word that is not in its vocabulary. Other, less restrictive methods use character bigram or trigram probabilities to discourage unlikely character combinations. This is done at the cost of a reduced recognition accuracy.

2.7 Summary

In this chapter, an overview of the online handwriting recognition techniques from the current literature was given. The processes of data collection, preprocessing of the data, and segmentation at the word and character levels were discussed.
The modeling techniques of primitive decomposition, motor models, deformation models, hidden Markov models, and neural networks have been described, along with a discussion of local and global features, and previous work involving the identification of writing styles within a set of handwritten data. Finally, a discussion of common postprocessing techniques was presented.

Chapter 3

Writing Style Identification — Lexemes

3.1 Introduction

A partition of the training data for each character class into multiple sub-classes is motivated by the observation that there exist many different styles of writing a single character. The disparity between many of these styles motivates the construction of a separate model to represent each style. While it may be possible to enumerate many of the different styles in existence, and solicit examples of these styles from writers, there are two potential problems with this approach: (i) new styles will most likely always exist which could not always be identified a priori, requiring the system designers to constantly search for new, so far unseen, writing styles (potentially a very tedious task); (ii) the features used by a recognition system to classify a character are most likely different than the features used by a human, and therefore writing styles may be defined differently for the recognizer and the human eye.
These factors motivate a reliable automatic method of discovering writing styles from a set of handwritten character examples without requiring human intervention.

This chapter describes our technique of automatic identification of the different handwriting styles, referred to as lexemes, within a set of handwritten character data. The chapter is organized as follows: Sections 3.2 and 3.3 describe a preprocessing and string matching distance measure, respectively, that are used to determine the similarity of two individual characters. Section 3.4 describes our technique of clustering these handwritten characters to identify possible lexeme groupings. The problem of determining the most appropriate number of lexemes for the given data is discussed, along with a description of the cluster similarity measure, in Section 3.4.1. Section 3.5 discusses a technique by which lexeme models can be refined to better fit the representation space of the modeling technique used. Section 3.6 describes the set of writer-independent handwritten data that is used to identify lexemes in a writer-independent system.

3.2 Preprocessing

Figure 3.1 shows the steps involved in preprocessing. In our approach, we have applied a size normalization to each character such that all characters are of the same height, but retain their original aspect ratio. In addition, certain characters exist in which the height is very small (e.g., "." and "-"), and this in itself is important to classification. These characters are detected by examining their height and aspect ratio. If either of these falls beyond a threshold value, then the character is not size normalized. Next, the character is centered to a standard reference point.

Character -> Size Normalization -> Centering -> Equidistant Resampling -> Gaussian Filtering -> Equidistant Resampling -> Preprocessed Character

Figure 3.1: Preprocessing steps for handwritten characters.
Following size normalization and centering, an equi-distant resampling of the data is applied, followed by a Gaussian filter applied independently to each of the x and y coordinates of the point sequence:

x_t^{filtered} = \sum_{i=-3\sigma}^{3\sigma} w_i \, x_{t+i},   (3.1)

where

w_i = e^{-i^2/2\sigma^2} \Big/ \sum_{j=-3\sigma}^{3\sigma} e^{-j^2/2\sigma^2}.   (3.2)

This is defined similarly for y_t^{filtered}. Equi-distant resampling must be done before the Gaussian filtering is applied so that the calculation of our filtered points is influenced by the points that are nearby in space rather than nearby in time. Throughout both of these operations, there are certain critical points, such as endpoints and points of high curvature along the curve, whose locations we should preserve. These points are detected and included with the resampled points. Gaussian filtering is then applied independently to every sequence of points between two consecutive critical points, ensuring that these critical points remain in their original locations. Now that the digit is smoothed and the character size has been normalized, resampling is done one more time so that the final sequence of points is equi-distant with the proper spacing for feature extraction, once again preserving those points that have been marked as critical points. Figure 3.2 shows an example of the character 'a' after each preprocessing step. Figure 3.3 shows an example of three different digits before and after preprocessing. Critical points are marked on the preprocessed digits with dots.
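The resample-then-smooth step can be sketched directly from equations (3.1) and (3.2): resample the stroke to equal arc-length spacing, then smooth x and y independently with the normalized Gaussian window. A minimal sketch (critical-point handling omitted; the clamping at sequence boundaries is an illustrative choice, not from the thesis):

```python
import math

def resample_equidistant(points, spacing):
    """Resample a polyline so consecutive points are `spacing` apart in arc length."""
    out, dist_left = [points[0]], spacing
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        seg = math.hypot(x2 - x1, y2 - y1)
        while seg >= dist_left:
            t = dist_left / seg if seg else 0.0  # interpolate along the segment
            x1, y1 = x1 + t * (x2 - x1), y1 + t * (y2 - y1)
            out.append((x1, y1))
            seg -= dist_left
            dist_left = spacing
        dist_left -= seg
    return out

def gaussian_filter(values, sigma=1):
    """Smooth a 1-D coordinate sequence with the window of eqs. (3.1)-(3.2)."""
    half = 3 * sigma
    w = [math.exp(-i * i / (2.0 * sigma * sigma)) for i in range(-half, half + 1)]
    z = sum(w)                                    # normalizer of eq. (3.2)
    smoothed = []
    for t in range(len(values)):
        # Clamp indices at the ends of the sequence (a boundary-handling choice).
        window = [values[min(max(t + i, 0), len(values) - 1)]
                  for i in range(-half, half + 1)]
        smoothed.append(sum(wi * v for wi, v in zip(w, window)) / z)
    return smoothed

pts = resample_equidistant([(0.0, 0.0), (10.0, 0.0)], 2.0)
ys = gaussian_filter([0.0, 0.0, 1.0, 0.0, 0.0])
```

Resampling first, as the text requires, makes the filter's window a window in arc length rather than in time.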
Figure 3.2: Preprocessing steps for character 'a' with sample points marked as dots: (a) original, (b) after size normalization and centering, (c) after first resampling, (d) after Gaussian filtering, and (e) after the final resampling.

3.3 String Matching Measure

A string matching technique is used to provide a distance measure between character pairs. A stroke is represented as a sequence of events (feature vectors), corresponding to the sequence of sample points in the stroke. This sequence forms a variable-length string. The distance between two different strings, A = (e_1^A, ..., e_{N_A}^A) and B = (e_1^B, ..., e_{N_B}^B), involves computing the distance between corresponding pairs of events e_i^A and e_j^B. This requires an alignment of the events between the two strings, and calculating string distances based on that alignment and the distances between the individual pairs of aligned events. The string matching technique presented here is a simplification of a technique that has been previously applied to fingerprint recognition [47]. Similar methods, under the name elastic template matching, have been applied to online handwritten data [103].

Figure 3.3: Examples of characters (a) before and (b) after preprocessing with critical points marked as dots.

Figure 3.4: The alignment between two digits, and the resulting matching scores for (a) a '2' compared with a '3' (Distance = 108), and (b) two '3's compared with each other (Distance = 10). Dotted lines are drawn to indicate corresponding aligned points between the two digits.
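In its simplest form, such an alignment-based distance is a dynamic time warping over the two point sequences: dynamic programming finds the minimum-cost monotonic alignment, with the cost of aligning two events taken here as the Euclidean distance between their (x, y) coordinates. A minimal sketch of that simple form only; the thesis's full measure adds per-event angle terms and skip penalties:

```python
import math

def elastic_distance(a, b):
    """Minimum-cost monotonic alignment of two (x, y) point sequences.

    Each step may advance either sequence or both, so every point of each
    character is aligned with at least one point of the other.
    """
    na, nb = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (nb + 1) for _ in range(na + 1)]
    D[0][0] = 0.0
    for i in range(1, na + 1):
        for j in range(1, nb + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean interpoint distance
            D[i][j] = d + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[na][nb]

stroke = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)]
d_same = elastic_distance(stroke, stroke)
```

Identical strokes align at zero cost; any shape difference accumulates as interpoint distance along the best alignment.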
In our approach, each event (corresponding to a sample point after preprocessing) is represented with 3 measurements: the x and y offsets with respect to some reference coordinate, and the angle of curvature of the written stroke at the sample point. We define the reference coordinate as the first sample point of the digit's first stroke. Given an alignment of the events between two strokes, the distance, d_e(i, j), between a pair of events, e_i^A from stroke A and e_j^B from stroke B, is defined as a linear combination of the respective differences between the two events of the three measurements taken for each event:

d_x(i, j) = |(x_i^A - x_1^A) - (x_j^B - x_1^B)|,   (3.3)

d_y(i, j) = |(y_i^A - y_1^A) - (y_j^B - y_1^B)|,   (3.4)

d_\theta(i, j) = \min(|\theta_i^A - \theta_j^B|, 360 - |\theta_i^A - \theta_j^B|),   (3.5)

d_e(i, j) = \alpha \, d_x(i, j) + \beta \, d_y(i, j) + \gamma \, d_\theta(i, j),   (3.6)

where all \theta are in the range (0, 360), and \alpha, \beta, and \gamma are the weights of the linear combination. Appropriate values for these weights are determined empirically. In general, placing a large weight on d_\theta(i, j) results in a better classification accuracy. The distance between two characters is then the sum of the distances between each pair of corresponding points from the strings, given some alignment of the points. For our experiments, characters containing multiple strokes (delayed strokes) are handled by creating a connecting segment from the last point of a stroke to the first point of the next stroke, and therefore all characters are represented as a single string. We place a restriction that prohibits alignments in which two or more events in one string map to a single event in the second string. Alternate alignments are created by two methods: (1) skipping an event from the first string if it is determined that, for the given alignment, the event is spurious; (2) skipping an event in the second string if it is determined that, for the given alignment, the corresponding event in the first string is missing.
In each case, we add a penalty to the total distance between the strings, respectively called Spurious Penalty and Missing Penalty. These penalties also act as threshold values on the distance between two events in determining if a spurious or missing event exists.

The distance between two strings is calculated by considering all possible alignments of events in the two strings, and finding the alignment for which the total distance is minimum using dynamic programming. The calculation of the distance between two strings, A and B, is shown here in terms of the calculation of the distances between a set of events, e_i^A and e_j^B:

D(i, j) = min {
    D(i-1, j-1) + d_e(i, j)
    D(i-1, j) + Missing Penalty,                        1 <= i < N_A
    D(i, j-1) + Spurious Penalty,                       1 <= j < N_B
    D(i-1, j-1) + Missing Penalty + Spurious Penalty
},   (3.7)

Dist(A, B) = D(N_A, N_B).   (3.8)

For our experiments, we have set Missing Penalty = Spurious Penalty. The final distance between two strings uses an additional global measurement: a stroke count difference penalty term:

Dist_{SP}(A, B) = Dist(A, B)^2 / Norm\_Factor(N_A, N_B) + (SP) |S_A - S_B|,   (3.9)

where S_A (S_B) is the number of strokes that make up digit A (B), SP is the stroke penalty, and Norm_Factor(N_A, N_B) is the maximum possible distance that can be calculated between any two strings of lengths N_A and N_B, scaled by a constant factor.

3.4 Squared-Error Clustering

Clusters of character samples are formed using a squared-error clustering algorithm called CLUSTER [48]. In order to make use of this technique, we require that each character be represented by a common feature vector. However, our data does not have such a representation, since it is in the form of a string. An approximation to a Euclidean feature space can be obtained by representing each character as the vector of distances to every character in the feature space.
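The measure of equations (3.3)-(3.8) can be sketched as follows, with an event distance over (x, y, theta) triples feeding the dynamic program. The weights and penalties are illustrative placeholders, the boundary conditions are simplified relative to equation (3.7), and the stroke-count term of equation (3.9) is omitted:

```python
ALPHA, BETA, GAMMA = 1.0, 1.0, 2.0       # eq. (3.6) weights; placeholders only

def event_distance(ea, eb):
    """Eqs. (3.3)-(3.6): events are (x, y, theta), offsets precomputed."""
    dx = abs(ea[0] - eb[0])               # eq. (3.3)
    dy = abs(ea[1] - eb[1])               # eq. (3.4)
    dt = abs(ea[2] - eb[2])
    dtheta = min(dt, 360.0 - dt)          # eq. (3.5): wrap-around angle difference
    return ALPHA * dx + BETA * dy + GAMMA * dtheta

def string_distance(a, b, missing=20.0, spurious=20.0):
    """Eqs. (3.7)-(3.8): minimum-cost alignment with skip penalties."""
    na, nb = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (nb + 1) for _ in range(na + 1)]
    D[0][0] = 0.0
    for i in range(na + 1):
        for j in range(nb + 1):
            if i and j:
                step = event_distance(a[i - 1], b[j - 1])
                D[i][j] = min(D[i][j], D[i - 1][j - 1] + step,
                              D[i - 1][j - 1] + missing + spurious)
            if i:                         # event of A left unmatched ("missing")
                D[i][j] = min(D[i][j], D[i - 1][j] + missing)
            if j:                         # event of B left unmatched ("spurious")
                D[i][j] = min(D[i][j], D[i][j - 1] + spurious)
    return D[na][nb]

s = [(0.0, 0.0, 0.0), (1.0, 0.0, 90.0)]
d_same = string_distance(s, s)
```

Because the penalties also bound the cost of matching a bad pair, a grossly misaligned event is skipped rather than matched, exactly the role the text assigns to the two penalty terms.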
Using the distances calculated by our string matching method, a proximity matrix is constructed for each character class. Each matrix measures the intra-class distances for a particular character class. Clustering is then done in which the feature vector for a character is its M distances to each one of the M characters belonging to the same class. CLUSTER attempts to produce the best clustering for each value of K over some range, where K is the number of clusters into which the data is to be partitioned. However, the final decision as to what value of K is most appropriate is still left up to the user. Figure 3.5 shows an example of the number of clusters plotted against the total intra-cluster mean squared error for the character class 'm'. As expected, the mean squared error decreases monotonically as a function of K. The "optimal" value of K can be chosen by identifying a "knee" in the curve. An inspection of Figure 3.5 shows that identification of a knee is not easily accomplished. This plot is typical for most characters, making the choice of K very difficult by this method.

3.4.1 Selecting An Appropriate Number of Clusters

Davies and Bouldin [25] present a measure of cluster similarity for a set of clusters. Their measure can be used to find an "optimal" value of K for which the cluster similarity is minimal, or equivalently, the cluster separation is maximal.

Figure 3.5: Mean squared error (MSE) for different values of K, the number of clusters, for the character 'm'.

The cluster similarity measure is defined as follows. For cluster i with cluster center A_i, population size T_i, and cluster members (X_1, ..., X_{T_i}), the dispersion of the cluster is defined as

S_i = \Big\{ (1/T_i) \sum_{j=1}^{T_i} \|X_j - A_i\|^q \Big\}^{1/q},   (3.10)

where A_i is the vector (a_{i1}, ..., a_{iK})^t, and X_j is the vector (x_{j1}, ..., x_{jK})^t.
The inter-cluster distance is defined as

M_{ij} = \Big\{ \sum_{k=1}^{K} |a_{ki} - a_{kj}|^p \Big\}^{1/p},  p > 0.   (3.11)

From these two measures, the similarity between two clusters i and j is defined as

R_{ij} = (S_i + S_j) / M_{ij},   (3.12)

and the similarity of a set of K clusters is defined as

\bar{R} = (1/K) \sum_{i=1}^{K} \max_{j \ne i} R(S_i, S_j, M_{ij}).   (3.13)

Similar statistics have appeared elsewhere in the literature [19, 26]. Generally, p and q are set to 2, in which case S_i becomes the standard deviation of cluster i, and M_{ij} becomes the Euclidean distance between the center of cluster i and the center of cluster j.

3.5 HMM-based Clustering

The method of lexeme identification presented thus far in this chapter relies on clustering in a representation space which is based on the string matching measure. This seems appropriate if string matching also forms the basis of our classifier. However, in Chapter 5 we shall discuss the use of hidden Markov models (HMMs) for classification. Ideally, lexemes should be defined using the same feature space as will be used to model these lexemes, and the same distance measure as will be used for recognition. A clustering of the data using the HMM likelihood as a distance measure between an HMM trained for a cluster and a single character can be defined, such that the clustering is done in HMM probability space. Given some initial set of clusters, one can iteratively train one model for each cluster and then redefine the membership of each cluster by assigning each character example to the cluster whose model, within the same character class, gives the maximum likelihood score for that example. Details of this method will be described in Section 5.3.

3.6 Results

The online handwriting database used in our experiments is particularly challenging in that it contains both discretely written and manually segmented cursive characters in a variety of handwriting styles.
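The cluster-similarity computation of Section 3.4.1 (equations (3.10)-(3.13), with p = q = 2) can be sketched as:

```python
import math

def davies_bouldin(clusters):
    """Davies-Bouldin index, eqs. (3.10)-(3.13) with p = q = 2.

    clusters: list of clusters, each a list of points (tuples of coordinates).
    Lower values indicate more compact, better-separated clusters.
    """
    def centroid(pts):
        return tuple(sum(c) / len(pts) for c in zip(*pts))

    centers = [centroid(c) for c in clusters]
    # Eq. (3.10) with q = 2: S_i is the RMS distance of members to the center.
    S = [math.sqrt(sum(math.dist(x, a) ** 2 for x in c) / len(c))
         for c, a in zip(clusters, centers)]
    K = len(clusters)
    total = 0.0
    for i in range(K):
        # Eqs. (3.12)-(3.13): worst-case similarity R_ij over j != i, averaged.
        total += max((S[i] + S[j]) / math.dist(centers[i], centers[j])
                     for j in range(K) if j != i)
    return total / K

# Two tight, well-separated clusters give a small index.
db = davies_bouldin([[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]])
```

Sweeping K and picking the value that minimizes this index automates the choice that the knee of the MSE curve fails to make reliably.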
Some examples of these characters are shown in Figure 3.6. Experiments were run using a set of 17,947 alphanumeric characters for training, and 17,928 for testing, from a combined pool of data representing 21 writers and a total of 36 classes: 10 digits and 26 hand-segmented lowercase cursive letters¹. An example of the cluster similarity measure for different values of K is shown in Figure 3.7. The final number of clusters chosen for each of the 36 character classes is shown in Table 3.1.

Table 3.1: Number of clusters, K, chosen for 36 alphanumeric classes.

Class:  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  g  h
K:      4  3  3  6  3  4  7  3  4  7  3  3  3  3  3  ?  3  3

Class:  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
K:      5  6  4  3  3  3  3  3  3  9  3  3  3  3  5  5  8  4

¹Data provided by the Pen Computing group at IBM Watson Research Center.

Figure 3.6: Examples of characters in our database.
d9.55, d9.49, d0-1 > 325= d9.34, d9.43, d9.15 ) :...d3.6 > 220: Class 7 63.6 <= 220: :...d5.17 <= 153: Class 3 Global Features: : d5.17 > 153: Class 2 d3:6 > 232: ( aspect (aspect ratio of test char.), 3---d1-36 <3 78‘ Class 1 strokes (no. of strokes in test char.) ) d1.36 > 78: :...d6.3 <= 158: :...d8.47 <= 288: ' :...strokos = 2: Class 4 strokes in {3,4,5}: Class 0 strokes = 1: :...d3.44 <= 223: Class 1 . d3.44 > 223: Class 0 d8.47 > 288: :...d3.47 <= 320: Class 6 63.47 > 320: :...d0.1 <8 34: Class O : d0.1 > 34: Class 1 d6.3 > 158: Figure 4.2: An example of part of a decision tree using similarity features. Distances from a test character to a reference character are indicated in the tree by the name of the reference character. This is shown as day, where a: is the digit class, and y is the index of that reference digit from the training set. 55) lit tun; These features shall be referred to as the diflerence features (see Figure 4.3). When constructing a decision tree, at any node, n, we choose a pair of reference characters, i(n) and j(n), with corresponding feature Dailnwlnlv and a threshold TDczlnljln) for partitioning the training data based on the following rule: , traverse the left branch of the node n If Dc,i(n),j(n) < TD c,z'(n).j(n) otherwise, traverse the right branch of node n. Figure 4.4 illustrates these two feature representations. The difference features have the disadvantage of requiring two distance calculations per feature. However, as we shall see in Section 4.5, these features have the potential to produce superior classification results, and redundancies in the distances used by each feature keep the average number of distances that must be calculated to classify a test character relatively small. 4.4.3 Global Measurements It is often useful to use both global and local features of characters for classification. 
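The branching rule of Section 4.4.2 above, and the corresponding single-distance rule used with similarity features, can be sketched as follows; `dist` stands in for the string-matching distance, and the thresholds are assumed to have been chosen during tree construction.

```python
# Sketch of the two node tests. dist(x, r) stands in for the string-matching
# distance from test character x to reference character r; thresholds are
# assumed to have been learned when the tree was grown.

def similarity_test(dist, x, ref_i, threshold):
    # Similarity feature: threshold a single distance d_{c,i(n)}.
    return "left" if dist(x, ref_i) < threshold else "right"

def difference_test(dist, x, ref_i, ref_j, threshold):
    # Difference feature: threshold D_{c,i(n),j(n)} = d_i - d_j, i.e. which
    # of the two references the character is closer to, and by how much.
    # Two distance calculations are needed per feature.
    return "left" if dist(x, ref_i) - dist(x, ref_j) < threshold else "right"
```

With a zero threshold, the difference test reduces to asking which of the two reference characters is nearer.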
When using a nearest neighbor classifier, we require some good method of combining these features to compute the inter-pattern distance. We use a weighted combination of global measurements (differences in stroke counts) with local measurements (the string matching distance). The weights in this combination were determined empirically, using a subset of the training data to evaluate the recognition accuracy for different values of the weights. Other global features, such as aspect ratio, are implicitly represented by the string matching measure, since we maintain the original aspect ratio of the character during size normalization.

Figure 4.3: An example of part of a decision tree using difference features. The reference pair used to make a decision at a node is labeled dx1.y1-dx2.y2, where x_i is the digit class, and y_i is the index of that reference digit from the training set.

Figure 4.4: Similarity and difference features. (a) Similarity features: while training patterns a and b are grouped based on their similarity to template X, examples c and d, although very different from each other, will also be grouped together. (b) Difference features: training patterns on each side of the decision threshold (the difference of distances to template X and template Y) are more similar to each other than to patterns on the other side.

A preferred method of determining how to combine global and local measurements would be to use an automated learning method. For decision tree feature sets, we have removed the effects of the number of strokes and aspect ratio from the distance measure by eliminating the stroke count difference from the measure, and by normalizing the size of each character to a standard height and width. The stroke count difference and original aspect ratio are then presented to the decision tree as two additional features along with the "local" features. In this way we allow the decision tree to determine the relative importance of each feature.

4.5 Results

This section compares classification results for both nearest neighbor and decision tree classifiers based on the full training set, the edited training set, and cluster centers. All timing results were obtained using a 296MHz UltraSparc workstation.
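The weighted global/local combination of Section 4.4.3 used for the nearest neighbor classifier can be sketched as follows; the weight values shown are placeholders, since the thesis sets them empirically on a subset of the training data.

```python
# Sketch of the combined nearest-neighbor distance: a weighted sum of the
# local string-matching distance and the global stroke-count difference.
# The default weights are placeholders, not the empirically chosen values.

def combined_distance(string_match_dist, strokes_test, strokes_ref,
                      w_local=1.0, w_global=10.0):
    return (w_local * string_match_dist
            + w_global * abs(strokes_test - strokes_ref))
```

Setting `w_global` to zero recovers the pure string-matching distance, which is the configuration used for the decision tree feature sets, where stroke count and aspect ratio are supplied as separate features instead.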
4.5.1 Datasets

Experiments were run using a combination of three different sets of data: (i) a set of 600 handwritten digits captured from 21 different writers, which will be referred to as DA; (ii) a set of digits, referred to as DB, which is an independent set of 360 digits taken from an unknown number of writers and has been shown to have a slightly different writing style distribution than DA [21]; (iii) a set of 35,875 lowercase letters, originally written in a cursive style, but manually segmented so that they are now available as individual isolated characters. This set, referred to as DC, was produced by the same group of 21 writers as DA.

4.5.2 Digit Recognition

Digit classifiers were trained using the 600 digits in set DA, and tested using the 360 digits of set DB. Clustering was done for each of the 10 classes of the training set. Table 4.1 shows the number of clusters chosen for each of the digits using the following two methods: manually selecting K using the K vs. MSE graphs (e.g., Figure 3.5), or automatic selection based on Davies and Bouldin's method [25]. For automatic selection, we consider only cluster counts greater than 2.

Table 4.1: Number of clusters, K, chosen for 10 digit classes by manual selection, and by automatic selection.

    Digit Class          | 0 1 2 3 4 5 6 7 8 9
    Manually Selected K  | 4 5 7 6 3 3 4 4 5 10
    Auto. Selected K     | 4 3 3 6 3 4 7 3 4 7

Table 4.2: Digit recognition rates using the nearest neighbor classifier.

    Training Set          No. Templates   Throughput      Recog. Rate
    Full Set (DA)         600             ~2 char/sec     91.10%
    Manually Selected K   51              ~26 char/sec    88.90%
    Auto. Selected K      44              ~31 char/sec    87.50%
    Edited                402             ~4 char/sec     91.39%

Nearest Neighbor Classifier

Representative templates were chosen, one for each cluster, based on their proximity to the center of the cluster.
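Selecting one representative template per cluster, as just described, can be sketched as follows. Vectors-as-tuples and squared Euclidean proximity are illustrative assumptions here; the thesis makes this selection in the string-matching representation space.

```python
# Sketch of picking the cluster member nearest the cluster center as that
# lexeme's reference template. Euclidean proximity over numeric feature
# tuples is an illustrative assumption.

def representative(cluster):
    d = len(cluster[0])
    centroid = [sum(x[k] for x in cluster) / len(cluster) for k in range(d)]
    return min(cluster,
               key=lambda x: sum((xk - ck) ** 2
                                 for xk, ck in zip(x, centroid)))
```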
Table 4.2 shows the results of testing a nearest-neighbor classifier using four different training sets: the full set of 600 training examples, the set of 51 cluster centers where the number of clusters is manually chosen, the set of 44 cluster centers chosen by the Davies and Bouldin method, and the edited training set. As can be seen, both of the cluster representation methods result in a significant reduction in classification time at the expense of lower classification accuracy. The edited set does not lead to a substantial reduction in classification time; however, it surprisingly results in slightly better recognition accuracy.

Decision Trees

The decision tree was implemented using the package C5.0 [86]. The reference character set used in the definition of features (similarity or difference) to construct a decision tree can be any of the training sets used for the nearest neighbor classifiers: the set of cluster centers, the edited training set, or the full training set. In addition, the two global features, number of strokes and aspect ratio, were not used in computing the distance measures, but were added as two additional features in the decision tree, as discussed in Section 4.4.3. Table 4.3 shows the results of the digit classification problem using decision trees where the reference templates are the 51 cluster centers, the edited training set of size 402, and the full training set of 600 digits, respectively. Similarity and difference features are shown for decision trees where the set of cluster centers is the reference set. Due to its large dimensionality and computational considerations, only the similarity features are shown when the reference set consists of the full training set or the edited reference set. As can be seen in Table 4.3, the use of decision trees results in a great reduction in the average number of distances that must be calculated in order to classify a test digit.
However, this comes at the cost of a significant reduction in recognition accuracy. Among the features based on cluster centers, however, it is apparent that the difference features produce superior results compared to the similarity features, at the cost of a 48.2% increase in the number of distances that must be calculated per test sample.

Table 4.3: Digit recognition rates using the decision tree.

                      Cluster Centers              Edited          Full Set
                      Similarity    Difference     Similarity      Similarity
    Avg. Dist. Calcs  5.6           8.3            5.1             5.1
    Throughput        ~77 char/sec  ~50 char/sec   ~67 char/sec    ~91 char/sec
    Recog. Rate       73.3%         79.2%          76.4%           80.3%

Table 4.4: Digit recognition rates using the decision tree when global traits are treated as separate features.

                      Cluster Centers              Edited          Full Set
                      Similarity    Difference     Similarity      Similarity
    Avg. Dist. Calcs  4.99          8.38           4.24            4.7
    Throughput        ~91 char/sec  ~59 char/sec   ~100 char/sec   ~83 char/sec
    Recog. Rate       79.7%         81.1%          74.2%           71.7%

Table 4.4 shows how the results of Table 4.3 change when the effects of the global features (number of strokes and aspect ratio) are removed from the distance measures and are instead added as two additional features in the decision tree. A comparison of Table 4.4 with Table 4.3 shows that, in the case of the cluster center reference sets, representing these global traits as separate features produces an increase in recognition accuracy along with a decrease in the average number of distances that must be calculated to classify a test digit, since distances need only be calculated for the tree nodes traversed during classification. However, when either the edited training set or the full training set is used, the reduction in the average number of distances calculated is achieved at the cost of recognition accuracy.

One of the advantages of the C5.0 package is that it can successively train multiple tree classifiers from the same data, with each new classifier focusing on improving the misclassifications of the previously trained classifiers, and then combine them using a technique called boosting [85].

Figure 4.5: Digit recognition by combining multiple tree classifiers using boosting. Trees are constructed using the similarity and difference features based on cluster center reference sets, and on the edited reference set, and similarity features based on the full training set.

Table 4.5: Digit recognition using boosted decision trees. The column "No. Templates" indicates the number of reference templates available to the classifier and, in parentheses, the average number of templates used (number of distances that had to be calculated) per classification. Cells marked "?" are illegible in the source.

    Reference Set    Features     No. Templates      Throughput     Recog. Rate
    Cluster Centers  Similarity   51 (31.45 ave.)    ~34 char/sec   88.6%
    Cluster Centers  Difference   51 (?)             ~27 char/sec   88.1%
    Edited           Similarity   402 (65.4 ave.)    ?              ?
    Full Set         Similarity   600 (48.30 ave.)   ~22 char/sec   86.4%

Figure 4.5 shows the decrease in error rate for decision trees as the number of classifiers used in boosting is increased.
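The boosted combination can be sketched as a weighted vote over the member trees; the classifier-weight scheme below is a generic stand-in for the weights C5.0 assigns internally.

```python
# Sketch of combining boosted tree classifiers by weighted voting. Each
# classifier maps a sample to a class label; the weights are a generic
# stand-in for those assigned by the boosting procedure.
from collections import defaultdict

def boosted_classify(classifiers, weights, x):
    votes = defaultdict(float)
    for clf, w in zip(classifiers, weights):
        votes[clf(x)] += w  # each member tree casts a weighted vote
    return max(votes, key=votes.get)
```

As more trees are combined, each vote requires traversing another tree, which is why classification time grows with the level of boosting.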
Curves are shown for the similarity and difference features based on the cluster center reference set, and for similarity features based on the edited reference set and the full set of training examples. As the number of classifiers combined is increased, so is the average time required to classify each character, while the error rate approaches a lower bound (see Figure 4.6 for an example). The best results are summarized in Table 4.5 for the similarity and difference features based on the cluster center reference set, and similarity features based on the edited reference set and full set of training examples. This table also reports the total number of templates available (the reference character set), and the average number of templates used in the classification of a single character.

It seems clear from Table 4.5 that the classifiers based on the cluster center reference sets perform better than the classifiers based on the edited and full sets. Overall, the similarity features based on the cluster center reference set give the best results: 88.6% recognition accuracy with a throughput of approximately 34 characters per second. This classifier compares well against our best nearest neighbor classifier using cluster centers, which achieves 88.9% recognition accuracy with a throughput of approximately 26 characters per second (Table 4.2).

4.5.3 Alphanumeric Character Recognition

Experiments were run using a set of 17,947 alphanumeric characters for training and 17,928 characters for testing, created by combining datasets DA and DC and splitting this combined set into training and test sets by random selection. Clustering was performed on each of the 36 classes of this training set. The distribution of templates retained using cluster centers is shown in Table 3.1.
Nearest Neighbor Classifier

Table 4.6 shows the results of the nearest neighbor classifier using the full training set, cluster centers where the number of clusters is automatically chosen, and the edited set. No significant increase in accuracy was found by using a K-nearest neighbor classifier (K = 2, K = 3) over a 1-NN classifier in either the digit or the alphanumeric classification case. As can be seen, the cluster centers method does not work well at all in this alphanumeric classification case. This exposes a problem in using cluster centers as training examples in a nearest neighbor classifier: all information about the size and shape of each cluster is lost, which may result in decision boundaries not being properly represented. The edited set achieves a 33.7% reduction in classification time while maintaining most of the recognition accuracy. However, the overall computational requirement of this set, 12.18 seconds per test character, cannot be considered practical.

Table 4.6: Alphanumeric recognition rates (36 classes) using the nearest neighbor classifier.

    Training Set     No. Templates   Throughput        Recog. Rate
    Full Set         17,947          ~0.05 char/sec    88.92%
    Cluster Centers  144             ~6 char/sec       45.41%
    Edited           11,867          ~0.08 char/sec    87.88%

Decision Trees

Figure 4.6 shows the results of classifying the 36-class alphanumeric character data using boosted decision trees based on the 144 cluster centers and using similarity features, difference features, and a combination of the two. Due to the high dimensionality of the difference feature set, a reduced set of these features was obtained by training a single tree on the full set of difference features using a small amount of randomly sampled training data, and observing which features were selected in the decision tree. This reduced the number of difference features from 20,736 (144 x 144) to 644.
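The feature-reduction step just described (fit a single tree on the full difference-feature set, then keep only the features its internal nodes actually test) can be sketched as follows; the nested-dict tree encoding is an illustrative assumption, not the C5.0 output format.

```python
# Sketch of collecting the features actually used by a fitted decision
# tree, so the difference-feature set can be pruned to just those. The
# nested-dict encoding (internal nodes carry a "feature" key and
# "left"/"right" children; leaves carry only a class) is an assumption.

def used_features(node):
    if node is None or "feature" not in node:
        return set()  # leaf or missing child: contributes no features
    return ({node["feature"]}
            | used_features(node.get("left"))
            | used_features(node.get("right")))
```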
As Figure 4.6 shows, the decision tree classifier does a better job of discriminating these 36 classes through the use of its trained decision thresholds, compared to the simple nearest neighbor classifier. Although the difference features tend to provide a higher classification accuracy for a given level of boosting, the similarity features give a better time vs. accuracy trade-off. The best classification accuracy reported is 86.9%, with a throughput of over 8 characters per second (see Table 4.7). Comparing this to the nearest neighbor classifier using the cluster centers as a reference set, there is a 25.3% decrease in classification time, and compared to using the entire dataset as a reference set, there is a 99.3% decrease in classification time. In addition, the decision tree classifier retains 97.7% of the recognition accuracy achieved by the full-dataset nearest neighbor classifier.

Table 4.7: Performance of nearest neighbor vs. decision tree classifiers for 36-class alphanumeric character classification. The column "No. Templates" indicates the number of reference templates available to the classifier and, in parentheses, the average number of templates used (number of distances that had to be calculated) per classification.

    Classifier                                      No. Templates      Throughput       Recog. Rate
    NN - Full Set                                   17,947             ~0.05 char/sec   88.92%
    NN - Cluster Centers                            144                ~6 char/sec      45.41%
    Decision Tree - Similarity (23 trees combined)  144 (100.1 ave.)   ~11 char/sec     86.1%
    Decision Tree - Difference (20 trees combined)  144 (130.5 ave.)   ~8 char/sec      86.9%
    Decision Tree - Combined (18 trees combined)    144 (127.7 ave.)   ~8 char/sec      86.9%

4.6 Summary

We have demonstrated a method of online character recognition using non-parametric models which focuses on a representation of characters that models the different writing styles found in the training set. These models are then used by a decision tree classifier which yields good class discrimination.
Through this character representation, we are able to obtain an 86.9% classification accuracy for a 36-class (alphanumeric characters) problem with a throughput of over 8 characters per second on a 296MHz Sun UltraSparc workstation. There is a reduction in computation time of 99.4% as compared to the time requirements for a nearest neighbor classifier using the full set of available training examples. This is achieved by capturing the within-character-class variability in terms of lexemes (clusters). Based on these experimental results, we draw the following conclusions:

- Cluster centers are reasonable prototypes for the individual characters in a nearest neighbor classifier in the case of digit classification, where the ratio of the number of clusters to the amount of training data is not too small (44/600 = 0.073 in the automated case). However, in the alphanumeric classification case, where this ratio is very small (144/17,947 = 0.008), we require some additional information about the structure of these clusters.

- Representing the categories by their border patterns allows us to reduce the number of patterns that must be stored by approximately one third, with almost no reduction in classification accuracy for the nearest neighbor classifier. However, since this data reduction is still related to the size of the original training set, classification time remains a function of the training set size. Cluster centers, on the other hand, are a function of the number of writing styles present in the data, and are therefore not directly related to the size of the training set.

- The use of cluster centers produces a higher recognition accuracy for decision trees. This is an indication that the cluster centers do contain the relevant information for classification.

Figure 4.6: Boosting for 36 alphanumeric classes. A comparison of decision trees built using similarity features and difference features, both based on the cluster center reference set, and a combination of the two feature sets.

Chapter 5

Parametric Models

5.1 Introduction

Parametric models make some assumptions about how the patterns from individual categories are distributed, estimate the parameters of that distribution, and thereby attempt to represent the data in a compact form. Unlike non-parametric models, classification time is then not a function of the amount of training data used to estimate that distribution, but of the number and complexity of the models that represent the distribution. If our assumptions about the underlying form of the distribution are valid, then as the amount of training data is increased, the performance of the parametric approach will approach the "optimal" value. Lexeme modeling allows us to represent complex class-conditional densities for individual categories as a combination of simpler distributions (lexeme distributions). In this chapter, we explore the advantages of constructing separate parametric models (hidden Markov models) for each lexeme. In addition, we describe another method of identifying lexemes in the representation space defined by the hidden Markov models.

5.2 Hidden Markov Model Classifier

We describe a handwritten character recognition system based on hidden Markov models.

5.2.1 Feature Extraction

Each character example is preprocessed in the manner described in Section 3.2.
We extract four different features at each sample point on the character: the change in the x coordinate from the previous point to the current point (Δx), the y coordinate of the current point, and the sine and cosine of the angle of incline, θ, of the line segment between two consecutive points. The characters are modeled using hidden Markov models with continuous Gaussian observation densities. In order to keep the number of model parameters that must be estimated from growing too large, an attempt was made to de-correlate the four features using principal component analysis [42], thereby allowing the use of diagonal covariance matrices to represent state distributions. In addition, the feature dimensionality can potentially be reduced by eliminating those dimensions in the projected space that correspond to a small percentage of the overall variance (very small eigenvalues relative to the total sum of eigenvalues for the system). The best recognition results were obtained when the feature dimensionality was reduced from 4 to 3 dimensions. It should be noted that principal component analysis, when applied to the full set of all feature vectors, does not guarantee that the features will be decorrelated for the subset of this data that is assigned to an individual state in the HMM.

Figure 5.1: The topology of our hidden Markov models. Note that only single or double state-skipping transitions are allowed.

5.2.2 Hidden Markov Models

In the simplest system, a single hidden Markov model is trained for each character class, using all the training data available for that class. These are left-to-right models in which each state contains a single continuous Gaussian distribution with a diagonal covariance matrix to represent the feature observations for that state. In addition, each state allows for self-transitions and single-step transitions, as well as transitions that involve skipping one or two consecutive states (see Figure 5.1).
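The per-point features of Section 5.2.1 can be sketched as follows; the input format (a list of (x, y) pen samples for one stroke-concatenated character) is an illustrative assumption.

```python
# Sketch of the per-point feature extraction of Section 5.2.1: for each
# sample point, the change in x from the previous point, the current y, and
# the sine and cosine of the incline angle of the segment joining the two
# points. The (x, y) sample-list input format is an assumption.
import math

def extract_features(points):
    feats = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        theta = math.atan2(dy, dx)  # incline of the segment
        feats.append((dx, y1, math.sin(theta), math.cos(theta)))
    return feats
```

In the thesis, these four features are then decorrelated with PCA and reduced to three dimensions before HMM modeling.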
A test character is classified by identifying the model for which the probability of observing that character is the greatest. This approach can be enhanced by first categorizing the training data of each character class into different writing styles (lexemes), and training individual models for these lexemes. The individual strokes of multistroke 76 characters are concatenated into a single string, and are therefore represented by a single model. This method has the advantage of being robust to noise during data collection that may break a single stroke into multiple strokes. A character is then classified into class C36,) corresponding to the model which maximizes the aposteriori probability: Best = arg mEaJx P()\,-|O), (5.1) where Bayes Formula defines: P(0|’\i)P(/\i) P(0) P(/\i|0) = (5-2) in which 0 is a string of observations (feature vector) corresponding to the test character, A,- is a HMM trained for lexeme i, and P(O|)\,-) is the likelihood of 0 given that the correct model is /\,-. This likelihood is obtained using the Viterbi algorithm [30, 108]. Since P(O) is constant over all /\,-, we can make the following simplification (for classification purposes): POM-IO) % P(Ol)‘i)P(/\i)- (53) Furthermore, by assuming that the lexeme priors are equal, the decision rule can be reduced to P()\,-|O) z P(0|/\,). (5.4) 77 A character is then assigned to class CB,“ with Best = arg max P(O|)\,-), (5.5) which is the class defined by the model, out of the set of lexeme models, L, which has the maximum likelihood of producing O. 5.3 HMM-based Clustering The appropriateness of our string matching-based clustering method for identifying lexemes is not clear when our models, namely HMMs, are based on a different rep- resentation. A more appropriate clustering method would use the same distance measure as is used in classification. Note that our model is a parameterized stochas- tic model, and therefore the distance measure is probabilistic. 
While techniques of fitting clusters using stochastic models are well known when applied to non-temporal Gaussian mixtures [67], this remains largely unexplored for temporal data. In our approach, a method based on K-Means clustering was explored, pr0posed by Perrone and Connell [78]. A character is given the class of the hidden Markov model for which the Viterbi algorithm produces the largest likelihood. This probability will then be used to define the distance between a single character and a cluster of characters. Since this distance measure is not defined until clusters are created, we initialize the process by using our string matching-based clusters. Although this clustering does not occur in the same space as the hidden Markov models, there are similarities between the two representations in that both the string matching measure 78 and HMMs are based on stretching and compression of the temporal data. A hidden Markov model is trained for each cluster using patterns in that cluster as training data, forming lexeme models. Refinement then proceeds by reassigning each training example to the cluster whose lexeme model has the greatest probability of having produced that training example. The second stage is applied iteratively until there is no further refinement. The algorithm proceeds as follows: HMM-based Clustering N L is the total number of lexemes for class C. N is the total number of training patterns for class C. L,- 2 Initial Cluster Membership for Cluster 2', for all i, 1 g i S NL L?” = L,, for all i, 1 g i g M, Loop until (L,- = LfvEW, for all i, 1 _<_ i g NL) For each i, 1 _<_ i g NL Li : LINEW 19wa = {(0} )‘Lz’ = HMM_Train(L,-) Foreachi,1gi§N k = argmaxj P(O"|)\Lj), for all j, 1 S j g NL LIICVEIV ‘_ Oi 79 where L, is the set of members of the ith cluster out of N L total clusters from some character class C, and LfVEW for all i, 1 S i g N L defines the new cluster memberships after each iteration of the while 100p. A L1. 
is the set of trained parameters for a hidden Markov model to represent the cluster Li, and HMM_Train(S) is the algorithm which trains these parameters using a set of training data, S. O is the observed string of events for the ith character of class C in the training set, and P(Oi|/\Lj) is the probability that this character matches the model for cluster L, and is obtained using the Viterbi algorithm. N is the total number of training examples from class C. 5.4 Results An evaluation of our recognition system was conducted using three different datasets: a set of 12,291 handwritten digits, a set of 80,524 handwritten lowercase characters, and a set of 22,301 handwritten uppercase characters. These datasets were produced by a group of over 100 writers, and contains a combination of discretely written and handsegmented cursive characters. Each of these datasets was randomly divided into half training and half test data. Lexemes were identified within the training data using the string matching-based clustering method described in Chapter 3. Table 5.1 shows the recognition accuracies of our system obtained using a nearest neighbor classifier in which the full training set acts as templates, and three different levels of lexeme modeling: a template-based approach in which a single template per lexeme is used as the reference set in a nearest neighbor classifier, a single hidden Markov model per character class, and lexeme models built using data from lexeme identification. In 80 Table 5.1: Overall writer-independent results for three different classification prob- lems. Digits Lowercase Chars. Uppercase Chars. Classifier No. of No. of No. of Models Acc. Models Acc. Models Acc. 6148 97.74% 40,269 89.12% 11,158 84.29% String Matching Full Set String Matchmg 27 77.39% 84 50_11% 57 44.58% Reference Set Single Lexeme HMM 10 95.62% 26 84.43% 26 84.04% Mult. 
Lexeme HMM 27 98.0% 84 86.85% 67 89.47% the template-based approach, each reference template is selected as the character that is closest to the center of that lexeme’s cluster. All lexemes are identified and models are trained using each of the three training sets: digits, lowercase, and uppercase, and then each classifier is tested using the corresponding test sets. While the use of reference templates to represent the lexemes greatly reduces the computational requirements of the string matching measure, these templates do not represent the intra-class variance very well. On the other hand, hidden Markov models are well suited for this task and, in the case of both digits and uppercase characters, even outperformed string matching using the full set of training data as templates. In addition, these lexeme models significantly improve the accuracy over that obtained with a single model per character class. Overall, the best results of 98.0% recognition accuracy on digits, 86.85% accuracy on lowercase characters, and 89.47% accuracy on uppercase characters were obtained using a recognizer in which a hidden Markov model was trained for each lexeme. These models form a compact representation of the data, while obtaining similar recognition results to 81 that of using the full training set as templates in our string matching nearest neighbor classifier. The throughput of the HMM classifier is approximately 7.2 characters per second for digits, 1.0 characters per second for lowercase characters, and 1.8 characters per second for uppercase characters. These timings are reported on a 296MHz Sun UltraSparc workstation. 5.4.1 Lexeme Identification by HMM-based Clustering Perrone and Connell [78] integrated HMM-based clustering in the IBM Pen Tech- nologies Group handwriting recognition engine [72] to identify lexemes. This is an unconstrained handwritten word recognizer, based on HMMs in which the state ob- servations are represented with a mixture of Gaussian densities. 
These densities are taken from a pool of densities estimated over the entire feature space, making the representation slightly different from the Gaussian densities derived for our HMMs. These experiments showed a drop in error rate of as much as 8% within the first 2 iterations when tested on 4,920 words in unsegmented sentences taken from 188 different writers.

Based on the success of this work [78], the advantages of identifying lexemes in our continuous parameter HMM recognizer through the use of HMM-based clustering were explored. Figure 5.2 shows the recognition accuracies obtained for the HMM-based clustering algorithm, using 40,269 training examples, and tested on an independent set of 40,255 lowercase handwritten characters from over 100 writers. Due to the computational requirements of this algorithm, experiments were limited to 10 iterations.

Figure 5.2: Performance of HMM-based clustering over 10 iterations. The "baseline" case shows the accuracies obtained at each iteration of the algorithm when the lexemes at iteration 0 are defined using string matching as described in Section 3.4. The results on this initial set of clusters are also shown as a horizontal line (labeled "String Match") for comparative purposes. Three additional cases in which the lexemes at iteration 0 were randomly assigned are shown as "Random 1", "Random 2", and "Random 3".

Iteration 0 represents the accuracy of the initial set of clusters.
For the example labeled "baseline", these initial clusters have been defined in the string matching space as described in Section 3.4. As the figure shows, the best recognition accuracy for the "baseline" experiment is achieved at iteration 2, where there is a 2.25% reduction in error rate from the initial set of lexemes.

Since the algorithm may potentially be limited by poor initial clusters (which could cause the algorithm to converge toward a local minimum), three additional clustering experiments were run in which the initial set of clusters was defined randomly. For each of these experiments, the number of clusters was set to be the same as in the "baseline" case, and the cluster membership assignments were uniformly distributed across the training data. As can be seen in Figure 5.2, the random initialization trials quickly converge toward our best results after a few iterations, and, in the case of the "Random 1" and "Random 3" trials, obtain a clustering that consistently outperforms the "baseline" case at each iteration. While the differences in error rate between these random trials and the "baseline" trial are not very large, they do indicate that the HMM-based clustering algorithm can be used to obtain a good clustering without relying on a sophisticated method of initializing the clusters. The best overall error rate achieved is 12.51%, obtained at iteration 10 of the "Random 3" trial. This is a 4.9% reduction in error rate compared to the error rate obtained using the string matching method of lexeme identification.

5.5 Summary

We have presented a handwritten character recognition system based on hidden Markov models. Such parametric models can represent the distribution of training data more compactly than non-parametric methods. The identification of lexemes can be used to decompose the complex feature space of handwritten characters into multiple, simple spaces, which can be represented with simpler hidden Markov models.
The recognition accuracy of such a system has been shown to be superior to that of a system using a single HMM per character class. In addition, we have proposed a method of refining these models, referred to as HMM-based clustering, in which the cluster membership is driven by the trained models themselves. This modification has been shown to reduce error rates by 4.9% on a lowercase character classification task.

Chapter 6

Information Fusion

The fusion of different information sources can be of great advantage to a recognition system, providing a greater classification accuracy than any classifier built upon a single source of information. In this chapter, the use of classifiers built from different feature sets is explored, in which the individual feature sets attempt to capture some unique information that is not present in the other feature sets. A method of combining the outputs of these classifiers is discussed, and results are presented on the problem of alphanumeric character recognition. In Chapter 7, this method will then be used to improve the accuracy, while reducing the computational requirements, for word recognition.

6.1 Introduction

Information fusion is a field of study that has received much attention recently. When multiple sources of information exist regarding a decision making problem, the goal is to make the best decision possible given these different sources. This is a challenging problem since it requires handling potentially conflicting information, having an understanding of the uncertainties related to the sources of information, and possibly handling missing information. Classifier combination is a form of information fusion in which the sources of information are the outputs of a set of classifiers. Much work has appeared in the literature on classifier combination [39, 46, 49, 53, 97].
The type of combination method used is often driven by the type of information that is available at the output of the component classifiers. This may be a single classification label, a ranked set of labels, or a set of posterior probabilities. The success of a combination of classifiers depends on their degree of "independence". Therefore, component classifiers trained from different feature sets and/or using different modeling techniques are desired. Methods of bagging [12] and boosting [31, 85, 97] train different classifiers from a single source of training data. Bagging accomplishes this by subsampling the data with replacement, while boosting trains a set of classifiers in series, such that classifier i focuses on the misclassifications of the combination of classifiers 1 through i − 1.

6.2 Component Classifiers

Classifiers were constructed for each of the five different feature sets shown in Table 6.1. As a preprocessing step, the location of each character was normalized by first detecting the baseline of the character and then subtracting the baseline height from the character height, such that the y coordinate of each sample point is taken with respect to the baseline. In order to increase the reliability of baseline detection, the baseline was detected for an entire horizontal line of written characters. (This baseline detection was done at IBM prior to our processing of the data.) This method is described in more detail in Section 7.2.1.

Table 6.1: Characteristics of the different classifiers used.

  Name                Local    Crit. Pt.  Spatial/Dir.  End-point  COG
  Classifier Type     HMM      HMM        NN            NN         NN
  Feature Type        Dynamic  Dynamic    Static        Static     Static
  Size Normalized?    Yes      No         Yes           Yes        Yes
  Multistroke Model?  No       No         Yes           Yes        Yes

6.2.1 Local Features

These features provide information on the dynamics of writing at a very high resolution. They are the arc length, y, sin(θ), and cos(θ) features described in Section 5.2.1.
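A rough sketch of how such per-point dynamic features can be computed from a sequence of (x, y) samples follows. The exact feature definitions are those of Section 5.2.1, which is not reproduced here, so treat the formulas below as illustrative:

```python
import math

def local_features(points, baseline_y=0.0):
    """Per-point dynamic features: cumulative arc length, baseline-relative
    y, and sin/cos of the local writing direction.  A sketch only; the
    precise definitions live in Section 5.2.1."""
    feats = []
    arc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        arc += seg
        if seg == 0:                  # repeated sample: direction undefined
            continue
        feats.append({
            "arc": arc,               # arc length traveled so far
            "y": y1 - baseline_y,     # vertical position w.r.t. the baseline
            "sin": (y1 - y0) / seg,   # direction of travel
            "cos": (x1 - x0) / seg,
        })
    return feats
```

For a purely horizontal stroke, for example, the direction features stay at sin = 0 and cos = 1 while the arc length accumulates point to point.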
While these features are good for representing very local structures in the handwritten strokes, they do not necessarily represent the global structural properties well.

6.2.2 Critical Point Features

The inability of the Local features to represent global structures has motivated a second dynamic feature set to focus on such structures. These features measure the curvature and orientation of stroke segments centered around a critical point, as well as the vertical location of the critical point. In addition, some local properties are measured within a neighborhood of the critical point. For each critical point, i, corresponding to an original sample point S(i), the features extracted can be summarized as follows:

1. The angle, θ, between the two line segments formed by connecting critical point i to i + 1 and i to i − 1. The angle is measured such that 0 ≤ θ ≤ 180.

2. sin(φ) and cos(φ), where φ is the angle between the x-axis and the direction in which the center of concavity of θ is pointing.

3. Δx and Δy measured from critical point i − 1 to i + 1.

4. The vertical location, y, of the critical point, i.

5. Δx and Δy for each consecutive pair of points from sample point S(i) − 10 to sample point S(i) + 10.

6. sin(ψ) and cos(ψ) for each consecutive triple, (k − 1, k, k + 1), of points from sample point S(i) − 10 to sample point S(i) + 10. ψ is calculated as the angle formed by the line segments k to k − 1 and k to k + 1. The angles are measured such that 0 ≤ ψ ≤ 180.

The features are pictorially demonstrated in Figure 6.1. Together, these features form an 83-dimensional feature vector for each resampled point, which is reduced to 60 dimensions by Principal Component Analysis [42]. For the computation of these features, characters were not size normalized in the manner described in Section 3.2. This is intended to make this feature set suitable for word recognition, when character segmentation is not known. Size normalization is instead accomplished for an entire horizontal line of written characters in order to simulate the type of size information
This is intended to make this feature set suitable for word recognition when character segmentation is not known. Size normalization is instead accomplished for an entire horizontal line of written characters in order to simulate the type of size information 89 that would be available for normalization if the character were to appear in a word. This method is described in more detail in Section 7.2.1. 6.2.3 Spatial/ Directional Features The previously described features make use of dynamic stroke information, but do not adequately capture the spatial relationships. Many offline character recognition features capture this type of information (see [20]), and therefore such features should compliment the information encoded in the online features. A set of features, moti- vated by the offline features described in Takahashi [102], have been designed for the recognition of online characters. The character is size normalized such that it’s bounding box is a standard size. For some characters, their small size and location can best be used to distinguish them from other characters, therefore we wish to preserve this information. This is accomplished by placing the following restrictions on the bounding box: 1. The bottom of the box cannot be higher than the baseline. 2. The top of the box cannot be lower than the midline. 3. The width of the box cannot be thinner than a minimum width which is one half the distance between the baseline and the midline. If this is the case, the box width is enlarged equally on both the left and right sides, such that it this minimum width. The bounding box is divided into a 5 x 5, grid. Features are calculated indepen- dently for each of these grid sectors, thereby capturing spatial location information 90 Figure 6.1: An example of Critical Point features: (a) calculation of structural fea- tures 0, and <15, (b) y (in relation to baseline shown), and critical point (Ax, Ay), (c) calculation of local neighborhood features Ax, Ay, and 2p. 
about the properties measured. All stroke segments (from one sample point to the next) within a sector are quantized such that their direction is measured as either North-South, East-West, Northwest-Southeast, or Northeast-Southwest. For each of these directions, the total stroke length traveled in that direction is measured for each grid sector, forming a 100-dimensional feature vector (5 × 5 sectors × 4 directions). Figure 6.2 shows an example of this feature extraction method.

One additional piece of information that can be made use of is the stroke segmentation. We extract a single 100-dimensional feature vector for each stroke in the character, forming a (100 × N)-dimensional vector, where N is the number of strokes forming the character. Classification is accomplished through the use of a nearest neighbor classifier, in which an N-stroke character is only compared against N-stroke templates.

Since a nearest neighbor classifier requires a number of comparisons that increases linearly with the training set size, a way of identifying a more compact representation is desired. Vector quantization [33] is a method of reducing a set of training vectors to a smaller set of representative vectors, with an emphasis on minimizing the distortion that is introduced through this representation. The OLVQ1 algorithm of the Learning Vector Quantizer (LVQ) package [55] is used to obtain these representative vectors. This is a self-organizing map [54] which finds representative vectors by iteratively maximizing the nearest neighbor classification accuracy on the training set.

Figure 6.2: An example of spatial/directional features extracted from the single-stroke character d, a ".", and the two-stroke digit 4.
Intensities indicate the magnitude of the features in each sector of the 5 × 5 grid in (a) the horizontal direction, (b) the vertical direction, (c) the southwest-northeast diagonal, and (d) the northwest-southeast diagonal. The original character is also shown (e) in relation to the baseline.

6.2.4 Stroke End-point Features

These are simple features which focus on the spatial relationship between the starting and ending points of each stroke. After size normalization, a pair of features is calculated for each stroke forming the character: the Δx and Δy from the starting point of the stroke to the ending point of the stroke. This forms a (2 × N)-dimensional feature vector, where N is the number of strokes forming the character. Classification is accomplished through the use of a nearest neighbor classifier, in which an N-stroke character is only compared against N-stroke templates. LVQ is used to reduce the number of template comparisons required per classification, as described in Section 6.2.3.

6.2.5 Center of Gravity Features

Another simple large-scale structural characteristic is the location of each stroke in the character. This is encoded as the center of gravity of each stroke, (x, y). This forms a (2 × N)-dimensional feature vector, where N is the number of strokes forming the character. A nearest neighbor classifier is created using a set of quantized vectors trained using LVQ, as in Section 6.2.3. Characters are first size normalized.

6.3 Classifier Combination

Our classifier combination problem involves both HMM and nearest neighbor classifiers. Therefore, posterior probabilities are not available from all the classifiers, and at best they provide a set of ranked classification labels. Given a ranked list of N class labels from a classifier, the possibility still exists of any class label being the true class label.
We can express the probability that the actual class label is C_i as

  P(C_i | Q_j(O)) = P(C_i | Q_{j,1}(O), Q_{j,2}(O), ..., Q_{j,N}(O)),  1 ≤ i ≤ N    (6.1)

where O is the observed pattern, and Q_{j,k}(O) is the kth best class label in the ranking given by classifier j. Estimation of this posterior probability may not be feasible, as it requires examples in which a classifier returns every possible ranking for each given truth value. Therefore, this probability can be simplified to

  P(C_i | Q_j(O)) = P(C_i | Q_{j,1}(O), Q_{j,2}(O), ..., Q_{j,M}(O)),    (6.2)

for some value of M, where M ≤ N.

Under the assumption that the information, Q_j(O), provided by different classifiers is statistically independent, the final classification is taken as the class C_i which maximizes the following posterior probability:

  P(C_i | O) = P(C_i) ∏_{j=1}^{R} [ P(C_i | Q_j(O)) / P(C_i) ]    (6.3)

where R is the number of component classifiers involved in the combination. If we assume that these posterior probabilities do not deviate dramatically from the prior probabilities, it has been shown [53] that this can be expressed as

  P(C_i | O) = P(C_i) ∏_{j=1}^{R} (1 + δ_{ij}) ≈ P(C_i) + P(C_i) Σ_{j=1}^{R} δ_{ij},    (6.4)

where δ_{ij} << 1 is the deviation of the posterior probability, P(C_i | Q_j(O)), from the prior, P(C_i). This can then be further simplified to

  P(C_i | O) = (1 − R) P(C_i) + Σ_{j=1}^{R} P(C_i | Q_j(O)),    (6.5)

which, under the assumption of equal priors, can be approximated by

  P(C_i | O) ∝ Σ_{j=1}^{R} P(C_i | Q_j(O)).    (6.6)

The summation form of this posterior probability is more robust to missing information than the product form, since the product form will force the posterior probability to go to zero if a single classifier's posterior probability is zero.

This classifier combination technique is similar to techniques based on confidence measures, in that P(C_i | Q_j(O) = C_i) is a measure of confidence in the ability of classifier j to correctly classify a character of class C_i. However, this technique also makes use of the confidence in the common misclassifications of each classifier.
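A minimal sketch of this sum-rule combination follows. It assumes each classifier j supplies a lookup table conf[j][predicted][true] estimating P(C_true | Q_j(O) = predicted), for example from held-out confusion counts; the table names and toy numbers are illustrative, not from the thesis:

```python
def combine(predictions, conf, classes):
    """Sum-rule combination (Eq. 6.6): score each class by summing
    P(C | Q_j(O)) over classifiers, where predictions[j] is classifier j's
    top label and conf[j][predicted][true] estimates P(true | predicted)."""
    scores = {c: 0.0 for c in classes}
    for j, predicted in enumerate(predictions):
        table = conf[j].get(predicted, {})
        for c in classes:
            scores[c] += table.get(c, 0.0)
    return max(scores, key=scores.get)

# Toy example: classifier 0 systematically confuses the letter 'O' with the
# digit '0'; that known confusion still contributes evidence for the truth.
conf = [
    {"O": {"O": 0.55, "0": 0.45}, "0": {"0": 0.6, "O": 0.4}},  # classifier 0
    {"O": {"O": 0.7, "0": 0.3},   "0": {"0": 0.7, "O": 0.3}},  # classifier 1
]
```

Here combine(["0", "O"], conf, ["O", "0"]) returns "O": classifier 0's erroneous "0" still adds 0.4 toward the true class, which, together with classifier 1's correct vote, outweighs the wrong label.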
For example, P(C_i | Q_j(O) = C_e) is the probability that the true class is C_i when classifier j claims, in error, that it is class C_e, i ≠ e. If this error occurs frequently, this information will still contribute (albeit probably not as much as a correct classification) toward the combined posterior probability of the correct class. The advantage of this is that consistent misclassifications by a single classifier can still contribute to the correct combined classification.

As a natural extension to this method, the classification Q_j(O) returned by a classifier j can be in the form of a lexeme identity rather than a character class identity. In the following experiments, Q_j(O) is the lexeme identity for the HMM classifiers, while it is the character class identity for the nearest neighbor classifiers.

6.4 Results

Classification results for each of the classifiers described in the previous section are shown in Tables 6.2 and 6.3. Each classifier was constructed for the task of distinguishing between 62 different character classes: 26 lowercase letters, 26 uppercase letters, and 10 digits. The training set consists of a total of 57,575 characters randomly selected from a pool of data written by over 100 writers, and results are reported on an independent set of 57,541 characters randomly selected from this same pool.

Table 6.2: Top 10 classification results on 57,541 alphanumeric characters. Case-dependent (62 classes) results are reported.

  Features             Accuracy (% within top N)
                       1      2      3      4      5      6      7      8      9      10
  Local                76.45  91.66  95.61  97.01  97.80  98.32  98.66  98.90  99.10  99.23
  Critical Point       68.81  82.32  87.67  90.60  92.34  93.70  94.67  95.39  96.01  96.49
  Spatial/Directional  64.92  75.92  80.79  83.70  85.74  87.25  88.43  89.33  90.15  90.84
  Stroke End-point     21.27  27.32  31.75  35.06  38.47  41.78  44.71  47.94  50.38  52.57
  Center of Gravity    23.60  33.41  38.83  43.85  48.45  52.79  56.45  60.00  62.82  65.21

Results are also reported when the case (uppercase vs.
lowercase) is ignored (e.g., a classification of 'e' or 'E' is considered the same). In most cases, this can be considered a very minor confusion, since it will not change the meaning of a word.

Table 6.3: Top 10 classification results on 57,541 alphanumeric characters. Case-independent (36 classes) results are reported.

  Features             Accuracy (% within top N)
                       1      2      3      4      5      6      7      8      9      10
  Local                84.32  94.02  96.48  97.63  98.25  98.68  98.96  99.16  99.32  99.41
  Critical Point       71.81  84.96  89.96  92.51  94.03  95.21  95.99  96.57  97.08  97.47
  Spatial/Directional  69.10  79.53  84.24  87.04  89.04  90.54  91.65  92.51  93.25  93.86
  Stroke End-point     23.60  30.53  36.35  41.41  45.69  49.99  53.35  57.39  60.83  63.43
  Center of Gravity    26.23  37.74  43.97  49.68  54.77  59.59  63.74  67.55  70.57  73.06

The best results using a single feature set are obtained using the Local feature set, which achieves a 76.45% accuracy for case-dependent classification, and an 84.32% accuracy for case-independent classification. Also of note are the Critical Point features, for which the correct character is found within the top 4 hypotheses over 90% of the time, and the Spatial/Directional features, for which the correct character is found within the top 9 hypotheses over 90% of the time. It is apparent (as would be expected) that the much simpler Stroke End-point and Center of Gravity features are not very useful by themselves; however, their potential to add a new piece of information in a classifier combination can be considered.

Our classifier combination technique can be used to combine the component classifiers in any combination. Table 6.4 presents the classification accuracies for each of these possible combinations. For these experiments, M = 1, such that classification is based on only the single best classification from each classifier.
It should be noted that this technique can also be used for a single component classifier, and has the potential to increase the accuracy of that component classifier by focusing on its common misclassifications. These results are also shown in Table 6.4, and as can be seen, the component classifier accuracies are improved for all the classifiers, with the exception of the classifier based on Stroke End-point features.

The single best combination of two classifiers is the combination of the Local features and the Critical Point features. While the Center of Gravity features did not produce an impressive classification accuracy when used alone, when added to any other combination of classifiers they result in an increase in accuracy (except when combined with the other two offline features, in which case the accuracy remains the same). This indicates that these features may capture some information not present in the other features. The same is true of the Stroke End-point features when combined with any of the online features. However, the results are mixed when combined with other offline features, possibly indicating a reinforcement in the ambiguities of these classifiers. The best results are obtained when the two online classifiers are combined with the Spatial/Directional features and one or both of the other two offline classifiers, producing an overall recognition accuracy of 84.0%.

Table 6.5 presents the case-dependent accuracies within the top N choices (1 ≤ N ≤ 10) for the five best classifier combinations shown in Table 6.4. It is apparent from this table that when more than two choices are to be considered, the much simpler combination of the Local, Critical Point, and Spatial/Directional features performs the best. For this combined classifier, the correct classification falls within the top 10 choices approximately 99% of the time.

Table 6.4: Classifier combination accuracies on 57,541 alphanumeric characters. Case-dependent (62 classes) results are reported. [The indicator columns of this table, which mark with an "X" the subset of the five classifiers (Local, Crit. Pt., Spatial/Dir., End-point, COG) used in each combination, did not survive extraction. The single-classifier rows read: Local 77.9%, Crit. Pt. 69.7%, Spatial/Dir. 64.9%, End-point 22.1%, COG 25.1%; the multi-classifier combinations range from 29.0% up to a best of 84.0%.]

Table 6.5: Accuracies within top 10 choices for the 5 best classifier combinations.

  (A) Local, Crit. Pt., Spatial/Dir., End-points, COG
  (B) Local, Crit. Pt., Spatial/Dir., End-points
  (C) Local, Crit. Pt., Spatial/Dir., COG
  (D) Local, Crit. Pt., Spatial/Dir.
  (E) Local, Crit. Pt., End-points, COG

  N    (A)     (B)     (C)     (D)     (E)
  1    84.02   83.96   83.97   83.72   81.75
  2    91.85   91.65   91.73   91.57   91.11
  3    95.02   95.03   95.10   95.22   94.13
  4    96.42   96.54   96.54   96.89   95.59
  5    97.19   97.29   97.31   97.54   96.35
  6    97.61   97.65   97.73   97.96   96.87
  7    97.92   97.96   98.01   98.30   97.25
  8    98.14   98.24   98.25   98.54   97.61
  9    98.31   98.48   98.47   98.75   97.88
  10   98.51   98.69   98.59   98.92   98.04

The form of our classifier combination has the potential to make use of more than just the top choice from each classifier (the case for M = 1). However, as was mentioned in Section 6.3, it is not clear that incorporating knowledge for M > 1 is practical, due to the rapidly increasing data requirements to train these probabilities as M increases. Table 6.6 presents the results of a classifier combination where M = 2, for one of our best classifier combinations, with each of the classifiers in this combination also scored independently with this method. An examination of these results indicates that there is no increase in accuracy from the inclusion of the second best choice for either the Critical Point or Spatial/Directional classifiers (in fact, there is a reduction in accuracy). However, the Local features classifier obtains an increase in accuracy of approximately 1.5 percentage points.
This improvement is also apparent in the combination of these three classifiers, which achieves an accuracy of approximately 85%. In addition, the correct classification is found approximately 97% of the time in the top 4 choices.

Table 6.6: Top 10 classification accuracies for classifier combinations with M = 2.

  N    Local    Crit. Pt.  Spatial/Dir.  Local, Crit. Pt., Spatial/Dir.
  1    78.43%   68.67%     64.12%        84.89%
  2    89.80%   78.83%     76.12%        92.71%
  3    91.98%   81.75%     80.71%        95.72%
  4    92.65%   82.96%     82.98%        96.96%
  5    92.99%   83.72%     84.62%        97.52%
  6    93.17%   84.19%     85.82%        97.89%
  7    93.27%   84.52%     86.76%        98.14%
  8    93.32%   84.82%     87.44%        98.33%
  9    93.36%   85.07%     87.93%        98.46%
  10   93.40%   85.26%     88.41%        98.54%

6.5 Summary

In this chapter, a method of information fusion through the combination of multiple classifiers built on different feature sets has been presented. These feature sets have been designed such that they focus on different information about the characters, such as viewing the character from an online vs. an offline perspective, the scale of the structures that the features attempt to represent, the degree to which the characteristics captured are spatially dependent, and the relationship between different high-level structures, such as the center of gravity of strokes and the distance between the endpoints of a stroke. Our classifier combination method focuses on the frequency of confusions for each classifier, and it has been demonstrated that this creates a potential to improve the accuracy of even a single component classifier. The ability of this method to allow classifiers to contribute to the combined classification, weighted by their certainties of each possible class, facilitates improvements in accuracy even for weak classifiers, such as those based on the stroke end-point and center of gravity features, as long as they capture some useful information not captured by the other features in use.
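Error reductions like those quoted in this chapter are relative reductions in error rate, not absolute accuracy differences. A short sketch of the computation, using the best single-classifier accuracy (Local, 76.45%) and the best combined accuracy (84.0%) reported above:

```python
def relative_error_reduction(acc_old, acc_new):
    """Relative reduction in error rate when accuracy improves from
    acc_old to acc_new (both given in percent)."""
    err_old = 100.0 - acc_old
    err_new = 100.0 - acc_new
    return 100.0 * (err_old - err_new) / err_old

# Best single classifier (76.45%) vs. best combination (84.0%):
print(round(relative_error_reduction(76.45, 84.0)))  # -> 32
```

The 7.55-point gain in accuracy corresponds to removing roughly a third of the remaining errors, which is how a "32% reduction in error" arises from these two figures.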
The best combined classifier achieves an accuracy of approximately 84.0% for the 62-class alphanumeric character recognition problem (26 lowercase, 26 uppercase, and 10 digits). This is a 32% reduction in error as compared to the best single classifier accuracy. In addition, it was shown that the correct classification falls among the top 10 choices in approximately 99% of cases. This was further improved by taking into account the top 2 choices from each classifier in the combination, to achieve an accuracy of approximately 85%, with the correct class among the top 4 choices approximately 97% of the time.

Chapter 7

Word Recognition

7.1 Introduction

The main goal of handwriting recognition is to correctly recognize individual words selected from a large lexicon. While it is not the goal of this thesis to produce a commercial-grade word recognition system, it is of great interest to demonstrate the effects of lexeme modeling, lexeme-based writer-adaptation, and information fusion when applied to the word recognition problem. To this end, a word recognition system has been constructed that is capable of recognizing words from a small vocabulary, and results are reported for 10 writers. The complexity of the inter-writer variations, however, is apparent in this data, and the challenge is to robustly handle such variations.

Section 7.2 presents an overview of the word recognition system, including a description of the preprocessing steps, the models used, and methods of reducing the search space. Characteristics of the word-level data used for these experiments are discussed in Section 7.3, and the experimental results are presented in Section 7.4.

7.2 System Overview

Figure 7.1 presents an overview of the word recognition system. Each handwritten item is presented to the system as a single horizontal line of pre-segmented words.
In a sentence-level recognizer, word-level segmentation can be achieved by way of spatial clustering of the handwritten strokes for well separated words, and by extension of the single-word recognizer to handle word pairs for pairs of words that are not well separated. However, for this study word segmentation is assumed to be complete and accurate.

7.2.1 Preprocessing

This section describes the preprocessing steps taken specifically for word recognition. Other preprocessing steps are described in Section 3.2.

Certain preprocessing steps are first applied to the entire line of writing before recognition is attempted for each individual word in the line. In particular, these are the midline and baseline detection, and slant detection. Since short (possibly single character length) words do not give a very robust estimation of these values, they are estimated for the entire line of writing, and then individual words are normalized using these values.

Baseline/Midline Detection and Normalization

Normalization for the size and location of words presents a more difficult challenge than for individual characters. Since character models are used for recognition, we

[Figure 7.1 (flowchart, garbled in extraction): a line of pre-segmented words (word 1 ... word n) passes through baseline/midline detection and correction and slant detection and correction before recognition.]

[Lexicon figure: the vocabulary used in the word recognition experiments:]

A Admit Advancing African Age Almost America America's American American! Americans Americans' Americans. An And Antarctica. Anytime Are Area" Arrests Asia; Average BROWN Bankruptcy Barbers Be Before? Below Between Birth Bread. Bush. By C:\"j6\"zq|, Can Chicken Choosing City Color Come Countries Cox, Current DOG. Daily, Debt. Did Divorced) Divorced). Do Dues-paying Each Education} Europe Europe. Exceeds Exhibit Expectancy Experience.
FOX Fact: Federal Female Finds For From Gallons Gets Gov't. Graduates Greenland Growth Growth. Half Highest How I've Immigrants In Insurance Insurance. Is JUMPED Jan's Jogging John Just Keep Kept Know LAZY Land Last Led Less Life Lives Longest Male Males} Married May Median Medical Men! Men, Miami Mike's More Mr. Music Music. National Nationally, Nearby New News: Now. OVER Objectively Objectively, Occur Ocean Of On One Only Owned Park Per Person Population Prices Prisoners Projected QUICK Quarry Quarterly, Rates Recorded Reserves Seconds September? Shoppers Show Single Slowly, Smith Southern Southernmost Spend Spending States Statistics Student Sufficient THE Tenure; Than That The To Today, US. U.S.'s USA. United Uses Vice Visits Volunteer Was Water Water. When While Xenon Year Year, Year. Yearly, Years' Years,] Years? York Zaire Zinc [In [Now] [down 'German,' 'strategic' a about above accounts added ads. adults adults? after ago. aims airplanes all-sports ancestry and annually. apple approx. are as average away be become billion blindness block break building-construction bulk bus-load by can, carbon children cigarette citizens claimed cm) coal come common companies consumption? conveyed countries? cut data. day declined decreased degrees, disk, documented don't economy edged education effectively employment energy? equals exactly exceed expects fallen falls fields. filings fiscal fleet for frequent from fuel. gal. gas goes government granted grew guy halved harvest's have helped home. households huge idled if ignite imports in in. include inert. info, invoke is isn't jet job jobs just know lady! law likely little logs many master's} memberships million mixed monoxide moral more most much music name? natural not now number of oil okay? on only ordered other outman over patents people people. percent performed phase population. possibly pounds.
Figure 7.6: Variations between writers for handwritten words. (a) Writer 1, (b) Writer 2, (c) Writer 3, (d) Writer 4, and (e) Writer 5.

Figure 7.7: Some examples of handwritten words containing "Overwriting".

... original strokes. Therefore, the search requirements would drastically increase. Currently, overwriting is considered an open problem, and we do not explicitly attempt to handle it in our classifier.

7.4 Results

For these experiments, a set of models was trained to recognize characters from 93 different classes: 26 lowercase characters, 26 uppercase characters, 10 digits, and 31 symbols. Lexemes were identified within these classes using a set of 122,410 characters containing both discretely written handprinted and cursive characters. Table 7.2 shows the amount of data available and the number of lexemes identified for each character class.

Table 7.2: The number of examples available and number of lexemes identified for each character class.
Class Examps Lex    Class Examps Lex    Class Examps Lex
A     1007   4      a     5013   7      !     682    2
B     876    5      b     2287   3      "     287    3
C     929    4      c     3811   3      #     86     3
D     950    3      d     3102   4      $     93     2
E     1020   5      e     5930   4      %     91     5
F     833    6      f     2046   4      &     96     5
G     828    5      g     3037   5      '     683    2
H     759    3      h     2445   4      (     181    2
I     824    7      i     4515   8      )     195    2
J     733    3      j     1173   4      *     86     6
K     793    4      k     2088   5      +     86     2
L     830    3      l     3080   4      ,     683    3
M     858    8      m     2865   3      -     578    3
N     878    6      n     4793   4      .     767    2
O     923    3      o     4136   4      /     90     2
P     848    2      p     2510   4      :     568    2
Q     809    5      q     1619   5      ;     184    2
R     905    3      r     4307   4      <     93     2
S     1055   3      s     4126   4      =     96     3
T     887    4      t     4383   5      >     93     3
U     775    4      u     2984   4      ?     682    4
V     742    3      v     2173   3      @     90     2
W     849    3      w     2156   5      [     93     3
X     809    5      x     1826   5      \     90     2
Y     812    4      y     1985   3      ]     93     2
Z     769    2      z     2134   4      ^     90     2

Class Examps Lex    Class Examps Lex
0     1217   4      `     90     4
1     1203   3      {     86     6
2     1210   2      |     90     2
3     1338   5      }     86     3
4     1333   4      ~     86     3
5     1339   4
6     1348   5
7     1106   6
8     1105   4
9     1092   4

Table 7.3: Word recognition accuracies using curvature/orientation features and a single model per character class.

Writer   No. Test Examples   Case Sensitive   Case Insensitive
#1       571                 57.1%            59.9%
#2       614                 60.9%            62.7%
#3       577                 41.9%            44.9%
#4       605                 55.2%            59.8%
#5       610                 57.2%            60.0%
#6       599                 47.9%            49.4%
#7       609                 34.3%            39.6%
#8       591                 30.1%            32.1%
Total    4776                μ = 48.1%        μ = 51.1%
                             σ = 10.8%        σ = 10.6%

Table 7.3 shows the word recognition accuracies obtained using only the Critical Point features, where only a single model is defined per character class (one lexeme). Both case sensitive accuracies (e.g., an uppercase 'A' and a lowercase 'a' are treated as different character classes) and case insensitive accuracies (e.g., 'A' and 'a' are both treated as the same character class) are reported. Table 7.4 shows the word recognition accuracies obtained when the number of lexemes per class is as shown in Table 7.2. A comparison of Table 7.3 and Table 7.4 shows a drop in average error rate of 32.0% for case-dependent word recognition, and 35.4% for case-independent word recognition.
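The 32.0% and 35.4% figures are relative reductions in the mean error rate between Tables 7.3 and 7.4, which can be verified directly:

```python
def rel_error_reduction(acc_before, acc_after):
    """Relative drop in error rate (both accuracies given in percent)."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return 100.0 * (err_before - err_after) / err_before

# Mean accuracies from Table 7.3 (single lexeme) and Table 7.4 (multiple lexemes)
print(round(rel_error_reduction(48.1, 64.7), 1))  # case-sensitive:   32.0
print(round(rel_error_reduction(51.1, 68.4), 1))  # case-insensitive: 35.4
```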
Table 7.4: Word recognition accuracies using curvature/orientation features and multiple lexeme models per character class.

Writer   No. Test Examples   Case Sensitive   Case Insensitive
#1       571                 79.9%            83.9%
#2       614                 57.5%            58.5%
#3       577                 72.3%            77.3%
#4       605                 69.4%            75.2%
#5       610                 84.1%            87.9%
#6       599                 48.2%            49.4%
#7       609                 51.7%            57.8%
#8       591                 55.2%            58.2%
Total    4776                μ = 64.7%        μ = 68.4%
                             σ = 12.6%        σ = 13.3%

In addition, the multiple lexeme classifier achieves a higher accuracy than the single lexeme classifier in the case of every writer, with the exception of writer #2. The throughput of this system is approximately 0.34 words per second on a 296MHz Sun UltraSparc workstation.

As can be seen from these tables, there is a large variance in the recognition accuracies obtained between writers. Much of this can be attributed to the extent to which a writer's lexemes are represented by the set of writer-independent models in our system. Other errors may be due to current limitations of our system in decoding certain characteristics. For example, the handwritten words produced by Writer #8 contain a very large amount of overwriting. Currently, there are no special facilities for handling this type of writing, and our system is doomed to fail for most of these words.

An examination of Table 7.4 reveals that there is a large variation in the recognition accuracy for individual writers. Our best results are for Writer #5, with an 84.1% accuracy for case-dependent word recognition, and our worst results are for Writer #6, for whom this accuracy is 48.2%. Poor results such as this may be due either to a large class overlap in the lexemes used by this writer, or to a lack of examples of the writer's lexemes in the writer-independent training set. If we choose to ignore differences in the case of the letters, these accuracies are increased by 3.7 percentage points on average.
For some writers, such as Writers #3, #4, and #7, this produces a significant increase in accuracy, indicating that the writer forms some uppercase or lowercase characters that are very similar to their complementary case equivalents. In addition, Writer #8, who was identified as having a large amount of overwriting, obtained a case-dependent recognition accuracy of only 55.2%. Figure 7.8 shows some examples of correctly classified and misclassified words.

Figure 7.8: Some word recognition examples: (a) correctly classified words and (b) misclassified words and the incorrect label assigned by the classifier.

In a word recognition system, it is also of interest to evaluate the quality of the alternative word identities that a word recognizer can produce. If the top word choice is incorrect, and the user wishes to correct this word in his document, it would be more convenient for the user to be able to select the correct word from a list of top choices rather than re-enter the word by pen or keyboard. In addition, the ability of the system to place the correct word among the top N word alternatives is crucial if we are attempting to improve the accuracy of the top choice through a cascading of classifiers, or through the classifier combination techniques discussed in Chapter 6. Table 7.5 presents the percentage of cases in which the correct word identity is ranked among the top N choices by the Critical Point features, for 1 ≤ N ≤ 10.

As can be seen in Table 7.5, the number of correct word classifications found in the top 10 best choices is much more promising. An interesting case is Writer #4, who jumps from an accuracy of 69.4% for the top choice to 88.1% within the top 10. Our best result is for Writer #5, who achieves an accuracy of approximately 95% within the top 8 choices. Most writers who did not receive a good top-choice accuracy (writers #2, #6, #7, and #8) still remain the worst performers when the
top 10 choices are considered. This would indicate that there are some fundamental characteristics of their writing that are not being captured by our system. This may be due to either (a) certain system implementation restrictions, or (b) a lack of representation of the writer's lexemes in our writer-independent set of lexemes.

Table 7.5: Percentage of cases in which the correct word is within the top 10 choices for words from different writers. (The total number of words in our lexicon is 483.)

Writer   Accuracy (%) Within Top N
         1      2      3      4      5      6      7      8      9      10
#1       79.9   84.9   86.9   87.6   87.7   87.9   88.1   88.4   88.8   89.0
#2       57.5   62.7   66.0   68.2   69.4   70.5   71.8   73.0   73.9   74.4
#3       72.3   79.9   82.0   82.8   83.9   85.1   85.1   85.6   86.0   86.1
#4       69.4   79.8   83.5   84.5   86.3   86.8   87.3   87.4   87.6   88.1
#5       84.1   89.8   91.8   92.5   93.4   93.9   94.4   94.6   94.6   94.8
#6       48.2   58.6   60.1   61.8   62.6   63.8   64.9   65.8   67.3   67.8
#7       51.9   63.1   66.7   68.5   70.0   72.2   73.4   74.4   75.7   75.9
#8       55.2   63.3   68.2   72.1   73.6   75.1   76.1   77.0   77.3   77.7

7.5 Summary

In this chapter, a system for the recognition of handwritten words has been described. This system makes use of character models that represent 93 different classes (10 digits, 26 uppercase and 26 lowercase characters, and 31 symbols), and a grammar which describes how these character models can be chained together to form the words in the lexicon. Results were presented for 8 writers tested on a writer-independent recognizer. The ability of this recognizer to capture the large variations in the handwritten characters was shown by the identification and modeling of lexemes in the writer-independent data. For these experiments, a coarse-level feature set, the Critical Point features, was used to provide the first-stage classification and character segmentation, due to its accuracy vs. complexity tradeoff. This achieved a case-dependent recognition accuracy of between 48.2% and 84.1%, for an average of 64.7% on the 8 different writers.
In addition, the percentage of time that the correct classification is within the top 10 choices indicates that a second-stage classification using the Local features could greatly improve these results.

Chapter 8

Devanagari Script Recognition

The ability of our feature sets, lexeme identification method, and information fusion technique to generalize to other languages is explored in this chapter. Devanagari is a script used for several major languages such as Hindi, Sanskrit, Marathi and Nepali, and is used by more than 500 million people. Unconstrained Devanagari writing is more complex than English cursive handwriting due to the possible variations in the order, number, direction and shape of the constituent strokes. In addition, the large size of the Devanagari alphabet makes a keyboard interface difficult to use, thereby increasing the importance of alternative, more natural computer interfaces. In this chapter, an experiment in the recognition of online Devanagari characters is presented, in which a combination of classifiers is used to obtain a good classification accuracy, as described in Chapter 6.

Figure 8.1: A sentence written in Devanagari.

8.1 Introduction

Although much work has been reported in the literature for off-line recognition of Devanagari script [5, 16, 17, 98, 99], to the best of our knowledge, this is the first time on-line recognition of Devanagari has been attempted. Devanagari script is a two-dimensional composition of symbols attached to the top, bottom, left or right of a main character. These composite characters are joined together by a horizontal line, called the "Shirorekha" (see Figure 8.1). The Devanagari alphabet is based on approximately 93 symbols: 13 vowels, 36 consonants, 28 pure consonants (half-forms) and 16 modifier symbols. In addition, a number of conjunct forms exist in which a pure consonant is combined with a consonant to form a new character.
Figure 8.2 shows forty major Devanagari characters. These form the basis of many other symbols in the alphabet. Since the script composition in Devanagari is two-dimensional and the number of symbols is large, its input using a keyboard is cumbersome. This makes the on-line pen computing environment very attractive for Devanagari input.

Text in Devanagari is similar to cursive English writing. A word needs to be segmented and decomposed into recognizable units. It usually consists of three vertical regions: a mid-region containing the main characters, an upper modifier symbol region and a lower modifier symbol region. In this study, we have considered only the main characters that occur in the mid-region. An off-line recognition system usually makes use of a histogram to separate these strips, as was accomplished for English writing in Section 7.2.1.

The representation of Devanagari characters as online data makes stroke segmentations explicit, which can be used to assist the process of segmenting a word into characters. However, the degree of variation in the shape, order, and direction of Devanagari characters is very large. Compared to cursive English, Devanagari contains a larger number of strokes. In addition, the strokes are combined and broken based upon the writer's style of writing, which is usually acquired at the time of learning the script. For example, the same stroke may be written from top-to-bottom/left-to-right or bottom-to-top/right-to-left by two different writers. Similarly, the Shirorekha may be broken and become part of down-going/up-going strokes. As an additional complication in the online stroke space, a part of a stroke may be retraced in the opposite direction. These variations are very common, as can be seen from the sample on-line data (Figure 8.3). Some of these characters cannot be correctly recognized even by humans without using contextual information.
Figure 8.2: Forty major Devanagari characters with English pronunciations and category numbers.

Figure 8.3: Some handwritten examples of the 40 different Devanagari characters.

Table 8.1: Poorly discriminated character pairs.

Character Pair   Distinguishing Characteristic
(31), (26)       top of upper left-most curve and curvature of lower left corner
(31), (11)       gap in vertical line on right side
(31), (11)       the almost non-existent vertical stroke
(30), (08)       curve at top of lower left stroke
(09), (?)        circle at start of lower left stroke
(24), (?)        circle at start of lower left stroke
(21), (25)       circle at start of lower left stroke

8.2 System Overview

The Devanagari recognition task is accomplished through the use of a combination of 5 different classifiers, in which each classifier's error characteristics determine its level of contribution to the final classification. This group of classifiers is made up of two hidden Markov model classifiers, which focus on two different levels of local features
along the trace of the on-line sample point sequence, and three nearest neighbor classifiers that focus on off-line features. In order to better represent the complex intra-class variations that exist, lexemes are identified within each character class, and these are represented by separate models (for the HMM classifiers). A typical example of each of these writing styles (the pattern closest to the cluster center) can be seen in Figure 8.4.

Figure 8.4: Prototypical examples of each lexeme from each of the 40 Devanagari characters.

An examination of the handwritten characters in Figure 8.4 reveals some of the complexities of the Devanagari classification problem. Poor writing, and possibly input device errors, have made the third and fourth examples of character 07 unreadable, along with the third example of 38, the third example of 31, and the fourth example of 26.

Table 8.2: Feature sets used in each classifier.

Classifier No.   Feature Type   Features
1                on-line        dx, dy, sin(θ), cos(θ) for each sample point
2                on-line        curvature and orientation for each critical point
3                off-line       stroke direction histogram for each box of a 5x5 grid
4                off-line       dx and dy from beginning to end of each stroke
5                off-line       center of gravity of each stroke

Table 8.3: Five different classifiers used in classifier combination.

Classifier No.   Model                 Accuracy
1                Hidden Markov Model   69.2%
2                Hidden Markov Model   43.1%
3                Nearest Neighbor      77.4%
4                Nearest Neighbor      44.7% (1.3% rej)
5                Nearest Neighbor      40.7%
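The Table 8.2 entry for Classifier 3 (a stroke-direction histogram per cell of a 5x5 grid) can be sketched as below; the bin count, the midpoint-based cell assignment, and the other details are illustrative assumptions rather than the thesis implementation:

```python
import math
import numpy as np

def direction_histogram(points, grid=5, nbins=8):
    """Spatial/directional features: split the character's bounding box
    into grid x grid cells and, within each cell, accumulate the distance
    travelled in each of `nbins` quantized directions."""
    pts = np.asarray(points, dtype=float)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    span = np.maximum(hi - lo, 1e-6)          # guard degenerate boxes
    hist = np.zeros((grid, grid, nbins))
    for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        if seg == 0.0:
            continue
        theta = math.atan2(y1 - y0, x1 - x0) % (2 * math.pi)
        b = int(theta / (2 * math.pi) * nbins) % nbins
        # attribute the segment to the cell containing its midpoint
        mx, my = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        cx = min(int((mx - lo[0]) / span[0] * grid), grid - 1)
        cy = min(int((my - lo[1]) / span[1] * grid), grid - 1)
        hist[cy, cx, b] += seg
    return hist.ravel()
```

The resulting 5 x 5 x 8 = 200-dimensional vector summarizes where in the character ink was laid down and in which directions.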
In addition, the fourth example of 09 and the first example of 24 are virtually identical, while the third example of 07 and the third example of 37 seem to be better representatives of each other's character class, demonstrating the large intra-class variability. Some other character pairs with low levels of discrimination are shown in Table 8.1. It should also be pointed out that certain character pairs, in particular characters 10 and 15, are known to be difficult for humans to distinguish in isolation.

Tables 8.2 and 8.3 give a summary of the different classifiers and feature sets used. Since the handwritten strokes are captured as a sequence of (x, y) sample points, we can make use of this trace of the writing to capture information on how the strokes were formed. Classifier 1 extracts features for each equidistant resampled point along the curve of writing. These features are the Local features described in Section 6.2.1.

Figure 8.5: Features for classifiers 1 and 3. The "+" marks the center of the character. Features are calculated between every pair of sample points. Classifier 3 quantizes θ and measures the total distance traveled in each direction for each box in the grid.

Classifier 2 uses the following five features, extracted at each local (x, y) extremum, p_i (referred to as a critical point), not including the endpoints of a stroke:

1. The angle, θ_c, formed between the two segments, one from the previous critical point, p_{i-1}, to the current point, and one from the current point to the next critical point, p_{i+1}, with 0 ≤ θ_c ≤ 180.

2. The orientation of the curve, captured by the sine and cosine of the angle of incline for the normal to the tangent at this critical point.

3. The change in x and the change in y from the previous critical point, p_{i-1}, to the next critical point, p_{i+1}.

This is a variation on the Critical Point features described in Section 6.2.2.
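A simplified version of the critical-point extraction can be sketched as follows. It covers the angle θ_c and the (Δx, Δy) features; the tangent-normal orientation feature is omitted for brevity, and the extremum test is only a plausible reading of the text, not the thesis code:

```python
import math

def critical_points(points):
    """Indices of interior local extrema in x or y (stroke endpoints are
    excluded, as in the text)."""
    idx = []
    for i in range(1, len(points) - 1):
        (x0, y0), (x1, y1), (x2, y2) = points[i - 1], points[i], points[i + 1]
        if (x1 - x0) * (x2 - x1) < 0 or (y1 - y0) * (y2 - y1) < 0:
            idx.append(i)
    return idx

def cp_features(points, cps):
    """For each critical point p_i: the angle between segments
    p_{i-1}->p_i and p_i->p_{i+1}, plus (dx, dy) from p_{i-1} to p_{i+1},
    where the neighbours are the adjacent critical points."""
    feats = []
    for k in range(1, len(cps) - 1):
        pa, pb, pc = points[cps[k - 1]], points[cps[k]], points[cps[k + 1]]
        v1 = (pb[0] - pa[0], pb[1] - pa[1])
        v2 = (pc[0] - pb[0], pc[1] - pb[1])
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        n1 = math.hypot(*v1) or 1e-9
        n2 = math.hypot(*v2) or 1e-9
        theta = math.degrees(math.acos(max(-1.0, min(1.0, dot / (n1 * n2)))))
        feats.append((theta, pc[0] - pa[0], pc[1] - pa[1]))
    return feats
```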
Classifier 3 attempts to capture the directional properties of the written strokes, local to different spatially separate parts of the character. Classifier 4 and Classifier 5, two additional off-line classifiers, use respectively the stroke endpoint and center of gravity features described in Sections 6.2.4 and 6.2.5. Figure 8.5 presents an example of the Local and spatial/directional features extracted from a handwritten Devanagari character.

As discussed in Section 6.3, the final classification can be taken as the class C_i which maximizes the following posterior probability:

    P(C_i | O) = sum_{j=1}^{5} P(C_i | Q_j(O)),    (8.1)

where O is the observed character, Q_j(O) is the function which the jth classifier uses to classify an observed character O, and P(C_i | Q_j(O)) is the probability that the true class of O is class C_i given that classifier j has classified the character as class Q_j(O). These conditional probabilities are estimated by examining the resubstitution accuracies for the HMM classifiers, and the leave-one-out accuracies for the nearest neighbor classifiers.

8.3 Results

We have attempted to recognize the 40 major Devanagari characters (Figure 8.2) that occur in the core strip of the writing. For this study, the modifier symbols and conjunct formation are not considered. Five samples of each character from twenty different writers were collected using an IBM/Cross CrossPad. It should be noted that all our subjects have been living in the United States for some time and are only casual writers of Devanagari. The sample data has a mixture of writing styles, as the subjects come from different geographical regions in India.

A combination of hidden Markov models and nearest neighbor classifiers is used for classification, capturing different levels of on-line and off-line information. Three samples of each character from each writer are used for training, and the remaining two samples have been used for testing.
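The combination rule of Eq. (8.1) can be sketched directly; the nested-dictionary layout of the conditional probability tables, and all the numbers below, are assumptions of this illustration rather than the thesis implementation:

```python
def combine(classifier_outputs, cond_prob, classes):
    """Sum-rule combination of Eq. (8.1): score each class C_i by
    sum_j P(C_i | Q_j(O)), where Q_j(O) is classifier j's decision and
    cond_prob[j][q][c] estimates P(true class c | classifier j said q)
    from confusion counts; return the highest-scoring class."""
    scores = {c: 0.0 for c in classes}
    for j, q in enumerate(classifier_outputs):
        for c in classes:
            scores[c] += cond_prob[j].get(q, {}).get(c, 0.0)
    return max(scores, key=scores.get)

# Toy example with two classifiers and two classes (hypothetical numbers)
cond = [
    {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.3, "b": 0.7}},  # classifier 1
    {"a": {"a": 0.6, "b": 0.4}, "b": {"a": 0.2, "b": 0.8}},  # classifier 2
]
# classifier 1 says 'b', classifier 2 says 'a': scores a=0.3+0.6, b=0.7+0.4
print(combine(["b", "a"], cond, ["a", "b"]))  # -> b
```

Weighting each classifier by how trustworthy its decisions are (rather than by a raw vote) is what lets a weak classifier still contribute where its confusions are well characterized.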
A recognition rate of 69.2% is achieved using a traditional HMM-based on-line recognizer. An examination of the substitution errors revealed that the HMM model did not capture many of the structural properties needed for classification. This motivates the use of multiple classifiers built on features of different scales, in an attempt to capture some of the structural properties of the characters. Our classifier was tested using a total of 1600 characters from 40 classes: 2 samples of each character, independent of the training data, from each of the 20 writers. Table 8.3 shows the accuracies obtained for each classifier independently. The combined classifier achieves an accuracy of 86.5% with no rejects. Since these classifiers can be run in parallel, the classification time is defined by the most time-consuming classifier, which is Classifier 1 due to its use of a large number of local on-line features. The currently unoptimized code processes characters at a rate of approximately 1.5 characters per second on a 250MHz UltraSparc.

The contribution of the two on-line classifiers can be easily underestimated if we look at the recognition accuracies of the individual classifiers (Table 8.3). While Classifier 3 has the best single-classifier accuracy, the addition of the other two off-line classifiers (Classifier 4 and Classifier 5) does not increase the overall accuracy unless we also include at least one of the two on-line classifiers (Classifier 1 or Classifier 2). However, any of the off-line classifiers can be used in combination with Classifier 1 for a significant increase in recognition accuracy (between 2.3% and 6.8% above the higher single-classifier accuracy of the two). In the full combination of five classifiers, our poorest performance is on character 31, which is often confused with character 22. This appears to be a reasonable confusion. Other common confusion pairs are (25, 21), (19, 17), and (29, 30), each of which is structurally similar.
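The cumulative top-N accuracies reported in Tables 7.5 and 8.4 can be computed from each test sample's ranked candidate list. A minimal sketch, using toy data rather than the thesis results:

```python
def top_n_accuracy(ranked_lists, truths, max_n=10):
    """For each n = 1..max_n, the fraction of test cases whose true
    label appears among the top n ranked choices."""
    total = len(truths)
    acc = []
    for n in range(1, max_n + 1):
        hits = sum(t in r[:n] for r, t in zip(ranked_lists, truths))
        acc.append(hits / total)
    return acc

# Toy example (hypothetical candidate lists, not the thesis data)
ranked = [["cat", "cot", "cut"], ["dog", "dig", "dug"], ["pin", "pan", "pen"]]
truth = ["cat", "dig", "pen"]
print(top_n_accuracy(ranked, truth, max_n=3))
```

The sequence is non-decreasing in n by construction, which is why the top-10 columns of these tables always dominate the top-1 column.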
Table 8.4: Percentage of test characters for which the correct class falls within the top 10 choices.

N          1      2      3      4      5      6      7      8      9      10
Accuracy   86.5%  92.1%  94.7%  95.6%  96.8%  97.5%  98.2%  98.3%  98.5%  98.7%

The accuracies reported here are on discrete characters. Many of the common confusions can be corrected through the use of context when these characters are found in a word (of course, this requires a correct segmentation of a word into its constituent characters). The potential benefits of such a method depend on how frequently the correct character class can be found as the second or third choice. Table 8.4 shows the percentage of times the correct classification is found within the top N choices. The successful use of context to select between the top two or three character choices has the potential to significantly boost the performance of our classifier.

8.4 Summary

A system for the recognition of unconstrained online Devanagari characters has been presented. This is the first such work reported, to the best of our knowledge. A combination of classifiers capturing both on-line and off-line features has been described, yielding a classification accuracy of 86.5% with no rejects. In addition, approximately 95% of the time the correct classification can be found in the top 3 choices. Therefore, further improvements in performance are expected by making use of word-level contextual information. In Chapter 6 this approach was shown to produce good results on the English character recognition problem, and therefore these results indicate the ability of the recognition system to handle different languages and symbols.

Chapter 9

Writer-Dependent Systems

9.1 Introduction

In the previous chapters, results have been presented on writer-independent handwriting recognition tasks. In other words, a recognition system was trained on the handwritten data of several writers for the purpose of achieving a good recognition accuracy on a large number of writers.
However, the accuracy of a recognition system for any particular writer can be improved if the character models are adjusted to better match the characteristics of that writer's handwriting, rather than using the more generalized models trained for a large number of writers. Such a writer-dependent system can be constructed for a user in one of two ways: by collecting enough data from that user to create robust models of his handwriting, or by adapting the models of a writer-independent recognition system to better fit the handwriting of those users from whom a relatively small amount of data has been collected.

For practical purposes, a handwriting recognition system designer may want to specify the minimum amount of data that a user must supply for training his models. While a user may be willing to dedicate up to a couple of hours to providing training data if there is some assurance that it may improve recognition accuracies, we cannot expect individual users to provide the same amount of training data as in the writer-independent case. In addition, if the within-class variability in a user's writing can be characterized by more than one style, it may be appropriate to separate these styles into different lexeme models. An increase in the number of models creates a demand for more training data, and therefore we should ensure that we use the available data to the best of our ability.

In this chapter, we explore the role of lexeme modeling for writer-dependent handwriting recognition. This includes the issue of whether or not it is appropriate to model a single writer's handwriting with multiple models per character class, as well as how knowledge of the lexemes present in a writer-independent dataset can be used to assist the writer-adaptation process.
9.2 Writer-Dependent Lexeme Identification

Given enough data, it is possible to identify lexemes within a writer's handwritten data using the same method as in the writer-independent case; however, the amount of data required to form good clusters within a character class makes this method infeasible for the experiments presented here.

9.3 Writer-Adaptation

If we assume that each writer can be expected to provide only a small amount of data during the training phase, this data may not be sufficient to identify lexemes or to train multiple models per character. Therefore, we perform writer adaptation by identifying which writer-independent lexemes best match the writing style of a particular writer. This relatively small number of models is then retrained using the writer's data. Our goal is to make efficient use of the data that is available by using writer-independent models to compensate for the small amount of writer-dependent data.

Our method of adaptation (previously introduced in [23]) begins by classifying each input character provided by the writer into one of the writer-independent lexeme categories. All labeled characters provided by the writer are compared against the writer-independent lexeme models of the same class (e.g., all the 'a' models), and the most similar model gives the lexeme identity. In this way, each writer-specific character is given a label corresponding to one of the writer-independent lexemes. Counts, n_i, are kept for each lexeme L_i of the number of times a character has been identified as lexeme L_i. Lexeme models are then retrained using the newly labeled lexeme data from the user. If the number of examples available from the writer for a particular lexeme is less than a threshold n_T, the corresponding writer-independent lexeme model is retained instead. Figure 9.1 gives a flow diagram of the proposed recognition system.
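The labeling, counting, and retraining steps just described can be sketched as follows; here `distance` and `retrain` stand in for the string-matching and HMM-training machinery, and the whole function is an illustration rather than the thesis implementation:

```python
from collections import defaultdict

def adapt(wi_models, user_samples, distance, retrain, n_t=3):
    """Writer-adaptation sketch: label each user sample with its closest
    writer-independent lexeme model of the same class, count the matches,
    then retrain each lexeme that received at least n_t samples; lexemes
    with fewer samples keep their writer-independent model.
    wi_models maps class -> {lexeme_id: model}."""
    assigned = defaultdict(list)          # (class, lexeme id) -> samples
    for cls, sample in user_samples:
        lexemes = wi_models[cls]
        best = min(lexemes, key=lambda L: distance(sample, lexemes[L]))
        assigned[(cls, best)].append(sample)
    adapted = {}
    for cls, lexemes in wi_models.items():
        for L, model in lexemes.items():
            data = assigned[(cls, L)]
            adapted[(cls, L)] = retrain(model, data) if len(data) >= n_t else model
    return adapted
```

The threshold n_T is what lets a user's sparse data improve the models he actually uses without degrading the lexemes he never writes.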
Lexeme modeling is done using the full set of writer-independent data, and is broken into two stages.

[Figure 9.1 diagram: the writer-independent training set flows through preprocessing into Stage 1 (string matching and clustering) and Stage 2 (HMM training), producing the writer-independent lexeme models; dashed paths route the writer-dependent training set through lexeme labeling and retraining to produce the writer-dependent lexeme models.]

Figure 9.1: An overview of the model training process. Dashed lines show the flow of events for writer adaptation.

Stage 1 performs an initial clustering of the lexemes using a string matching measure and a squared-error clustering algorithm, as described in Section 3.4. Stage 2 defines a continuous hidden Markov model for each of these clusters, using the data within each cluster as the training data. At this point, we have a writer-independent set of lexeme models. These models are then adapted to the individual writers in the dataset. In Figure 9.1, the dashed lines show the flow of events for this writer-adaptation process.
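Stage 1's grouping of a class's samples, given a precomputed matrix of pairwise string-matching distances, can be illustrated with a plain k-medoids pass; this is a stand-in sketch, not the squared-error algorithm of Section 3.4:

```python
import numpy as np

def k_medoids(D, k, iters=20, seed=0):
    """Minimal k-medoids over a precomputed pairwise distance matrix D,
    e.g. string-matching distances between handwritten samples.
    Returns (medoid indices, cluster label per sample)."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))
    for _ in range(iters):
        # assign each sample to its nearest medoid
        labels = np.argmin(D[:, medoids], axis=1)
        new = []
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                new.append(medoids[c])
                continue
            # new medoid minimizes total distance within the cluster
            sub = D[np.ix_(members, members)]
            new.append(members[np.argmin(sub.sum(axis=1))])
        if new == medoids:
            break
        medoids = new
    return medoids, np.argmin(D[:, medoids], axis=1)
```

The medoid of each cluster plays the role of the prototypical lexeme example (the "pattern closest to the cluster center"), and Stage 2 would then train one HMM per cluster.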
[Figure 9.2 diagram: state-transition diagrams with per-state transition probabilities for the writer-independent model (top) and the writer-dependent model (bottom).]

Figure 9.2: The parameters of a hidden Markov model trained for a lexeme of the character 'e' before and after writer adaptation. The number of states has been reduced from 16 to 14.

9.4 Results

Figure 9.2 shows a hidden Markov model for a lexeme of the character 'e' before and after writer adaptation has been performed. This shows that the model parameters have been modified to adapt to the user's handwriting. Figure 9.3 shows the set of lexemes that were identified for the digits written by "Writer #5", along with the number of his training examples that were identified with each lexeme.

Due to the small amount of digit data available from each writer, and the time required to train models, evaluation of writer adaptation was conducted using the resubstitution method. In addition, the number of writers in this experiment was limited to six writers whose training data contained examples of all ten digit classes. The performance of both the writer-independent recognition system and the newly adapted writer-dependent system is shown for each of the six writers in Table 9.1.

Figure 9.3: Lexemes identified in digits written by "Writer #5", and the number of occurrences matching that lexeme in that writer's data.
Note that for this writer, we have identified 15 lexemes (out of a total of 26 lexemes for the writer-independent data). For comparison, a recognition system was trained in which the writer's data was used to train a single model per character class. As was done in the multiple lexeme model case, character classes for which there was not enough data to train a model were represented by the corresponding writer-independent model for that character. Table 9.2 shows the accuracies obtained for writer-adaptation of lowercase characters for 11 different writers. The data from each writer has been randomly split into independent training and test sets. An open question in writer adaptation is whether individual writers can be said to have more than one writing style for a particular character. In the context of our classification problem, this can be interpreted to mean the following: do disparities exist within a character class for a single writer, such that the character class can be more adequately modeled using multiple lexeme models? Table 9.2 shows that, for every writer, the writer-dependent models produce better recognition accuracies than the writer-independent models.

Table 9.1: Writer-dependent digit recognition accuracies using different writer-adaptation methods.

    User        No. Digits   Writer Indep.   Writer Dep. (Single Lexeme)   Writer Dep. (Mult. Lexemes)
    Writer #1       95          94%              100%                          100%
    Writer #2      107          91%               99%                          100%
    Writer #3      108          86%               96%                           98%
    Writer #4       64          91%               91%                           94%
    Writer #5       77          81%               94%                           95%
    Writer #6      111          90%              100%                          100%
    Overall        562          88.8%             97.2%                         98.2%

In addition, an important observation is that there is never a loss of accuracy from using multiple lexeme models, and in most cases this resulted in a significant increase in accuracy. One exception to this is Writer #2, for whom the multiple lexeme model accuracy is less than the single lexeme model accuracy by a single character error. This is not considered a significant difference.
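The relative error-rate reductions quoted in this chapter are computed on error rates, not on accuracies. As a quick check, the overall lowercase accuracies reported in Table 9.2 (76.16% writer-independent, 86.85% single-lexeme, 88.99% multiple-lexeme) reproduce the 16.3% and 53.8% figures:

```python
def relative_error_reduction(acc_old, acc_new):
    """Relative reduction (in percent) of the error rate when
    accuracy moves from acc_old to acc_new (both in percent)."""
    e_old, e_new = 100.0 - acc_old, 100.0 - acc_new
    return 100.0 * (e_old - e_new) / e_old

# Overall lowercase accuracies from Table 9.2
wi, single, mult = 76.16, 86.85, 88.99
print(round(relative_error_reduction(single, mult), 1))  # 16.3
print(round(relative_error_reduction(wi, mult), 1))      # 53.8
```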
Overall, the use of multiple writer-dependent lexeme models reduced the error rate by an average of 16.3% over that achieved using a single writer-dependent model per character class. Furthermore, this is a 53.8% reduction in error compared to the classification accuracies obtained by the writer-independent recognizer. The best improvement occurred for "Writer #20", with a 50.7% reduction in the error rate compared to the single writer-dependent model case, and a 70.8% reduction in error compared to the writer-independent case.

9.4.1 Word Recognition

The ability of our writer adaptation technique was next tested on the word recognition system described in Chapter 7. Writer-independent models used in the adaptation process were trained on 122,410 examples from 93 character classes.

Table 9.2: Lowercase letter recognition accuracies using different writer-adaptation methods.

    Writer    No. Chars (Train/Test)   Writer Indep. Acc. (%)   Writer Dep. Acc. (%)
                                                                Single Lexeme   Mult. Lexemes
    #1           1155 / 1125                 84.82                  91.11           92.89
    #2           1121 / 1094                 83.75                  95.34           95.25
    #3           1190 / 1171                 68.53                  82.66           84.54
    #4           1117 / 1117                 70.73                  83.26           84.15
    #6           1247 / 1223                 69.11                  80.95           82.33
    #7           1177 / 1158                 84.15                  92.40           94.04
    #13          1223 / 1195                 75.06                  85.69           88.20
    #17          1070 / 1049                 85.37                  91.42           92.66
    #18          1339 / 1312                 85.06                  92.61           93.98
    #19          1279 / 1253                 55.25                  74.38           78.21
    #20          1250 / 1226                 78.23                  87.11           93.64
    Overall     13168 / 12923                76.16                  86.85           88.99

A set of discretely written characters in a handprinted style, and a second set in a cursive style, were collected from each writer in the experiment. Table 9.3 shows the amount of data available from each writer. Each data set from each writer covers all 93 classes, with a minimum of 5 examples per class for both the handprinted and cursive style characters in almost all cases. This is a very small amount of data for training. Experiments conducted by Subrahmonia et al.
[100] indicate that a minimum of 500 words of training data, or as much as 2,000 words, could be required to train adequate writer-dependent models for a word recognition system that makes use of models of 93 different character classes. If we assume an average of 5 characters per word, this would be approximately 2,500 to 10,000 characters required to train these models. As can be seen in Table 9.3, we have significantly less data than this.

Table 9.3: Number of character samples available from different writers of both handprinted and cursive styles.

    Writer    No. Handprint    No. Cursive
    #1             487              480
    #2             487              489
    #3             443              464
    #4             487              490
    #5             442              322
    #6             473              432
    #8             476              469

Furthermore, the data is split into at least two different styles, handprint and cursive, effectively dividing the data available to any lexeme model into less than half of the total data for that character. Tables 9.4 and 9.5 show the word-level accuracies for the same writers used in the experiments of Chapter 7, for both case-dependent and case-independent classification. Due to the absence of character-level data for Writer #7, this writer has been omitted from these experiments. A simple modification has been made to the algorithm to accommodate the lack of adequate training data: a copy of all the original writer-independent models is retained along with the newly retrained models to form the final recognition system. This allows the system to capture writing styles that may not have occurred in the training data simply because of their relatively low frequency of occurrence. Nonetheless, those lexemes for which models have been retrained should still dominate classification when a test pattern is an example of that lexeme.

Table 9.4: Case-dependent word recognition accuracies after writer-adaptation.

    Writer    No. Test Examples    Writer Indep. Acc.    Writer Dep. Acc.
                                                         Single Lexeme   Mult. Lexeme
    #1              571                 79.9%                45.2%           79.2%
    #2              614                 57.5%                12.9%           58.8%
    #3              577                 72.3%                36.4%           74.7%
    #4              605                 69.4%                28.3%           72.4%
    #5              610                 84.1%                47.9%           83.8%
    #6              599                 48.2%                17.7%           50.8%
    #8              591                 55.2%                47.2%           62.8%
    Total          4132                 67.1%                33.8%           69.4%

Table 9.5: Case-independent word recognition accuracies after writer-adaptation.

    Writer    No. Test Examples    Writer Indep. Acc.    Writer Dep. Acc.
                                                         Single Lexeme   Mult. Lexeme
    #1              571                 83.9%                47.6%           83.2%
    #2              614                 58.5%                13.8%           59.8%
    #3              577                 77.3%                38.6%           79.4%
    #4              605                 75.2%                31.4%           78.0%
    #5              610                 87.9%                49.8%           87.7%
    #6              599                 49.4%                19.5%           51.9%
    #8              591                 58.2%                49.4%           66.0%
    Total          4132                 70.6%                35.9%           72.8%

The first thing that is apparent from Tables 9.4 and 9.5 is that any attempt to retrain a single lexeme model for each character using just the newly acquired data produces very bad results. This is due to a lack of training data, and this experiment can therefore be used to test the limits of our writer-adaptation method. Writer adaptation achieves an average reduction in error rate of 7.0% for the case-dependent word classification task, and a 7.5% error rate reduction for the case-independent task. While this is not a large improvement in performance, given the very small amount of data available for adaptation, it is still significant. Perhaps even more interesting is the fact that the recognition rate drops, and then only by a minuscule amount, in just two cases, writers #1 and #5. The best result was obtained for writer #8, with an 18.7% reduction in error. In order to demonstrate the scalability of the writer-adaptation algorithm when tested on words, a new set of training characters was collected from a single writer.

Table 9.6: Writer-adaptation for a single writer with a larger set of training examples. Case-independent accuracies are reported.

    User    No. Train Characters    No. Test Words    Writer Indep.    Single Lex.    Mult. Lex.
    #5             2450                   612              72.1%           62.1%          78.6%
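The pooling of retained writer-independent models with the newly retrained writer-dependent models can be summarized as follows: at classification time, the test pattern is scored against the union of both model sets, and the class of the single best-scoring lexeme model wins. A hedged sketch, with toy scoring functions standing in for the HMM log-likelihoods of the actual system (all names here are illustrative):

```python
def classify(pattern, wi_models, wd_models):
    """Pick the class of the best-scoring lexeme model from the pooled
    writer-independent and writer-dependent model sets.
    Each model is a (character_class, score_fn) pair, where score_fn
    plays the role of an HMM log-likelihood."""
    pool = wi_models + wd_models
    best_class, _ = max(pool, key=lambda m: m[1](pattern))
    return best_class

# Toy stand-ins: a writer-independent 'a' lexeme and 'b' lexeme, plus an
# adapted 'a' lexeme that fits this writer's style better.
wi = [("a", lambda x: -abs(x - 0.0)), ("b", lambda x: -abs(x - 5.0))]
wd = [("a", lambda x: -abs(x - 1.0))]
print(classify(0.9, wi, wd))  # 'a' (the adapted lexeme wins)
print(classify(4.8, wi, wd))  # 'b' (covered only by a retained WI model)
```

Because the retrained models score higher on the writer's own lexemes, they dominate when applicable, while the retained writer-independent models catch styles absent from the adaptation data.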
Rather than requesting both cursive and handprint characters, the writer was allowed to write in only his preferred style (which was handprint), and a much larger quantity of data was collected than in the previous word-level experiments. Table 9.6 presents the results of this experiment. While there is still not enough training data to construct a single model per character class that outperforms the writer-independent recognizer, the use of the writer-adaptation algorithm to form multiple lexeme models per character class can be seen to reduce the error rate by 21.7%.

9.5 Summary

We have presented a hidden Markov model based system that performs writer-adaptation by making use of a writer-independent set of writing style models (lexemes) to identify different writing styles in the writer-dependent data. These models are then used to train new models for this subset of lexemes. Lexemes that are present in the writer's data, but that do not contain an adequate number of training examples, are replaced with the writer-independent models. This technique is compared to a system in which a single model per character class is trained from the writer's data. The results presented show that, on average, error rate reductions of 37.5% for digits from 6 writers and 16.3% for lowercase characters from 11 writers are achieved, compared to accuracies achieved using a single writer-dependent model. Other experiments, in which a very small amount of data was available for the adaptation process, produced word-level error rate reductions of 7.0% for case-dependent word classification and 7.5% for case-independent word classification, compared to writer-independent classification accuracies. Moreover, even with such a small amount of data available for adaptation, no significant increase in the error rate is ever introduced by this adaptation method.
In addition, the scalability of this method for word recognition was demonstrated on a single writer with a significantly larger, but still small, training set. This produced a 21.7% reduction in error rate compared to the writer-independent case.

Chapter 10

Conclusions

Many challenges need to be overcome in the field of online handwriting recognition if handwriting is to evolve into a reliable method of data entry. In particular, a recognition system must be able to accommodate the large variations that exist between writing styles. The adaptation of models to the writing style(s) of a single writer has the potential to greatly increase accuracies for that writer. However, the amount of data a single writer is typically willing to provide is not enough to build robust models. Therefore, an adaptation method must make the best use of knowledge gained from a large set of writers to assist it in creating robust models for individual writers. In addition, the combination of multiple sources that capture different types of information about the handwritten characters is becoming a common way of capturing and integrating different types of variations in the writing. Finally, these concepts are just as important to languages other than those based on the Roman alphabet. In particular, Devanagari script is used by hundreds of millions of writers, and is potentially very well suited for a pen interface. This chapter presents a summary of the contributions made in this thesis, along with a discussion of some future areas of research that are likely to advance the state-of-the-art.

10.1 Contributions

The contributions of this thesis can be summarized as follows:

• A method of identifying different character writing styles (referred to as lexemes) in online data has been presented that makes use of a string matching measure and a squared-error clustering algorithm.
This method also makes use of a cluster dispersion measure to automatically choose an appropriate number of lexemes.

• A decision tree classifier was constructed which makes use of distances to prototypical lexemes to make decisions at each node of the tree. This method has been shown to achieve a classification accuracy of 86.9% on the 36-class alphanumeric character recognition problem (10 digits, 26 lowercase characters). The use of a decision tree provided a 99.4% reduction in classification time as compared to a full nearest neighbor classifier, while retaining 97.7% of the recognition accuracy.

• A hidden Markov model classifier was constructed in which a single model was trained for each lexeme class, where the lexemes have been based on the string matching measure. This method has been shown to achieve a recognition accuracy of 98.0% on digits, 86.85% on lowercase characters, and 89.47% on uppercase characters. These results are a significant improvement over those obtained when only a single model is trained for each character class: 95.62% on digits, 84.43% on lowercase characters, and 84.04% on uppercase characters. In addition, the classification accuracy for the combined 62-class problem has been shown to be 76.45%, with the correct classification choice within the top 3 choices over 95% of the time.

• A method of defining lexemes in the hidden Markov model probability space was demonstrated. This method performs a K-means clustering in which the distance of a pattern to a cluster is the likelihood that the pattern was generated by an HMM trained to model that cluster. It was shown that, compared to the clustering method based on the string matching measure, a 4.9% decrease in error rate can be achieved. Moreover, the algorithm has been shown to obtain a good clustering, similar to the best clustering using the string matching measure, without the requirement of any sophisticated cluster initialization.
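The HMM-based clustering loop alternates between training one sequence model per cluster and reassigning each pattern to the model under which it is most likely. The sketch below is a compressed illustration of that loop, not the thesis implementation: a Laplace-smoothed first-order Markov chain over discrete direction codes stands in for the continuous HMM, and the deterministic interleaved initialization is only for demonstration:

```python
import math
from collections import Counter

def train_chain(seqs, alphabet, smooth=1.0):
    # Laplace-smoothed bigram transition model fit to one cluster's sequences.
    counts = Counter()
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[(a, b)] += 1
    totals = Counter()
    for (a, _), c in counts.items():
        totals[a] += c
    return {(a, b): (counts[(a, b)] + smooth) / (totals[a] + smooth * len(alphabet))
            for a in alphabet for b in alphabet}

def log_lik(seq, model):
    # Log-likelihood of a sequence under the transition model.
    return sum(math.log(model[(a, b)]) for a, b in zip(seq, seq[1:]))

def hmm_kmeans(seqs, k, iters=5):
    """K-means-style loop: train a model per cluster, then reassign each
    sequence to the cluster whose model gives it the highest likelihood."""
    alphabet = sorted({c for s in seqs for c in s})
    clusters = [seqs[i::k] for i in range(k)]   # simple deterministic init
    for _ in range(iters):
        models = [train_chain(c, alphabet) for c in clusters]
        clusters = [[] for _ in range(k)]
        for s in seqs:
            j = max(range(k), key=lambda j: log_lik(s, models[j]))
            clusters[j].append(s)
    return clusters

# Two toy "writing styles" encoded as direction-code sequences
clusters = hmm_kmeans(["ururur", "dldldl", "rururu", "ldldld"], k=2)
# clusters == [['ururur', 'rururu'], ['dldldl', 'ldldld']]
```

Replacing `train_chain`/`log_lik` with Baum-Welch training and forward-algorithm scoring recovers the actual HMM-based algorithm of Section 5.3.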
• A method of combining multiple classifiers based on both online and offline features that capture different levels of structural characteristics of handwriting was demonstrated. This method was evaluated with 5 different classifiers that capture different online and offline properties of the writing, as well as different scales of structural information. The focus of this technique is on the misclassification characteristics of each classifier, and therefore this approach also improves the accuracy of individual classifiers. The best classifier combination achieves a recognition accuracy of 84.0% on the 62-class alphanumeric character recognition problem. This is a 32% reduction in error compared to the best single classifier accuracy.

• The application of character lexemes to the recognition of words was explored. An experiment involving a lexicon of 483 words made up of 93 different character classes (10 digits, 26 uppercase and 26 lowercase characters, and 31 special symbols) was conducted. The use of lexemes was shown to reduce the error rate by 32.0% for case-dependent word recognition and 35.4% for case-independent word recognition. Final writer-independent classification accuracies obtained on 8 different writers ranged from 48.2% to 84.1%, with an average of 64.7%.

• The application of our recognition system to the classification of Devanagari script was explored. This system achieves an accuracy of 86.5% on 40 common Devanagari characters, with the correct classification occurring in the top 3 choices approximately 95% of the time. To our knowledge, this is the first reported result on online Devanagari recognition.

• A writer-adaptation technique was described in which knowledge about lexemes that have been identified for a large number of writers is used to help identify lexemes within a single writer's handwritten data. This method requires very little data from the writer for which the models are adapted.
Experiments on lowercase handwritten characters show that average error rate reductions of 53.8% as compared to writer-independent classification accuracies, and 16.3% as compared to classification using a single writer-dependent model, are achieved. The latter result indicates that a single writer tends to write in multiple styles that can best be described using separate models. In the extreme, it was shown that with fewer than 10 examples per character class, divided among a minimum of two different writing styles, our adaptation technique can still reduce word recognition error rates by an average of 7.5%, and by as much as 18.7%, with this small amount of training data. For a larger training set, this method has been shown to reduce the error rate by 21.7%. In no instance did the adaptation technique increase the error rate of the recognition system by a significant amount.

10.2 Future Work

This section presents some areas in which our work could be extended to improve online handwriting recognition accuracy.

10.2.1 Classifier Combination for Word Recognition

Our current word-level results make use of the curvature/orientation features. An improvement on these results should be possible through the application of a combination of classifiers. While this is difficult during the word recognition process, because character segmentations are unknown until recognition is complete, combined classifier scores could be used to reliably rescore the top classification choices from the current classifier.

10.2.2 Devanagari Script Recognition

Results have been presented on Devanagari script recognition for a set of 40 characters. For these results to be applicable to Devanagari word recognition, we require the following:

• A method of segmenting Devanagari words into characters.
Alternatively, an approach that accomplishes concurrent segmentation and recognition could be attempted, such as was done for English word recognition. The large number and ordering of strokes in Devanagari makes this a challenging problem, however.

• The extension of our models to handle modifiers. This will increase the complexity of the character models.

• A word-level size and location normalization procedure may be required if segmentation and recognition are to be done concurrently. Our current method of English word normalization might be of use for this task with some minor adjustment. For example, the shirorekha should be easily found, providing a reliable estimate of the mid-line.

10.2.3 The HMM-based Clustering Algorithm

In Section 5.3, a method was shown in which the K-means algorithm was used to define a clustering algorithm in which a hidden Markov model represents each cluster. This approach, referred to as the HMM-based clustering algorithm, has been shown to produce a reasonable clustering on randomly initialized examples where the number of clusters is predefined. However, as is the case with most clustering algorithms, the potential exists for this algorithm to converge toward a local minimum if the initial set of clusters is poorly chosen, and therefore never reach a solution near the globally optimal one. This can be alleviated by avoiding any "hard" cluster membership decisions in the early iterations of clustering, and gradually allowing patterns to commit more and more to a single cluster. Methods of simulated annealing [52], deterministic annealing [40, 90, 107], and self annealing [28, 88] have been applied to non-temporal Gaussian mixtures, and may be useful techniques when properly adapted to temporal data. Figueiredo [28] has presented a method in which the number of clusters is determined by starting with an initially large number of clusters, some of which are annihilated during the clustering process.
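The "soft commitment" idea can be made concrete with a small annealed soft K-means sketch. This is shown for scalar, non-temporal data (the simple setting the cited annealing methods address), not for the HMM case the text proposes; the data, temperature schedule, and initialization are illustrative. Responsibilities are computed at inverse temperature beta, and raising beta gradually hardens the assignments, which lets the means escape a poor initialization:

```python
import math

def soft_kmeans_annealed(xs, k, betas=(0.5, 1, 2, 4, 8, 16)):
    """Soft K-means on scalars with responsibilities
    p_j(x) proportional to exp(-beta * (x - m_j)^2).
    Low beta keeps assignments soft; the schedule hardens them gradually."""
    means = xs[:k]   # deliberately poor initialization; annealing compensates
    for beta in betas:
        for _ in range(10):
            # E-step: soft responsibilities at the current temperature
            resp = []
            for x in xs:
                w = [math.exp(-beta * (x - m) ** 2) for m in means]
                z = sum(w)
                resp.append([wi / z for wi in w])
            # M-step: responsibility-weighted means
            means = [sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
                     for j in range(k)]
    return sorted(means)

# Both means start inside the left group, yet the annealed updates
# separate them onto the two clusters around 0.2 and 5.2.
ms = soft_kmeans_annealed([0.0, 0.2, 0.4, 5.0, 5.2, 5.4], k=2)
```

Adapting this to the HMM-based algorithm would mean weighting each sequence's contribution to every cluster's model re-estimation by such temperature-controlled responsibilities, instead of assigning it wholly to its best cluster.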
10.2.4 Writer-Adaptation

The writer-adaptation approach described in Section 9.3 has been shown to produce increases in classification accuracy at the word level when only a very small amount of training data is available from the user. An evaluation of this method on a slightly larger training set has produced very good results for the character recognition problem, and for a single instance of the word recognition problem. In order to test the full potential of this adaptation technique for word recognition, an evaluation using a somewhat larger set of training data from each of a large set of writers is of interest. In addition, a method of selecting character examples from the writer-independent training set to supplement the data available for writer-dependent model training could be implemented. Once lexeme labels are established for the new writer's data, character examples could be randomly selected from the data used to train the writer-independent model of that lexeme, thereby selecting supplementary examples from the writer-independent data that are very similar to each writer-dependent pattern.

Bibliography

[1] http://www.artcomp.com

[2] Special Issue on Scientific Visualization, IEEE Computer, vol. 32, no. 12, Dec. 1999.

[3] L.R. Bahl, P.F. Brown, P.V. deSouza, and R.L. Mercer, "Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition," Proc. ICASSP'86, Tokyo, Japan, pp. 49-52, Oct. 1986.

[4] L.R. Bahl, F. Jelinek, and R.L. Mercer, "A Maximum Likelihood Approach to Continuous Speech Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 5, no. 3, pp. 179-190, March 1983.

[5] V. Bansal and R.M.K. Sinha, "On How to Describe Shapes of Devanagari Characters and Use Them for Recognition," Proc. 5th Int. Conf. Document Analysis and Recognition, Bangalore, India, pp. 410-413, Sept. 1999.

[6] H.
Beigi, "Pre-Processing the Dynamics of On-Line Handwriting Data, Feature Extraction and Recognition," Proc. 5th Int. Workshop on Frontiers in Handwriting Recognition, Colchester, England, pp. 255-258, Sept. 1996.

[7] H. Beigi, K. Nathan, C.J. Clary, and J. Subrahmonia, "Size Normalization in On-Line Unconstrained Handwriting Recognition," Proc. ICIP'94, Austin, Texas, pp. 169-173, Nov. 1994.

[8] E.J. Bellegarda, J.R. Bellegarda, D. Nahamoo, and K.S. Nathan, "A Probabilistic Framework For On-Line Handwriting Recognition," Proc. 3rd Int. Workshop on Frontiers in Handwriting Recognition, Buffalo, New York, pp. 225-234, May 1993.

[9] Y. Bengio, Y. LeCun, C. Nohl, and C. Burges, "LeRec: A NN/HMM Hybrid for On-Line Handwriting Recognition," Neural Computation, vol. 7, no. 6, pp. 1289-1303, June 1995.

[10] S. Bercu and G. Lorette, "On-Line Handwritten Word Recognition: An Approach Based on Hidden Markov Models," Proc. 3rd Int. Workshop on Frontiers in Handwriting Recognition, Buffalo, New York, pp. 385-390, May 1993.

[11] J.J. Brault and R. Plamondon, "Segmenting Handwritten Signatures at Their Perceptually Important Points," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, no. 9, pp. 953-957, Sept. 1993.

[12] L. Breiman, "Bagging Predictors," Technical Report 421, Dept. of Statistics, Univ. of California at Berkeley, 1994.

[13] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees, Wadsworth, Belmont, CA, 1984.

[14] D.J. Burr, "Designing a Handwriting Reader," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 5, no. 5, pp. 554-559, Sept. 1983.

[15] K.F. Chan and D.Y. Yeung, "Elastic Structural Matching for On-Line Handwritten Alphanumeric Character Recognition," Proc. 14th Int. Conf. Pattern Recognition, vol. 2, Brisbane, Australia, pp. 1508-1511, Aug. 1998.

[16] B.B. Chaudhuri and U. Pal, "A Complete Printed Bangla OCR System," Pattern Recognition, vol. 31, no. 5, pp. 531-549, May 1997.

[17] B.B.
Chaudhuri and U. Pal, "An OCR System to Read Two Indian Language Scripts: Bangla and Devanagari," Proc. 4th Int. Conf. Document Analysis and Recognition, Ulm, Germany, pp. 1011-1015, Aug. 1997.

[18] K.-F. Chan and D.-Y. Yeung, "Elastic Structural Matching for On-line Handwritten Alphanumeric Character Recognition," Proc. 14th Int. Conf. Pattern Recognition, Brisbane, Australia, pp. 1508-1511, Aug. 1998.

[19] A.K. Jain and J. Coggins, "A Spatial Filtering Approach to Texture Analysis," Pattern Recognition Letters, vol. 3, pp. 195-203, 1985.

[20] S.D. Connell, "A Comparison of Hidden Markov Model Features for the Recognition of Cursive Handwriting," Master's Thesis, Dept. of Computer Science, Michigan State University, May 1996.

[21] S. Connell and A.K. Jain, "Learning Prototypes for On-Line Handwritten Digits," Proc. 14th Int. Conf. Pattern Recognition, Brisbane, Australia, pp. 182-184, Aug. 1998.

[22] S. Connell and A.K. Jain, "Template-based Online Character Recognition," to appear in Pattern Recognition.

[23] S. Connell and A.K. Jain, "Writer Adaptation of Online Handwritten Models," Proc. 5th Int. Conf. Document Analysis and Recognition, Bangalore, India, pp. 434-437, Sept. 1999.

[24] J. Crettez, "A Set of Handwriting Families: Style Recognition," Proc. 3rd Int. Conf. Document Analysis and Recognition, Montreal, Canada, pp. 489-494, Aug. 1995.

[25] D.L. Davies and D.W. Bouldin, "A Cluster Separation Measure," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 224-227, April 1979.

[26] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, Wiley-Interscience, 1973.

[27] R.P.W. Duin, D. de Ridder, and D.M.J. Tax, "Experiments With Object Based Discriminant Functions; A Featureless Approach to Pattern Recognition," Pattern Recognition Letters, vol. 18, no. 11-13, pp. 1159-1166, 1997.

[28] M.A.T.
Figueiredo and A.K. Jain, "Unsupervised Selection and Estimation of Finite Mixture Models," to appear in ICPR'2000, Barcelona, Sept. 2000.

[29] M.A.T. Figueiredo, H.M.N. Leitao, and A.K. Jain, "On Fitting Mixture Models," in Energy Minimization Methods in Computer Vision and Pattern Recognition, M. Pelillo and E. Hancock, eds., pp. 54-69, Springer Verlag, 1999.

[30] G.D. Forney, "The Viterbi Algorithm," Proc. of the IEEE, vol. 61, pp. 268-278, 1973.

[31] Y. Freund and R.E. Schapire, "Experiments With a New Boosting Algorithm," Proc. 13th Int. Conf. Machine Learning, pp. 148-156, 1996.

[32] http://www.wiley.co.uk/college/busin/icmis/oakman/outline/chap12/misc/graffiti.htm

[33] R.M. Gray, "Vector Quantization," Readings in Speech Recognition, Alex Waibel and Kai-Fu Lee, eds., Morgan Kaufmann, pp. 75-100, 1990.

[34] S.A. Guberman, I. Lossev, and A.V. Pashintsev, "Method and Apparatus for Recognizing Cursive Writing from Sequential Input Information," US Patent WO94/07214, March 1994.

[35] W. Guerfali and R. Plamondon, "Normalizing and Restoring On-Line Handwriting," Pattern Recognition, vol. 26, no. 3, p. 419, March 1993.

[36] I. Guyon, P. Albrecht, Y. Le Cun, J.S. Denker, and W. Hubbard, "Design of a Neural Network Character Recognizer for a Touch Terminal," Pattern Recognition, vol. 24, no. 2, pp. 105-119, Feb. 1991.

[37] I. Guyon, L. Schomaker, R. Plamondon, M. Liberman, and S. Janet, "UNIPEN Project of On-Line Data Exchange and Recognizer Benchmarks," Proc. 12th Int. Conf. Pattern Recognition, Jerusalem, Israel, pp. 29-33, Oct. 1994.

[38] S.C. Hinds, J.L. Fisher, and D.P. D'Amato, "A Document Skew Detection Method Using Run-Length Encoding and the Hough Transform," Proc. 10th Int. Conf. Pattern Recognition, Atlantic City, NJ, pp. 464-468, June 1990.

[39] T.K. Ho, J.J. Hull, and S.N. Srihari, "Decision Combination in Multiple Classifier Systems," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 16, no. 1, pp. 66-75, Jan. 1994.

[40] T.
Hofmann and J. Buhmann, "Pairwise Data Clustering by Deterministic Annealing," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 1, pp. 1-14, Jan. 1997.

[41] J.M. Hollerbach, "An Oscillation Theory of Handwriting," Biological Cybernetics, vol. 39, pp. 139-156, 1981.

[42] H. Hotelling, "Analysis of a Complex of Statistical Variables Into Principal Components," Journal of Educational Psychology, vol. 24, pp. 417-441, 498-520, 1933.

[43] J. Hu, A.S. Rosenthal, and M.K. Brown, "Combining High-Level Features with Sequential Local Features for On-Line Handwriting Recognition," Proc. Italian Image Process. Conf., Florence, Italy, pp. 647-654, Sept. 1997.

[44] J. Hu, M.K. Brown, and W. Turin, "HMM Based On-Line Handwriting Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 10, pp. 1039-1045, Oct. 1996.

[45] X.D. Huang and M.A. Jack, "Semi-Continuous Hidden Markov Models For Speech Signals," Readings in Speech Recognition, Alex Waibel and Kai-Fu Lee, eds., Morgan Kaufmann, pp. 340-345, 1990.

[46] Y.S. Huang and C.Y. Suen, "A Method of Combining Multiple Experts for the Recognition of Unconstrained Handwritten Numerals," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 17, no. 1, pp. 90-94, Jan. 1995.

[47] A.K. Jain, L. Hong, S. Pankanti, and R. Bolle, "An Identity-Authentication System Using Fingerprints," Proc. of the IEEE, vol. 85, no. 9, pp. 1365-1388, Sept. 1997.

[48] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, Prentice-Hall, 1988.

[49] A.K. Jain, R.P.W. Duin, and J. Mao, "Statistical Pattern Recognition: A Review," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-37, Jan. 2000.

[50] A.K. Jain and D. Zongker, "Representation and Recognition of Handwritten Digits using Deformable Templates," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 12, pp. 1386-1391, Dec. 1997.

[51] http://www.cic.com

[52] S. Kirkpatrick, C. Gelatt, and M.
Vecchi, "Optimization by Simulated Annealing," Science, vol. 220, pp. 671-680, 1983.

[53] J. Kittler, "On Combining Classifiers," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226-238, March 1998.

[54] T. Kohonen, "The Self-Organizing Map," Proc. of the IEEE, vol. 78, no. 9, pp. 1464-1480, Sept. 1990.

[55] T. Kohonen, J. Kangas, J. Laaksonen, and K. Torkkola, "LVQ_PAK: A Program Package for the Correct Application of Learning Vector Quantization Algorithms," Int. Joint Conference on Neural Networks, Baltimore, vol. I, pp. 725-730, June 1992.

[56] A. Kosmala, J. Rottland, and G. Rigoll, "Improved On-Line Handwriting Recognition Using Context Dependent Hidden Markov Models," Proc. 4th Int. Conf. Document Analysis and Recognition, Ulm, Germany, pp. 641-644, Aug. 1997.

[57] K.-F. Lee, "Hidden Markov Models: Past, Present, and Future," Proc. Eurospeech'89, vol. 1, Paris, pp. 148-155, Sept. 1989.

[58] S.E. Levinson, L.R. Rabiner, and M.M. Sondhi, "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition," Bell System Technical Journal, vol. 62, no. 4, pp. 1035-1074, April 1983.

[59] X. Li, R. Plamondon, and M. Parizeau, "Model-based On-line Handwritten Digit Recognition," Proc. 14th Int. Conf. Pattern Recognition, Brisbane, Australia, pp. 1134-1136, Aug. 1998.

[60] X. Li and D.-Y. Yeung, "On-line Handwritten Alphanumeric Character Recognition Using Dominant Points in Strokes," Pattern Recognition, vol. 30, no. 1, pp. 31-44, Jan. 1997.

[61] X. Li, M. Parizeau, and R. Plamondon, "Segmentation and Reconstruction of On-Line Handwritten Scripts," Pattern Recognition, vol. 31, no. 6, pp. 675-684, June 1998.

[62] S. Madhvanath and V. Govindaraju, "Serial Classifier Combination For Handwritten Word Recognition," Proc. 3rd Int. Conf. Document Analysis and Recognition, vol. II, Montreal, Canada, pp. 911-914, Aug. 1995.

[63] E. Mandler, R. Oed, and W.
Doster, "Experiments In On-Line Script Recognition," Proc. 4th Scandinavian Conf. Image Analysis, pp. 75-86, June 1985.

[64] S. Manke, M. Finke, and A. Waibel, "NPen++: A Writer Independent, Large Vocabulary On-Line Cursive Handwriting Recognition System," Proc. 3rd Int. Conf. Document Analysis and Recognition, Montreal, Canada, pp. 403-408, Aug. 1995.

[65] S. Manke and U. Bodenhausen, "A Connectionist Recognizer for On-Line Cursive Handwriting Recognition," Proc. ICASSP'94, vol. 2, Adelaide, Australia, pp. 633-636, April 1994.

[66] S.S. Marwah, S.K. Mullick, and R.M.K. Sinha, "Recognition of Devanagari Characters Using a Hierarchical Binary Decision Tree Classifier," Proc. IEEE Int. Conf. Systems, Man and Cybernetics, New York, pp. 414-420, 1984.

[67] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions, John Wiley & Sons, New York, 1997.

[68] K. Mohiuddin and J. Mao, "A Comparative Study of Different Classifiers For Handprinted Character Recognition," in Pattern Recognition in Practice IV, E.S. Gelsema and L.N. Kanal, eds., Elsevier Science, The Netherlands, pp. 437-448, 1994.

[69] G. Nagy, "Twenty Years of Document Image Analysis in PAMI," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 38-62, Jan. 2000.

[70] V.S. Nalwa, "Automatic On-Line Signature Verification," Proc. of the IEEE, vol. 85, no. 2, pp. 215-239, Feb. 1997.

[71] K.S. Nathan, J.R. Bellegarda, D. Nahamoo, and E.J. Bellegarda, "On-Line Handwriting Recognition Using Continuous Parameter Hidden Markov Models," Proc. ICASSP'93, vol. 5, Minneapolis, MN, pp. 121-124, May 1993.

[72] K.S. Nathan, H.S.M. Beigi, J. Subrahmonia, C.J. Clary, and H. Maruyama, "Real-Time On-Line Unconstrained Handwriting Recognition Using Statistical Methods," Proc. ICASSP'95, vol. 4, Detroit, MI, pp. 2619-2623, May 1995.

[73] F. Nouboud and R. Plamondon, "On-Line Recognition of Handprinted Characters: Survey and Beta Tests," Pattern Recognition, vol. 25, no. 9, pp. 1031-1044, Sept.
1990. [74] L. O’Gorman and R. Kasturi, Document Image Analysis, IEEE Computer Soci- ety Press, 1995. [75] Palm Computing, “Graffiti Reference Card,” [Online] Available http://www.3com.com, Oct. 1999. [76] http: //www.paragraph.com [77] I. Pavlidis, R. Singh, and N .P. Papnikolopoulos, “On-Line Handwriting Recog- nition Using Physics-Based Shape Metamorphosis,” Pattern Recognition, vol. 31, no. 11, pp. 1589-1600, Nov. 1998. [78] MP. Perrone and SD. Connell, “K-Means Clustering For Hidden Markov Mod- els,” to appear in 7th Int. Workshop on Frontiers in Handwriting Recognition, Amsterdam, The Netherlands, Sept. 2000. [79] R. Plamondon, D. Lopresti, L.R.B. Schomaker, and R. Srihari, “On-Line Hand— writing Recognition,” Encyclopedia of Electrical and Electronics Engineering, J .G. Webster, ed., vol. 15, pp. 123-146, Wiley, 1999. 167 [80] R. Plamondon and F.J. Maarse, “An Evaluation of Motor Models of Handwrit- ing,” IEEE Trans. Systems, Man, and Cybernetics, vol. 19, no. 5, pp. 1060-1072, May 1989. [81] R. Plamondon and SN. Srihari, “On-Line and Off-Line Handwriting Recog- nition: A Comprehensive Survey,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 63-84, Jan. 2000. [82] L. Prevost and M. Milgram, “Automatic Allograph Selection and Multiple Clas- sification for Totally Unconstrained Handwritten Character Recognition,” Proc. 14th Int. Conf. Pattern Recognition, Brisbane, Australia, pp. 381-383, Aug. 1998. [83] L. Prevost and M. Milgram, “Non-supervised Determination of Allograph Sub- classes for On-line Omni-scriptor Handwriting Recognition,” Proc. 5th Int. Conf. Document Analysis and Recognition, Bangalore, India, pp. 438-441, Sept. 1999. [84] W. Postl, “Detection of Linear Oblique Structures and Skew Scan in Digitized Documents,” Proc. 8th Int. Conf. Pattern Recognition, Paris, France, pp. 687-689, Oct. 1986. [85] J.R. Quinlan, “Bagging, Boosting, and C45,” Proc. 13th National Conference on Artificial Intelligence, Portland, OR, Aug. 
1996. [86] J .R. Quinlan, C45: Programs for Machine Learning, Morgan Kauffman, 1993. [87] L. R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proc. of the IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989. [88] A. Rangarajan, “Self Annealing.” in Energy Minimization Methods in Computer Vision and Pattern Recognition, M. Pellilo and E. Hancock, eds., pp. 229-244, Springer Verlag, 1997. [89] G. Rigoll, A. Kosmala, and D. Willett, “A New Hybrid Approach to Large Vocabulary Cursive Handwriting Recognition,” Proc. 14th Int. Conf. on Pattern Recognition, Brisbane, Australia, pp. 1512-1514, Aug. 1998. [90] K. Rose, E. Gurewitz, and G. Fox, “A Deterministic Annealing Approach to Clustering,” Pattern Recognition Letters, vol. 11, no. 11, pp. 589-594, Nov. 1990. [91] P. Scattolin and A. Krzyzak, “Weighted Elastic Matching Method for Recogni- tion of Handwritten Numerals,” Vision Interface ’94, pp. 178-185, 1994. [92] M. Schenkel, I. Guyon, and D. Henderson, “On-Line Cursive Script Recog- nition Using Time Delay Neural Networks and Hidden Markov Models,” Proc. ICASSP’94, vol. 2, Adelaide, Australia, pp. 637-640, April 1994. [93] L.R.B. Schomaker and H.-L. Teulings, “A Handwriting Recognition System based on the Properties and Architectures of the Human Motor System,” Proc. Int. Workshop on Frontiers in Handwriting Recognition, Montreal, pp. 195-211, April 1990. 168 [94] L.R.B. Schomaker, E.H. Helsper, H.L. Teulings, and G.H. Abbink, “Adaptive recognition of online, cursive handwriting,” Proc. 6th Int. Conf. Handwriting and Drawing, Paris, fiance, pp. 19-21, July 1993. [95] G. Seni, R.K. Srihari, and N. N asrabadi, “Large Vocabulary Recognition of On- Line Handwritten Cursive Words,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 7, pp. 757-762, July 1996. [96] AW. Senior, J. Subrahmonia, and K. Nathan, “Duration Modeling Results For An On-Line Handwriting Recognizer,” Proc. ICASSP’96, Atlanta, Georgia, pp. 
3483-3486, May 1996. [97] RE. Shapire, Y. Freund, P. Bartlett, and W.S. Lee, “Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods,” The Annals of Statis- tics, vol. 26, no. 5, pp. 1651-1686, 1997. [98] R.M.K. Sinha and V. Bansal, “On Devanagari Document Processing,” Proc. Int. Conf. on Systems, Man and Cybernetics, Vancouver, BC, pp. 1621-1626, Oct. 1995. [99] R.M.K. Sinha and H.N. Mahabala, “Machine Recognition of Devanagari Script,” IEEE Trans. Systems, Man and Cybernetics, vol. 9, no. 8, pp. 435-441, Aug. 1979. [100] J. Subrahmonia, K. Nathan, and M. Perrone, “Writer Dependent Recognition of On-Line Unconstrained Handwriting,” Proc. ICASSP ’96, vol. 6, Atlanta, Georgia, pp. 3478-3481, May 1996. [101] J. Makhoul, T. Starner, R. Schwartz, and G. Chou, “On-Line Cursive Hand- writing Recognition Using Speech Recognition Methods,” Proc. ICASSP’94, vol. 5, Adelaide, Australia, pp. 125-128, April 1994. [102] H. Takahashi, “A Neural Net OCR Using Geometrical and Zonal-Pattern Fea- tures,” Proc. 1st Int. Conf. Document Analysis and Recognition, pp. 821-828, 1991. [103] CC. Tappert, “Adaptive On-Line Handwriting Recognition,” Proc. 7th Int. Conf. on Pattern Recognition, Montreal, Canada, pp. 1004-1007, July-Aug. 1984. [104] CC. Tappert, “Cursive Script Recognition by Elastic Matching,” IBM Journal of Research and Development, vol. 26, no. 6, pp. 765-771, June 1982. [105] CC. Tappert, C.Y. Suen, T. Wakahara, “The State of the Art in On-Line Hand- writing Recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 12, no. 8, pp. 787-808, Aug. 1990. [106] H. Teulings and L. Schomaker, “Unsupervised Learning of Prototype Allographs in Cursive Script Recognition,” From Pixels to Features III: Frontiers in Handwrit- ten Recognition, S. Impedovo and J. Simon, eds., North-Holland, pp. 61-73, 1992. [107] N. Udea and R. Nakano, “Deterministic Annealing EM Algorithm,” Neural Networks, vol. 11, pp. 271-282, 1998. 169 [108] A.J. 
Viterbi, “Error bounds for convolutional codes and an asymptotically 0p- timal decoding algorithm,” IEEE Trans. Information Theory, vol. IT-13, pp. 260- 269, 1967. [109] L. Vuurpijl and L. Schomaker, “Finding Structure in Diversity: A Hierarchical Clustering Method for the Categorization of Allographs in Handwriting,” Proc. 4th Int. Conf. Document Analysis and Recognition, Ulm, Germany, pp. 387-393, Aug. 1997. [110] T. Wakahara, “On-Line Cursive Script Recognition Using Local Affine Trans- formation,” Proc. 9th Int. Conf. Pattern Recognition, pp. 1133-1137, Nov. 1988. [111] T. Wakahara, “Adaptive Normalization of Handwritten Characters Using Global/ Local Affine Transformation,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1332-1341, Dec. 1998. [112] T. Wakahara, H. Murase, and K. Odaka, “On-Line Handwriting Recognition,” Proc. of the IEEE, vol. 80, no. 7, pp. 1181-1194, July 1992. [113] KY. Wong, R.G. Casey, and F .M. Wahl, “Document Analysis System,” IBM J. Research and Development, vol. 26, no. 6, pp. 647-656, Nov. 1982. [114] LS. Yaeger, B.J. Webb, and RF. Lyon, “Combining Neural Networks and Context-Driven Search for Online, Printed Handwriting Recognition in the New- ton,” AI Magazine, pp. 73-89, Spring 1998. [115] K. Yamasaki, “Automatic Allograph Categorization Based on Stroke Clustering for On-Line Handwritten Japanese Character Recognition,” Proc. 14th Int. Conf. Pattern Recognition, Brisbane, Australia, pp. 1150-1152, Aug. 1998. 170