CURSIVE SCRIPT POSTAL ADDRESS RECOGNITION

By

Prasun Sinha

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Computer Science

1997

ABSTRACT

CURSIVE SCRIPT POSTAL ADDRESS RECOGNITION

By

Prasun Sinha

Large variations in writing styles and the difficulty of segmenting cursive words are the main reasons why cursive script postal address recognition is a challenging task. A scheme for locating and recognizing words based on over-segmentation followed by dynamic programming is proposed. This technique is being used for zip code extraction as well as city-state recognition in our system. Results have been reported based on cursive script images from the United States Postal Service (USPS) database. An overview of the cursive script postal address recognition system under development at the IBM Almaden Research Center is presented.

Optical Character Recognition (OCR) systems are usually trained to minimize character level errors, which does not necessarily guarantee the best results for word level recognition. A scheme for combining character-level optimized classifiers, using a word-level optimization function, to achieve higher word level recognition rates is proposed. The optimum values for the parameters of the combination function are obtained using the gradient descent method. Improvements in word recognition rates on the USPS database by using word-level optimized ensembles of classifiers are reported.
To My Family

ACKNOWLEDGMENTS

I would like to acknowledge all the people who have assisted me throughout my graduate studies at Michigan State University and my work on Cursive Script Postal Address Recognition at the IBM Almaden Research Center. I am extremely grateful to my advisors Dr. Jianchang Mao, Dr. Anil Jain and Dr. Weng for their guidance and encouragement throughout my Master's and for the invaluable time that they spent with me on the thesis work. Also, I would like to acknowledge Dr. K. M. Mohiuddin, Mr. S. Gopisetty, Dr. S. Madhvanath and Mr. T. Truong for their encouragement and support during my stay at IBM Almaden.

I would like to dedicate this thesis to my family, who have constantly encouraged me to achieve new heights. Without their love, patience, understanding, and support, I would not have finished this thesis. I would also like to acknowledge the help and support I have received from my friends at Michigan State University.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

1 Introduction
1.1 Thesis Outline

2 System Overview
2.1 Introduction
2.2 Modules
2.2.1 Camera Setup
2.2.2 Image Binarization
2.2.3 Address Block Location (ABL)
2.2.4 Skew Detection and Correction
2.2.5 Line Separation
2.2.6 Hand Written or Machine Printed
2.2.7 Connected Components Analysis
2.2.8 Slant Detection and Correction
2.2.9 Over-segmentation
2.2.10 Optical Character Recognition (OCR)
2.2.11 Zip-code Extraction
2.2.12 City-State Recognition
2.2.13 Street Level Recognition
2.2.14 Destination Address Encoding
2.2.15 Video Coding System
2.3 Contributions of the Thesis

3 Character Classifier
3.1 Introduction
3.2 Features
3.3 Character Classifiers
3.3.1 Cost Function
3.3.2 Activation Functions
3.3.3 Neural Network Layout
3.4 Character Level Classification Performance
3.5 Summary

4 Over-segmentation
4.1 Introduction
4.2 Over-Segmentation Algorithm
4.3 Results

5 Zip Code Extraction
5.1 Introduction
5.2 Zip Location and Zip Recognition as Shortest Path Problems
5.2.1 One Step vs. Two Step Model
5.3 Zip Code Location
5.3.1 Dynamic Programming Algorithm
5.3.2 Implementational Details
5.4 Zip Code Recognition
5.4.1 HMM based approach for zip code recognition
5.5 Experiments and Results
5.6 Extension to 9 digits

6 City-State Recognition
6.1 Introduction
6.2 Generalized Dynamic Programming Scheme
6.3 Rejection Mechanism
6.4 Results
6.5 Summary

7 Combining Classifiers to Improve Word Level Recognition
7.1 Introduction
7.2 Word Level Cost Function
7.3 Gradient Descent Method
7.4 Training Setup
7.5 Different Combinations of Neural Networks
7.6 Computation of the Partial Derivatives
7.7 Summary

8 Summary, Conclusions and Future Research
8.1 Summary
8.2 Conclusions
8.3 Future Research

LIST OF TABLES

3.1 Character level performance: Results of character level recognition on a training and test set for a digit classifier, an alphabetic classifier and a generalized classifier are presented in this table for two different error functions.

4.1 Frequency of Graphemes per Character: This table shows the average number of graphemes found by our over-segmentation algorithm for each of the lower case alphabetic characters. These figures were computed using 1000 cursive postal address images taken from the USPS database.

5.1 Confusion Table: The column labels stand for the results of the numeric classifier and the three rows stand for the three most confusing digits for those results.
For example, the column with label "3" shows that when the numeric classifier returns "3", then besides 3, the true digit could be "5", "7" or "2", with probabilities decreasing respectively.

5.2 Transition Probability Table: This table shows the transition probability from the first to the second most significant digit in a 5-digit zip code (a_ik, where i = {1..10} and k = {1..10}). An entry in the row with header i and the column with header j represents the probability of the digit pair (i, j) occurring at the beginning of the 5-digit zip code.

5.3 Probability of a digit in the i-th place in a 5-digit zip code: This table shows the probability of a digit being in the i-th place in the 5-digit zip code. The columns correspond to the 10 digits and the rows correspond to the 5 positions in the 5-digit zip code. The first row corresponds to the initial probabilities π_ij. For example, the probability of the 1st digit in a 5-digit zip code being "8" is 6%.

5.4 Zip Recognition: Out of a database of 562 addresses, 548 were accepted, out of which the zip was correctly located in 492 addresses (460 from the best path and 32 from the second best path). This table represents the coverage of recognizing the correct zip code for the correctly located zips (492).

5.5 Zip Extraction: Out of a database of 562 addresses, 548 were accepted. This table represents the coverage of extracting the correct zip code.

6.1 Coverage for City-State-Zip recognition: The system was tested on 805 cursive script images taken from the USPS database. 632 (78.51%) of those addresses were accepted. The numbers shown in the table are rates over the accepted addresses.

7.1 The Two Individual Classifiers: The two neural networks used for performing experiments on word level training are summarized here.
The training data consists of 5305 words and was used for training the individual networks and also for estimating the parameters of the combination function. The test data consists of 4417 words. Both these sets of words were obtained from cursive script images from the USPS database.

7.2 Results of word level training: The training data refers to 5305 words from the USPS database. The test data refers to 4417 words, also from the USPS database. The training data was used to find the values of the parameters of the combination functions.

LIST OF FIGURES

2.1 The training phase of the system: The system for training the Neural Network classifiers is shown here. The starting and transition probabilities of characters and digits are also estimated in the training phase.

2.2 System overview: The input to the system is the gray scale image of the mail-piece and the output is the destination address encoding.

3.1 Features based on contour directions.

3.2 An example of segments s_-1, s_0 and s_1 for the 20 cutting point features.

3.3 A 1-hidden layer feed-forward neural network.

3.4 The standard sigmoid function f(x) = 1/(1 + e^(-βx)): This plot is for β = 1.

4.1 Upper and lower contours: a is the leftmost point on the contour and b is the rightmost point on the contour. The part of the contour from the point a to the point b while traversing the contour in the clockwise direction is called the upper-contour and the remaining part is called the lower-contour. In the figure the upper-contour is denoted by the smooth part of the contour and the dotted part refers to the lower-contour.

4.2 Features on the upper-contour: The three main features on the upper contour used to detect cuts are shown in this figure. The relevant cuts are denoted by an arrow in the upper contour and a corresponding arrow in the lower contour.
(a and b) Smooth valleys. (c and d) Sharp valleys. (e and f) Horizontal stretches.

4.3 Some examples of over-segmentation using the technique discussed in this chapter: The figures from (a) to (e) are the original words and the corresponding results after over-segmentation are shown from (f) to (j). These words were taken from the USPS database.

5.1 Some examples of addresses from the real mail-stream: (a) The zip digits are not isolated from the state name. (b) Zip digits are broken due to poor scanning or due to badly written digits. (c) The zip digits are touching.

5.2 The last line of an address: This is the last line of the address shown in Figure 5.1(c). We will work with this example to illustrate the concepts in this chapter.

5.3 The result of over-segmentation: The graphemes obtained after over-segmentation of the address line shown in Figure 5.2 are shown separated by vertical lines.

5.4 Graphemes selected for zip extraction: It is found that the 5-digit zip code, if present on an address line, can usually be located in the last ten graphemes of that address line. Hence, the last 10 graphemes from Figure 5.3 have been selected.

5.5 One step model: The 11 vertices correspond to the 10 graphemes from Figure 5.4. For clarity, not all the edges have been shown. The best path is shown by bold lines and all the edges from the leftmost vertex are shown by dashed lines. The best path locates as well as recognizes the 5-digit zip code. The first three graphemes are deleted and are marked by d's.

5.6 Two step model: The 11 vertices correspond to the 10 graphemes from Figure 5.4. For clarity, not all the edges have been shown. The best path is shown by bold lines and all the edges from the leftmost vertex are shown by dashed lines.
The best path locates the 5-digit zip code. The first three graphemes are deleted and are marked by d's, and the five graphemes and segments in the best path are marked #1 through #5.

5.7 Edges in the dynamic programming table: The four types of edges are numbered 1 through 4. (1) It denotes that a grapheme corresponds to a digit. (2) It denotes that two adjacent graphemes are being combined to match with a digit. (3) It denotes that three adjacent graphemes are being combined to match with a digit. (4) It represents that a grapheme is being deleted.

5.8 Zip code location: The table for the graphemes selected in Figure 5.4 is shown in this figure. As the slack in this case is 10 - 5 = 5, the table has 6 (= 1 + slack) rows, and 6 (= 1 + #zip digits) columns for the 5 zip digits. The best path through the table from the cell (1,1) to the cell (6,6) is shown. The path shows that the first three graphemes are deleted (vertical strokes), the 4th and 5th are combined to form the first digit, the 6th and 7th are combined to form the second digit, and the 8th, 9th and 10th represent the remaining three digits of the 5-digit zip code.

5.9 The Hidden Markov Model λ: The nodes in this figure correspond to the states of the HMM and the arrows correspond to the non-zero transition probabilities. The rows of states correspond to digits from 0 to 9, and the columns correspond to the 5 observations obtained using the dynamic programming technique.

6.1 Example of a city-state-zip image: (a) The line containing the city name (Coral Spgs), state name (FL) and zip code (33071). (b) Results of segmentation.

6.2 The dynamic programming table: It shows the best path for the image shown in Figure 6.1.

6.3 Utilizing common prefixes to save computation in the dynamic programming scheme: Words w1 and w2 have the common prefix p.
The slacks corresponding to the two words are L1 and L2 respectively. The first row and the first column are marked by dotted lines. The shaded portion of the table represents the part of the table whose computations are common to both the words and hence can be reused.

7.1 The globally trained OCR: The results of the two classifiers are combined to produce the final OCR results. The parameters of the combination function will be optimized using a word level criterion.

7.2 The combination function: The function f combines the outputs of the two neural networks (classifiers in general) and produces the output vector y. The parameters of the function, the θ_i's, are obtained by word level training using the gradient descent method.

Chapter 1

Introduction

Postal address recognition deals with extraction of the delivery point encoding, e.g., the 11-digit zip code, from scanned images of postal addresses [1]. These images have wide variations in terms of the style of writing (whether it is machine-printed, hand-printed or cursive), the kind of instrument used for writing, the size of the characters, etc. Some of the addresses are misspelled and some images are of very poor quality. These are some of the reasons why postal address recognition is a challenging task.

Once the delivery point encoding has been extracted, the letter can be routed without manual intervention. This also implies that an incorrect extraction of the destination address encoding may result in a letter being delivered to a wrong address. Hence, minimization of the error rate is very critical and minimization of the reject rate is desirable. For this we need to recognize the street address, the city name, the state name and the zip code, if present. Since there are only a fixed number of values for each of these fields, lexicons of city names, state names, street names and zip codes are used for recognition.
This delivery point encoding is used by sorting machines at the post offices for automatic sorting of mail. It has been suggested that consideration of the whole mail chain (the components of the automated system) can lead to a coherent research program. A survey of research projects on printed and handwritten addresses for small envelopes and flat mail (magazines, big envelopes, etc.) conducted at the French post office research center (SRTP) is given in [2]. It has been noted that the aspect of mail processing which is the least developed is sorting the mail in delivery order. This is also supposed to be the part which consumes the most manpower [3]. A summary of the history of postal mechanization in Japan is provided in [4]. Srihari [5] has described a complete system for assigning 9-digit zip codes to US mail-pieces.

This thesis focuses on recognition of US postal addresses written in cursive script. The system is currently being designed to extract the 9-digit destination address encoding. It can later be extended to arrive at the 11-digit zip code, if required. The problem is related to Off-Line Cursive Script Word Recognition. Large variations in handwritten words as well as overlaps and interconnections between neighboring characters are the major hurdles in solving this problem. The following are some prevalent techniques for off-line cursive handwriting recognition:

- HMM (Hidden Markov Model) on unsegmented words [6]: One of the common approaches is the sliding window approach, where features are extracted from within the window and finally an HMM is used to hypothesize the word. This is not a practical approach for large lexicons.
- Over-segmentation followed by HMM: In this scheme the segments obtained after segmentation are combined using an HMM. Chen, Kundu and Srihari [7] have developed a variable duration HMM technique which has been tested on unconstrained handwritten words. Some other research groups have also proposed similar approaches [8, 9].

- Over-segmentation followed by Dynamic Programming: This scheme is being used by the CEDAR group at SUNY Buffalo [10] and has also been used by Gader et al. [11]. The performance depends on the results of over-segmentation. The dynamic programming technique tries to combine the over-segmented characters and then matches those characters against the words in the available lexicon. As opposed to the Over-segmentation followed by HMM method, over-segments are physically combined in this scheme, and features are extracted by the underlying OCR engine.

The postal address recognition system being developed at the IBM Almaden Research Center is based on the Over-segmentation followed by Dynamic Programming approach. This thesis deals with several components of this system and is related to the problem of recognition of cursive script addresses.

The design and implementation of the over-segmentation algorithm, and of the dynamic programming technique for locating and recognizing a word, was done as a part of this thesis. The dynamic programming technique was used successfully for zip code recognition and for city and state name recognition. Street number, street name and P.O. box number recognition are currently under testing, but they are all based on the same technique. At the heart of this technique is the OCR engine. We discovered that optimization of the character level recognition for OCR may not guarantee higher word level recognition rates, which is our ultimate goal (City name, Street name, etc.
can all be viewed as words). Hence, several ways of combining neural networks to optimize word level criteria based on word level recognition were tried. These experiments resulted in several word-level optimized classifiers.

1.1 Thesis Outline

The outline of the thesis is as follows. Chapter 2 presents an overview of the postal address recognition system. The features used for character recognition and the neural networks used as classifiers are discussed in Chapter 3. The over-segmentation algorithm is described in Chapter 4. Chapter 5 presents the technique used for recognizing zip codes in US postal addresses. A similar technique for recognizing city names and state names is presented in Chapter 6. A new technique for combining the results of individual classifiers based on word level recognition is described in Chapter 7. Chapter 8 presents the summary, conclusions and some ideas for future research.

Chapter 2

System Overview

2.1 Introduction

Our postal address recognition system takes images of mail-pieces as input and is supposed to produce the destination address encoding as output. The first step is to locate the destination address block (DAB) on the image, which is done by the Address Block Location (ABL) module.

The image obtained from the camera is a gray level image and it is passed through some pre-processing stages. Since the address is usually written as dark pixels on a light background, it is believed that a binary image, provided the image is of good quality, does not result in a significant loss of information. Hence, the image is binarized, which is followed by skew correction. The lines in the image are then separated using a line separation algorithm. The address block is then analyzed to identify the address as machine printed or hand written.
The machine printed text is recognized using a different technique, which is outside the scope of this thesis. This is followed by connected components analysis and slant correction of the connected components. The large connected components are over-segmented using the over-segmentation algorithm, discussed later in the thesis.

The next module extracts the 5 or 9-digit zip code from the address, if present. Currently, the module rejects the mail-piece if the zip code is not extracted and sends it to the Video Coding System (VCS), which is discussed later. The ordered list of zip codes produced by this module is then compared against a lexicon of valid zip codes. This process typically eliminates more than half of the extracted zip codes. The city and state names for the corresponding zip codes are then listed and matched against the address. Both the city-state recognition and the zip recognition use a dynamic programming technique for matching lexicon entries with a given sequence of over-segments. The city-state recognition module reorders the list of city-state pairs. The first two entries from the reordered list are used to generate all possible street names for those cities. Currently, the same dynamic matching scheme is under test for street level recognition. The matching scheme uses a classifier (the OCR module), which in our case is a 1-hidden layer neural network. Better classifiers based on word level training have also been designed (Chapter 7). The recognition modules use various lexicons such as the zip code dictionary, the city-state dictionary, etc. All the rejects from these modules are sent to the VCS, where the destination address encoding is entered manually.
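The dynamic-programming matching of a lexicon entry against a sequence of over-segments can be sketched as follows. This is our own minimal illustration, not the system's code: `char_score` stands in for the OCR engine's match cost for a merged bitmap, and the merge limit of three graphemes per character and the deletion penalty are assumed values (Chapter 5 describes edges that merge up to three graphemes or delete one).

```python
# Hypothetical sketch: match a lexicon word against a grapheme sequence.
# Up to MAX_MERGE adjacent graphemes may be merged into one character,
# and a grapheme may be deleted at a fixed penalty.
MAX_MERGE = 3      # assumed merge limit (three graphemes per character)
DELETE_COST = 1.0  # assumed penalty for dropping a spurious grapheme

def match_word(graphemes, word, char_score):
    n, m = len(graphemes), len(word)
    INF = float("inf")
    # cost[i][j]: best cost of explaining the first i graphemes
    # with the first j characters of the word.
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n:  # delete grapheme i (edge type 4)
                cost[i + 1][j] = min(cost[i + 1][j],
                                     cost[i][j] + DELETE_COST)
            if j < m:  # merge 1..MAX_MERGE graphemes into character j
                for k in range(1, MAX_MERGE + 1):
                    if i + k <= n:
                        s = char_score(graphemes[i:i + k], word[j])
                        cost[i + k][j + 1] = min(cost[i + k][j + 1],
                                                 cost[i][j] + s)
    return cost[n][m]
```

A lexicon would then be re-ordered by this cost, lowest first; the real system scores merged bitmaps with the neural network classifier rather than a toy function.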
For every word, we define the probability of a character being at the beginning of a word as the starting probability of that character, and the transition probability from a character c1 to a character c2 is defined as the probability of occurrence of c2 following c1 in a word. These probabilities are computed separately for street names, city-states and zip codes. Also, the transition probabilities for zip codes are position dependent, which means that the transition probability from c1 to c2 also depends on the position of the character c1 in the word.

In the training phase of the system, the system learns the neural network weights and computes the starting probabilities and the transition probabilities. The over-segmented characters are manually truthed. These over-segments are then combined to form characters, which are then normalized in size. Features are extracted from the normalized bitmaps and the neural networks (digit-only, alphabet-only and general) are then trained using the back propagation algorithm. The training phase of the system is shown in Figure 2.1 and the testing or running phase of the system is depicted in Figure 2.2.

Figure 2.1: The training phase of the system: The system for training the Neural Network classifiers is shown here. The starting and transition probabilities of characters and digits are also estimated in the training phase.

2.2 Modules

This section describes the modules of the system in some more detail.

Figure 2.2: System overview: The input to the system is the gray scale image of the mail-piece and the output is the destination address encoding.
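The starting and transition probabilities defined above can be estimated by simple counting over a truthed word list. The sketch below is ours, not the thesis code; for brevity it omits the position-dependent transitions used for zip codes.

```python
# Minimal sketch: estimate starting and transition probabilities
# from a list of truthed training words by relative-frequency counting.
from collections import Counter, defaultdict

def estimate_probabilities(words):
    start = Counter()
    trans = defaultdict(Counter)
    for w in words:
        if not w:
            continue
        start[w[0]] += 1                 # character at word beginning
        for c1, c2 in zip(w, w[1:]):     # adjacent character pairs
            trans[c1][c2] += 1
    n = sum(start.values())
    start_p = {c: k / n for c, k in start.items()}
    trans_p = {c1: {c2: k / sum(row.values()) for c2, k in row.items()}
               for c1, row in trans.items()}
    return start_p, trans_p
```

These tables would be computed separately for street names, city-states and zip codes, as described above.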
2.2.1 Camera Setup

A camera is used for taking an image of the mail-piece. A bar-code scanner is also attached, which tries to detect if the destination address is already encoded as a bar-code on the mail-piece (which is usually the case for non-personal mail). Only if the bar-code is not present on the mail-piece is the image sent to the postal address recognition system.

2.2.2 Image Binarization

A binary image has sufficient information for extracting the destination address. An adaptive thresholding [12] technique is used for binarization. In this technique, the threshold used for binarization at each pixel is a function of the gray levels of the neighboring pixels.

2.2.3 Address Block Location (ABL)

Given an image of a mail-piece, this module locates the destination address. Though the source address is supposed to be written in the upper left corner of the mail-piece, many mail-pieces do not follow this convention strictly and hence it may be confused with the destination address. The presence of stamps, post-marks, printed logos, advertisements and other information on the mail-piece makes ABL a difficult task. The technique for ABL proposed by Yu, Jain and Mohiuddin [13] is being used in our system.

2.2.4 Skew Detection and Correction

Skew is detected in an address block using projections in several directions. It is then corrected using the classical technique of two shifts along the horizontal direction and one shift along the vertical direction.

2.2.5 Line Separation

Line separation is also done using the horizontal projection of the image. The separation of the image into lines is done by a labeling algorithm which labels the pixels, where the labels denote the line numbers.
All the pixels having the same label are put in the same line.

2.2.6 Hand Written or Machine Printed

There are several differences between hand written and machine printed text. In machine printed text, the baselines are usually straight, the uppercase characters are of uniform height and the lowercase characters are also of uniform height. Also, the spacing between characters or connected components is fairly uniform in machine printed text, as opposed to hand written text. Because of these differences between the two kinds of text, different techniques are used for hand written and machine printed address recognition. Hence, we need to distinguish between the two types of text. Our system employs a simple discriminator based on the distribution of the heights and widths of connected components [14].

2.2.7 Connected Components Analysis

The connected component analysis is done by a two-pass labeling algorithm. The pixels in the same connected component are assigned the same label.

2.2.8 Slant Detection and Correction

The slant is computed by projecting black pixels along several directions. Care is taken so that only straight segments contribute to the computation of the slant. This helps in forming a sharp peak in the direction of the true slant. This slant detection step is fast, as it needs to make only a single pass over the black runs.

The slant correction is performed by shifting the rows of the image in the horizontal direction. Let the detected slant angle be θ and the distance of the row from the center row of that line be y; then the shift required by that row is given by x = y tan θ. Since the image is stored in the form of horizontal black runs, shifting is done efficiently by changing the starting and ending points of the black runs.
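The row-shifting formula x = y tan θ can be applied directly to the black-run representation, since shifting a row only offsets each run's endpoints. A hypothetical sketch (the representation of rows as lists of (start, end) runs is our own simplification):

```python
# Sketch of slant correction on a black-run image: each row is shifted
# horizontally by x = y * tan(theta), where y is the row's signed
# distance from the centre row of the line.
import math

def correct_slant(rows, theta_deg, center_row):
    theta = math.radians(theta_deg)
    corrected = []
    for r, runs in enumerate(rows):
        y = center_row - r                   # signed distance from centre
        shift = round(y * math.tan(theta))   # x = y * tan(theta)
        corrected.append([(s + shift, e + shift) for s, e in runs])
    return corrected
```

Only the run endpoints change, so the cost is proportional to the number of black runs, matching the single-pass efficiency noted above.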
2.2.9 Over-segmentation

As discussed before, machine printed text typically has several uniform features, such as uniform spacing between characters, uniform size for lowercase characters and uniform size for uppercase characters. Since these uniformities are not present in hand written text, it is difficult to expect a perfect segmentation algorithm. Hence, in our system we use an over-segmentation algorithm, followed by a dynamic programming technique for combining the over-segments to form characters.

The goal in this step is to recognize all the segmentation points at the cost of some false alarms. The smaller the number of false alarms, the better the over-segmentation scheme. Our over-segmentation scheme is based on the contour features of the connected components. While following the contour in the clockwise direction from the leftmost to the rightmost pixels, smooth valleys, sharp valleys and horizontal stretches are identified in a single traversal. The lowest point in the valley and the middle of the horizontal stretches are used as segmentation points. The corresponding points on the lower contour are subsequently found and then screening is done to drop some cuts (a point on the upper contour and a point on the lower contour together represent a cut) which cannot possibly be separating two different characters. This process is explained in detail in the next chapter.

2.2.10 Optical Character Recognition (OCR)

Our system uses 1-hidden layer neural networks as character classifiers. If a given word is known to contain only digits (e.g., a zip code) or only alphabetic characters, then the digit-only or character-only classifier is used for better recognition. All these neural networks are 1-hidden layer neural networks and have been trained using the back propagation algorithm. We have also developed and tested combinations of more than one neural network which optimize the word level classification error. This work is presented in Chapter 7.
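For illustration, the forward pass of such a 1-hidden layer network with the sigmoid activation f(x) = 1/(1 + e^(-βx)) can be sketched as below. The weights, layer sizes and function names here are made up for the example; the real classifiers are trained with back propagation on truthed segments.

```python
# Illustrative forward pass of a 1-hidden layer feed-forward network
# of the kind used as the character classifier (weights are made up).
import math

def sigmoid(x, beta=1.0):
    # Standard sigmoid f(x) = 1 / (1 + e^(-beta * x))
    return 1.0 / (1.0 + math.exp(-beta * x))

def forward(x, W1, b1, W2, b2):
    # Hidden layer: one sigmoid unit per row of W1.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # Output layer: one sigmoid unit per class (row of W2).
    return [sigmoid(sum(w * hi for w, hi in zip(row, h)) + b)
            for row, b in zip(W2, b2)]
```

The output vector gives per-class confidences, which the dynamic programming matcher consumes as character match scores.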
2.2.11 Zip-code Extraction

A zip code (5 or 9 digits) has the city or state encoded in it. We know that lexicon-assisted recognition typically has a better performance than recognition without a lexicon. But the number of city-state pairs is so large (roughly 40,000) that it becomes difficult to directly take advantage of it. The zip extraction stage simplifies this task by producing a list of possible zip codes using a lexicon of valid zips.

A dynamic programming implementation of a Hidden Markov Model (HMM) based technique is used for zip location and for identifying the digit boundaries. We try to locate a 5-digit zip code or a 9-digit zip code in the address lines, starting from the bottommost line and going upwards. If a zip code is found, then a digit recognizer is used to enlist the top choices for the zip code (5 digits or 9 digits). This list of zip codes is then screened using the lexicon of valid zip codes, and a lexicon of possible city-state names is generated dynamically. Currently, the system rejects the mail-piece if no zip code is recognized; such mail is handled by the Video Coding System (VCS), discussed later.

2.2.12 City-State Recognition

The entries in the dynamically generated city-state lexicon are matched against a set of over-segments which is hypothesized to contain the city and state names. The matching process is very similar to the one used for locating and recognizing the zip code. As a result of this dynamic matching process, the lexicon of city-state names is re-ordered based on the match values.

2.2.13 Street Level Recognition

We have proposed using the top two city names from the previous step to generate all the possible street names.
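The screening step can be sketched as follows: recognizer candidates are kept only if they appear in the lexicon of valid zip codes, and the survivors index into a zip-to-city-state table to build the dynamic city-state lexicon. The function name and the sample data are illustrative only.

```python
def screen_zip_candidates(candidates, valid_zips, zip_to_city_state):
    """Filter recognizer output against the lexicon of valid zip codes
    and dynamically build the city-state lexicon for the survivors."""
    surviving = [z for z in candidates if z in valid_zips]
    city_state_lexicon = {zip_to_city_state[z] for z in surviving}
    return surviving, sorted(city_state_lexicon)

# Illustrative data only, not the system's actual lexicons.
valid = {"48824", "95120"}
table = {"48824": ("EAST LANSING", "MI"), "95120": ("SAN JOSE", "CA")}
zips, lexicon = screen_zip_candidates(["48824", "48624", "95120"], valid, table)
```

Here the misread candidate "48624" is dropped because it is not a valid zip, and the remaining zips yield the dynamic city-state lexicon used in the next stage.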
Then these street names can be matched using the same technique as was used for city-state recognition.

The street number can be recognized in the same way as the city-state is recognized. If we have a database of valid street numbers for every street, then it will aid the street level recognition. But, since the street number and name are present towards the middle of the address, they are more difficult to locate than the zip code, which is typically found at the end of the address. Also, some addresses have post box numbers but no street numbers, which must be handled as well.

2.2.14 Destination Address Encoding

The 5-digit zip code encodes the city, state, postal sectional center facility (SCF) and the post office. The four-digit add-on number identifies a geographic segment within the five-digit delivery area such as a city block, an office building, an individual high-volume receiver of mail, or any other unit that would aid efficient mail sorting and delivery. This 9-digit zip code is referred to as the Destination Address Encoding, since it encodes the destination address. USPS has added two more digits to the 9-digit zip code, mainly for the purpose of identifying addresses up to the apartment number, but our system is presently being designed for generating up to the 9-digit zip code only.

2.2.15 Video Coding System

This system handles all the mail-pieces rejected by any of the earlier modules. The image of the mail-piece is displayed on the terminal and the destination address encoding is done manually. The operator has access to city, state, zip code and street level lexicons, which aid in a manual encoding of the address. Also, the system supports several image operations such as rotation, zooming and binarization to make it easier for the operator to read the address.
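As a small illustration of this layered encoding, here is a hypothetical helper that splits a zip string into its 5-digit code, 4-digit add-on, and optional 2-digit delivery-point digits; the function name is an assumption for the example.

```python
def split_zip(zip_code):
    """Split a 5-, 9- or 11-digit zip string (dashes allowed) into the
    5-digit zip, the 4-digit add-on, and the 2-digit delivery-point
    digits; missing components are returned as None."""
    digits = zip_code.replace("-", "")
    assert digits.isdigit() and len(digits) in (5, 9, 11)
    return digits[:5], digits[5:9] or None, digits[9:11] or None
```

This mirrors the description above: the system generates up to the 9-digit code, while the last two digits would only be needed for apartment-level delivery points.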
2.3 Contributions of the Thesis

This thesis has made contributions to several major modules of the system. In particular, the following modules have resulted from the work done in the course of this thesis.

• Over-Segmentation: The algorithm was designed and implemented as a part of the thesis.

• Dynamic Programming and HMM: This scheme was developed for matching a given lexicon of words with a given word image. A generalized version of it has been implemented and is being used for zip code, city name, state name, street name, street number and P.O. box recognition.

• Zip Code Extraction: When a 5-digit zip code is requested, the above technique returns several possible 5-digit zip codes. Similarly, for a 9-digit zip code, the system returns several possible choices. A lexicon of valid zip codes and the matching confidences are used to arrive at a list of possible zip codes for the given address.

• City-State Recognition: The valid zip codes extracted in the previous step are used to generate a list of possible city and state names which are matched with the hypothesized subsequence containing the city-state names, using the same dynamic programming technique.

• Word Level Training of the Classifiers: This is implemented as a part of the OCR module to maximize word-level recognition.

Chapter 3

Character Classifier

3.1 Introduction

OCR has been a very active field of research since the mid-1950s. The two main subtasks in designing an OCR are (i) to select a feature set which has a good discrimination power, and (ii) a mechanism to use the feature set to arrive at the final classification of the input pattern. Factors like the size and composition of the training data, the initial values of the parameters used for learning and the error estimation criteria also affect the design of a classifier. An excellent survey of feature extraction methods is provided in [15].
Since character classifiers can be designed in so many different ways, there does not seem to be any clear winner. Researchers have suggested using combinations of classifiers to utilize the discriminant power of individual classifiers. Mao and Mohiuddin [16] have suggested using multiple classifiers for better OCR results. A study by Madhvanath and Govindaraju [17] suggests that for large lexicons, using a serial classifier is a good option.

The dynamic programming technique used for extracting zip codes or recognizing city and state names uses an OCR system to generate a posteriori probabilities for segments. The overall performance of the matching scheme, therefore, depends a lot on the performance of the OCR system. In this chapter, the OCR method used in our system is discussed. The issue of combining classifiers is discussed in Chapter 7.

3.2 Features

From the input camera we obtain a gray scale image, but at a high resolution and for good quality images, a binary image does not have significantly less information than the gray scale image. We need to extract features for a given grapheme or segment (a combination of graphemes). These graphemes are obtained by the over-segmentation scheme discussed in Chapter 4. Various kinds of features such as those based on projection, template matching, image transforms, moments, etc. have been tried for cursive script graphemes and segments. We have found that the following feature set consisting of 108 features is a good feature set for the 1-hidden layer neural net classifiers used as character classifiers in our system. These 108 features can be categorized into the following two types.
• 88 Contour Direction Features [18]: These features are extracted by dividing the normalized 24 x 16 image (input normalization reduces intra-character variability, thereby simplifying character recognition) into diagonal and rectangular zones and computing histograms of boundary directions in these zones, as shown in Figure 3.1. The vertical projection has four slices and the others have six slices (22 slices in total). The image is scanned by 2x2 masks which have between one and three black pixels to detect primitive features in each of the zones. The primitive features are grouped into four categories based on the contour directions (horizontal, vertical and two diagonal directions). Histograms are computed by summing the number of primitive features detected along the four contour directions for each of the 22 zones. This results in a total of 88 features (22 slices x 4 directions). These features are then normalized by dividing the 88 features by the maximum possible values of those features.

Figure 3.1: Features based on contour directions

• 20 Cutting Point Features: The shape of a character is typically influenced by the preceding and the following graphemes. These 20 features, which were suggested by Mao [14], add some contextual information.

Let s₀ be the segment for which features are being computed, s₋₁ be the preceding and s₁ be the following graphemes. Let (sᵢ, sⱼ) be the segment obtained by joining sᵢ and sⱼ. The following 4 features capture the relative size information:

width(s₀)/width(s₀, s₁), width(s₀)/width(s₋₁, s₀), height(s₀)/height(s₀, s₁), height(s₀)/height(s₋₁, s₀)

Figure 3.2: An example of segments s₋₁, s₀ and s₁ for the 20 cutting point features

Let c₁, c₂, c₃ and c₄ be the cut points, if s₀ was connected on both sides before over-segmentation.
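The four relative-size features can be sketched as below, assuming each segment is represented by its bounding box and that the box of a joined segment is the union of the two boxes; all names are hypothetical.

```python
def relative_size_features(prev_box, cur_box, next_box):
    """Compute the 4 relative-size cutting point features for segment s0,
    given the bounding boxes of s-1, s0 and s1.
    Each box is (left, top, right, bottom) in pixel coordinates."""
    def width(box):
        return box[2] - box[0]
    def height(box):
        return box[3] - box[1]
    def join(a, b):
        # Bounding box of two adjacent segments taken together.
        return (min(a[0], b[0]), min(a[1], b[1]),
                max(a[2], b[2]), max(a[3], b[3]))
    return (
        width(cur_box) / width(join(cur_box, next_box)),    # width(s0)/width(s0,s1)
        width(cur_box) / width(join(prev_box, cur_box)),    # width(s0)/width(s-1,s0)
        height(cur_box) / height(join(cur_box, next_box)),  # height(s0)/height(s0,s1)
        height(cur_box) / height(join(prev_box, cur_box)),  # height(s0)/height(s-1,s0)
    )
```

Each ratio lies in (0, 1], approaching 1 when the neighboring grapheme is narrow (or short) relative to s₀.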
c₁ and c₂ are the points on the upper contour and the lower contour respectively at the boundary of s₋₁ and s₀. Similarly, c₃ and c₄ are the points on the upper contour and the lower contour respectively at the boundary of s₀ and s₁.

Let cᵢ represent either c₁ or c₂. For each cᵢ, the following features are computed (if s₀ was not connected to the left side before segmentation, these 8 features are all set to 0):

- x(cᵢ)/width(s₋₁, s₀) and y(cᵢ)/height(s₋₁, s₀), where the origin is the upper left corner of (s₋₁, s₀).

- x(cᵢ)/width(s₋₁) and y(cᵢ)/height(s₋₁), where the origin is in the upper left corner of s₋₁.

Similarly, let cⱼ represent either c₃ or c₄. For each cⱼ, the following features are computed (if s₀ was not connected to the right side before segmentation, these 8 features are all set to 0):

- x(cⱼ)/width(s₀, s₁) and y(cⱼ)/height(s₀, s₁), where the origin is the upper left corner of (s₀, s₁).

- x(cⱼ)/width(s₁) and y(cⱼ)/height(s₁), where the origin is in the upper left corner of s₁.

3.3 Character Classifiers

Multilayer feed-forward networks have been used for classification in a large variety of applications and also have been used extensively for OCR. Several architectures with different numbers of hidden layers and interconnections are possible for neural networks. The feature set used as the input determines the performance of the neural network. One simple way is to use the neural network both as a feature extractor and a classifier, by giving a low resolution (scaled down) binary image of the character as input to the neural network.

A 2-layer standard feed-forward network with an arbitrarily large number of hidden units has been shown to be capable of forming an arbitrary decision boundary and approximating any continuous non-linear function. It automatically builds an internal representation of patterns, although it is often difficult to interpret these representations.
The neural network classifiers used in our system are standard feed-forward neural networks with one hidden layer (see Figure 3.3) trained using the back-propagation algorithm [19, 20].

Figure 3.3: A 1-hidden layer feed-forward neural network

In our system we are using 1-hidden layer neural networks for OCR, mainly for their simple structure and fast trainability. They use the feature set discussed in the previous section.

3.3.1 Cost Function

Let {(x⁽¹⁾, d⁽¹⁾), (x⁽²⁾, d⁽²⁾), ..., (x⁽ᵖ⁾, d⁽ᵖ⁾)} be a set of p training patterns (input-output pairs), where x⁽ⁱ⁾ ∈ Rⁿ is the input vector in the n-dimensional pattern space, and d⁽ⁱ⁾ ∈ [0, 1]ᵐ, an m-dimensional hypercube. For classification purposes, m is the number of classes. In our case, p is the number of characters used for training. Let y⁽ⁱ⁾ be the output vector of the neural network corresponding to the input vector x⁽ⁱ⁾. These notations have been taken from [21]. The squared error cost function (E⁽¹⁾), most frequently used in the Artificial Neural Network literature, is defined as:

E₁⁽ⁱ⁾ = ||y⁽ⁱ⁾ − d⁽ⁱ⁾||²    (3.1)

E⁽¹⁾ = (1/p) Σᵢ₌₁ᵖ E₁⁽ⁱ⁾    (3.2)

The Kullback-Leibler distance E⁽²⁾ is given by:

E₂⁽ⁱ⁾ = Σⱼ₌₁ᵐ dⱼ⁽ⁱ⁾ log(dⱼ⁽ⁱ⁾/yⱼ⁽ⁱ⁾)    (3.3)

E⁽²⁾ = (1/p) Σᵢ₌₁ᵖ E₂⁽ⁱ⁾    (3.4)

The Kullback-Leibler distance assumes that the output of the neural network can be interpreted as an a posteriori probability for the input normalized image to belong to the corresponding class. Since the softmax function (defined later) normalizes the output probabilities to 1, it is used as the activation function at the output layer when the Kullback-Leibler function is used as the error cost function.
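A sketch of the two cost functions in plain Python (function names are hypothetical): the per-pattern squared error (3.1) and Kullback-Leibler distance (3.3), averaged over the p training patterns as in (3.2) and (3.4).

```python
import math

def squared_error(outputs, targets):
    """E1 for one pattern: squared Euclidean distance between the
    network output vector y and the target vector d (Equation 3.1)."""
    return sum((y - d) ** 2 for y, d in zip(outputs, targets))

def kullback_leibler(outputs, targets):
    """E2 for one pattern: sum_j d_j * log(d_j / y_j) (Equation 3.3).
    Terms with d_j == 0 contribute 0 by the usual convention."""
    return sum(d * math.log(d / y) for y, d in zip(outputs, targets) if d > 0)

def batch_cost(per_pattern_cost, batch_outputs, batch_targets):
    """Average a per-pattern cost over the p training patterns,
    as in Equations (3.2) and (3.4)."""
    costs = [per_pattern_cost(y, d)
             for y, d in zip(batch_outputs, batch_targets)]
    return sum(costs) / len(costs)
```

With a one-of-m target vector d, the Kullback-Leibler cost reduces to the negative log of the output assigned to the true class, which is why it pairs naturally with softmax outputs.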
When the Squared Error criterion is used as the error cost function, the activation function at the output layer in our system is the sigmoid function. In all our networks, we are using the sigmoid function as the activation function at the hidden layer units.

3.3.2 Activation Functions

Let us consider a network with no hidden layers, where {x₁, x₂, ..., xₙ} denote the n units of the input layer and {y₁, y₂, ..., yₘ} denote the m units of the output layer. Let wᵢⱼ be the weight of the edge connecting xᵢ to yⱼ. The hidden layer units and the output layer units of our neural networks use the following kinds of activation functions:

• Sigmoid Function: The standard sigmoid function is given by f(x) in Equation (3.6).

yⱼ = f(Σᵢ₌₁ⁿ wᵢⱼxᵢ + w₀), where w₀ is a constant    (3.5)

f(x) = 1/(1 + e^(−βx)), where β is the slope parameter    (3.6)

It is used as the activation function in the hidden layers in our neural networks. Since the sigmoid function lies between 0 and 1, the outputs of the neural network units using the sigmoid function can be interpreted as a probability, but the problem is that all the outputs don't sum to 1. This is taken care of by the softmax function. In our networks we are using the sigmoid function also at the output units when the Squared Error criterion is used.

Figure 3.4: The standard sigmoid function f(x) = 1/(1 + e^(−βx)). This plot is for β = 1

• Softmax Function: The standard softmax function is given by g(x) in Equation (3.9).

hⱼ = Σᵢ₌₁ⁿ wᵢⱼxᵢ + w₀ⱼ, where w₀ⱼ is a constant    (3.7)

yⱼ = g(hⱼ)    (3.8)

g(hⱼ) = e^(hⱼ)/Σₖ₌₁ᵐ e^(hₖ)    (3.9)

From the above equations we can derive the following properties:

Σⱼ₌₁ᵐ g(hⱼ) = 1    (3.10)

0
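The two activation functions can be sketched directly from Equations (3.6) and (3.9). The function names are hypothetical, and the softmax here subtracts the maximum before exponentiating, a standard numerical-stability step that does not change the result.

```python
import math

def sigmoid(x, beta=1.0):
    """Standard sigmoid, Equation (3.6): 1 / (1 + exp(-beta * x))."""
    return 1.0 / (1.0 + math.exp(-beta * x))

def softmax(h):
    """Standard softmax, Equation (3.9): exp(h_j) / sum_k exp(h_k).
    Subtracting max(h) first avoids overflow without changing the output."""
    m = max(h)
    exps = [math.exp(v - m) for v in h]
    total = sum(exps)
    return [e / total for e in exps]
```

As the property in Equation (3.10) states, the softmax outputs always sum to 1, whereas a vector of per-unit sigmoid outputs need not.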