”mu .. ,— 2-4.. - 41.»: 3—, _ .- 23.... . ‘: 2;: , 4:“; ‘.’ 'O 331-" u» - up». d .. "3‘1 u .0 . dlv w '3‘. ;.. ...- J. u—o‘e‘ 513‘ ' $5.57 hfl (-J‘ ‘r ' § .' .: --~« 7 M ’33: . -,-- 4,»: a V fiégh‘g ‘ 1 o a . rum. ’ a"; b 5’4“ “ “21%?2é' :7 a r . fifty .1 u u . ‘ Rf if; JEN»; . THESIS 2 lllllllllllllllWilliwill 3 01688 4664 This is to certify that the thesis entitled XANTHOMONAS PATHOVARS IDENTIFICATION THROUGH A NEURAL NETWORK-BASED GENOMIC FINGERPRINT CLASSIFICATION SYSTEM presented by Fei Ni Tuang has been accepted towards fulfillment of the requirements for M.S degree in Biosystems Engineering Major professd Date W Zé/ /476> 0-7639 MS U is an Affirmative Action/Equal Opportunity Institution LIBRARY Michigan State Unlverslty PLACE IN RETURN BOX to remove this checkout from your record. TO AVOID FINES return on or before date due. DATE DUE DATE DUE DATE DUE MAR 2 3. 20W 1/” WWW“ XANTI-IOMONAS PATHOVARS IDENTIFICATION THROUGH A NEURAL NETWORK-BASED GENOMIC FINGERPRINT CLASSIFICATION SYSTEM By Fei Ni Tuang ATHESIS Submitted to Michigan State University In partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Department of Agricultural Engineering 1998 ABSTRACT XANT H OMONAS PATHOVARS IDENTIFICATION THROUGH A NEURAL NETWORK-BASED GENOMIC FINGERPRINT CLASSIFICATION SYSTEM By Pei Ni Tuang A genomic fingerprint classification system was developed to identify 63 Xanthomonas pathovars. Three sets of genomic fingerprints generated from repetitive DNA sequence-based polymerase chain reaction (rep-PCR), using BOX, ERIC and REP primers, were used in this research. In addition, a fourth set of BER fingerprints was formed by linearly combining the BOX, ERIC and REP fingerprints. Mean and wavelet filter techniques were used to reduce noise on the fingerprints. Several backpropagation neural network (BPN) classifiers were trained using the BOX, ERIC, REP and BER original fingerprints and filtered fingerprints. Both mean and wavelet filtering helped improve the recognition rates. Wavelet filtering was better at reducing misclassification error rates, and mean filtering was better at reducing false rejection error rates. The average top-2 recognition rates of BOX, ERIC, REP and BER BPN classifiers were 95%, ‘93%, 92% and 98%, respectively. By combining the results of the BOX, ERIC and REP BPN classifiers with the lowest misclassification error rates, a top-1 recognition rate of 95% was achieved together with a misclassification error rate of 0.57% and a false rejection rate of 4.3%. Copyright by FBI NI TUAN G 1998 ACKNOWLEDGEMENTS I would like to express my sincere gratitude to my committee members: Dr. Evangelyn Alocilja (Biosystems Engineering), Dr. John Gerrish (Biosystems Engineering), and Dr. John Weng (Computer Science) for their insight, guidance and support in the course of this study. My sincere thanks to Jan Rademaker and Dr. Frans de Bruijn of MSU DOE-Plant Research Laboratory for providing the genomic fingerprint data and giving me the opportunity to learn the rep—PCR genomic fingerprinting technique in the laboratory. I am also appreciative for the funding provided by Dr. Robert von Bemuth, through the Manure Management and Nutrient Balance for Dairy project, and the fellowship provided by the College of Agriculture and Natural Resources. I would like to thank the faculty, staff and students of the Department of Agricultural Engineering for providing a challenging learning environment. Special thanks to my friends at MSU: Thomas Moen, Tse-chia Yu, Erica Yang, Ismail Kavdir, Pankaj Jagtap, Jim Schaper and Takako Inagaki for their support and encouragement. Finally, a very special thank you to my parents and my entire family for their never-ending guidance, support, encouragement and love. iv TABLE OF CONTENTS LIST OF TABLES .......................................................................................................... vii LIST OF FIGURES ........................................................................................................ viii LIST OF ABBREVIATIONS .......................................................................................... xi INTRODUCTION ............................................................................................................. 1 Objectives .............................................................................................................. 3 CHAPTER I LITERATURE REVIEW .................................................................................................. 4 1.1 Bacterial Taxonomy and Classification System .............................................. 4 1.2 The genus Xanthomonas ................................................................................. 7 1.3 rep-PCR Genomic Fingerprinting ................................................................... 8 1.4 Digital Image Processing ............................................................................... 10 1.5 Backpropagation Neural Network ................................................................. 14 1.6 Performance Evaluation ................................................................................ 18 1.7 Systems Analysis ........................................................................................... 19 CHAPTER H MATERIALS AND METHODS ..................................................................................... 22 2.1 Data Collection .............................................................................................. 22 2.2 Systems Analysis ........................................................................................... 24 2.2.1 Data Modeling ....................................................................................... 25 2.2.2 Process Modeling ................................................................................. 29 2.3 Data Processing .......................................................................................... 31 2.3.1 Data Input ......................................................................................... 31 2.3.2 Image Filtering .................................................................................. 32 2.4 BPN Classifier Design ................................................................................ 33 2.4.1 Data Partitioning ................................................................................ 33 2.4.2 BPN Training .................................................................................... 35 2.5 BPN Classifier Testing ............................................................................... 37 2.6 Performance Evaluation ............................................................................. 38 2.7 Graphical User Interface ............................................................................. 39 CHAPTER III RESULTS AND DISCUSSION ................................................................................... 41 3.1 Image Filtering ........................................................................................... 41 3.2 BPN Training ............................................................................................. 46 3.3 BPN Classifier Testing ............................................................................... 48 3.4 Perfonmnce Evaluation ............................................................................. 50 3.5 User Scenario ............................................................................................. 59 CHAPTER IV SUMMARY AND CONCLUSIONS ........................................................................... 68 CHAPTER V RECOMMENDATIONS ............................................................................................. 7O APPENDICES ............................................................................................................ 71 Appendix A. rep-PCR Genomic Fingerprint Data ............................................ 72 Appendix B. Database and Data Files .............................................................. 86 Appendix C. Program Listing ........................................................................... 93 LIST OF REFERENCES ........................................................................................... 104 vi LIST OF TABLES Table 1.1 Data Modeling Concepts and Definitions ..................................................... 21 Table 2.1 Notations of Data Flow Diagram .................................................................. 29 Table 3.1 BPN Training Results (Set 1) ....................................................................... 46 Table 3.2 BPN Training Results (Set 2) ....................................................................... 47 Table 3.3 BPN Classifiers Performance Evaluation ..................................................... 51 Table 3.4 Performance of Combined BPN Classifiers .................................................. 52 Table 3.5 Bacterial Strains Wrongly Identified or Rejected by BEROO-l BPN ............. 53 Table 3.6 Bacterial Strains Wrongly Identified or Rejected by DAT02B-l BPN .......... 54 Table 3.7 Bacterial Strains Wrongly Identified or Rejected by DATOBE-l BPN ........... 55 Table 3.8 Bacterial Strains Wrongly Identified or Rejected by DAT02R-1 BPN .......... 56 Table 3.9 Bacterial Strains Wrongly Identified or Rejected by Combined BOX, ERIC and REP BPN Classifiers ........................................ 57 Table 3.10 Speed Performance of rep-PCR Genomic Fingerprint Classification System .................................................................................. 58 Table A] List of rep-PCR Genomic Fingerprint Data ................................................. 72 Table B.1 rep-PCR Genomic Fingerprint Database Properties ..................................... 86 Table B.2 BER Fingerprint Database Properties .......................................................... 88 vii LIST OF FIGURES Figure 1.1 Wavelet Multilevel Decomposition Tree ....................................................... 12 Figure 1.2 Model of a Neuron ......................................................................................... 14 Figure 1.3 Structure of a BPN ......................................................................................... 15 Figure 1.4 Systems Development Life Cycle .................................................................. 19 Figure 2.1 BOX-PCR genomic fingerprint (LMG 12141) .............................................. 23 Figure 2.2 BER Fingerprint (LMG 12141) ..................................................................... 23 Figure 2.3 Entity Relationship Diagram of rep-PCR Fingerprints .................................. 27 Figure 2.4 Entity Relationship Diagram of BER Fingerprints ........................................ 28 Figure 2.5 DFD of rep-PCR Genomic Fingerprint Classification System ...................... 30 Figure 2.6 DFD of Data Input ......................................................................................... 31 Figure 2.7 Scaling Function and Wavelet Function of db8 ............................................. 32 Figure 2.8 DFD of Image Filtering ................................................................................. 33 Figure 2.9 DFD of Data Partitioning ............................................................................... 34 Figure 2.10 The Architecture of a BPN Classifier .......................................................... 35 Figure 2.11 DFD of BPN Classifier Testing ................................................................... 37 Figure 2.12 A Tree Representation of the BPN Classification Preformance Analysis 39 Figure 3.1 BOX Fingerprint (LMG 12141) Filtered with Mean Filter ........................... 41 Figure 3.2 ERIC Fingerprint (LMG 12141) Filtered with Mean Filter ........................... 42 Figure 3.3 REP Fingerprint (LMG 12141) Filtered with Mean Filter ............................ 42 viii Figure 3.4 Reconstructed Image, Approximation and Details before Thresholding (BOX Fingerprint — LMG 12141 with db8 Wavelet at Level 4 Decomposition) .......................................................................................... 43 Figure 3.5 Detail Coefficients and The Associated Threshold for Noise Reduction (BOX Fingerprint - LMG 12141 with db8 Wavelet at Level 4 Decomposition) .......................................................................................... 43 Figure 3.6 BOX Fingerprint (LMG 12141) Filtered with db8 Wavelet at Level 3 Decomposition .............................................................................. 44 Figure 3.7 ERIC Fingerprint (LMG 12141) Filtered with db8 Wavelet at Level 3 Decomposition .............................................................................. 44 Figure 3.8 REP Fingerprint (LMG 12141) Filtered with db8 Wavelet at Level 3 Decomposition .............................................................................. 44 Figure 3. 9 BOX Fingerprint (LMG 12141) Filtered with db8 Wavelet at Level 4 Decomposition .............................................................................. 45 Figure 3.10 ERIC Fingerprint (LMG 12141) Filtered with db8 Wavelet at Level 4 Decomposition ............................................................................. 45 Figure 3.11 REP Fingerprint (LMG 12141) Filtered with db8 Wavelet at Level 4 Decomposition ............................................................................. 45 Figure 3.12 BER Fingerprint (LMG 12141) Filtered with db8 Wavelet at Level 3 Decomposition ............................................................................. 46 Figure 3.13 DATOOB BPN Training and Generalization Sum Square Errors ................ 48 Figure 3.14 BPN Output Scores of a Bacterial Strain with an Unique Identification (DATOOB-l — LMG 471) .......................................................................... 49 Figure 3.15 BPN Output Scores of a Bacterial Strain with No Identification (DATOOB-1 — LMG 471) .......................................................................... 50 Figure 3.16 BPN Output Scores of a Bacterial Strain with Two Identification (DATOOB-l - LMG 471) .......................................................................... 50 Figure 3.17 A Snapshot of rep-PCR Genomic Fingerprint Classification System ......... 59 Figure 3.18 Data Input Screen ..................................................................................... 60 Figure 3.19 Data Selection Screen ............................................................................... 61 ix Figure 3.20 Image Filtering Settings Screen ................................................................. 61 Figure 3.21 Filtering Output Screen ............................................................................. 62 Figure 3.22 Target Pathovar Screen ............................................................................. 63 Figure 3.23 Training / Testing Sets Screen ................................................................... 63 Figure 3.24 Non-target Pathovar Screen ...................................................................... 64 Figure 3.25 BPN Settings Screen ................................................................................. 64 Figure 3.26 Classification Setting Screen ..................................................................... 65 Figure 3.27 BPN Output Screen ................................................................................... 66 Figure 3.28 BPN Classification Analysis Screen .......................................................... 67 Figure 3.29 Classification Comparison Screen ............................................................. 67 Figure A] Sample rep-PCR Genomic Fingerprint Data File (First Two Strains) .......... 84 LIST OF ABBREVIATIONS DNA .................................................................................... Deoxyribonucleic acid PCR ................................................................................ Polymerase chain reaction rep-PCR ...................................................... Repetitive DNA sequence-based PCR BPN ..................................................................... Backpropagation neural network GUI .................................................................................... Graphical user interface Pathovar ..................................................................................... Pathogenic variant bp ............................................................................................................. Base pairs IE ...................................................................................... Information Engineering ERD ............................................................................. Entity relationship diagram DFD ........................................................................................... Data flow diagram SSE ............................................................................................... Sum square error SQL ............................................................................. Structured Query Language xi INTRODUCTION Bacteria are one of the smallest life forms on Earth, but they have significant impacts on large organisms and on the ecosystem as a whole. Some bacteria, such as the Xanthomonas Species, are plant pathogens. They can cause diseases in a wide variety of agriculturally and economically important plants. Xanthomonas infections of fruits and vegetables cause considerable loss in yield (Hayward, 1993). Identification of the bacterial species and pathogenic variants (pathovars) will lead to disease control, and prevention of losses in crop yield. In addition to its crucial role in pathological studies, bacterial identification also plays an important role in ecological studies, biodiversity assessment and microbial screening for industrial application. Among the latest developments in bacterial taxonomy are: (1) molecular identification techniques based on genotypic features and (2) computer-assisted classification with numerical analysis (Vauterin et al., 1993). Repetitive deoxyribonucleic acid (DNA) sequence-based polymerase chain reaction (rep-PCR) genomic fingerprinting technique is capable of rapidly generating large quantities of highly discriminated DNA fingerprints. The fingerprinting method, coupled with computer-assisted cluster analysis, was found to successfully classify and identify Xanthomonas pathovars (Rademaker et al., 1997). In cluster analysis, a Similarity matrix of the different fingerprints is used to generate a dendrogram. In the dendrogram, strains of bacteria belonging to the same pathovar are clustered together. A new bacterial strain is identified by comparing its fingerprint to fingerprints of other pathovars in a reference library. The bacterial strain is assigned to the pathovar with the closest match based on the dendro gram (Rademaker and de Bruijn, 1997). Backpropagation neural network (BPN) is a type of artificial neural network that is suitable for use as a pattern classifier. The BPN is a feedforward network made up of a collection of interconnected nodes or processing elements (Castleman, 1996). It can be trained to recognize and classify complex patterns. Using a supervised training approach, training data consisting of input and desired output pairs are presented to the network repetitively, and the connection weights are adjusted until the error between the network output and desired output converges (Haykin, 1994). A network can be used to recognize patterns for which it is trained. The use of rep-PCR genomic fingerprints with BPN can be an efficient approach to identify newly isolated bacterial strains. A trained network can identify bacterial pathovars based on its internal representation, without comparing them to other fingerprints in the reference library. The massive computation needed to learn the classifications is done only once during the training phase. The advantage of using the BPN approach over the cluster analysis approach will be appreciated when the size of the reference library and the number of strains that needs to be identified are large. Noise and variability in the genomic fingerprints are factors affecting the accuracy of BPN classification. Noise is introduced in the process of fingerprinting, and is caused by some physical, chemical and biological factors in PCR reactions, gel electrophoresis and photography. Filtering techniques such as local averaging mean filter and wavelet analysis offer possible noise reduction. Classification accuracy may be improved with the use of these filtering techniques. Since the amount of data to be organized and processed for BPN training, testing, and classification, is large, an information system is essential. Graphical visualization capability is also necessary for visualizing fingerprint patterns and examining within class variations. In addition to a classifier component, a good classification system should have an integrated database, data processing and data visualization components, and a graphical user interface (GUI) that links every component together and serves as a means for the user to communicate with the system. This will lead to a more efficient and standardized classification procedure since every component is built into the system. Objectives The objectives of this research are: 1. To build a rep-PCR genomic fingerprint classification system that is rapid, accurate and easy to use. 2. To investigate the effect of noise reduction on classification accuracy. 3. To evaluate the performance of the several BPN classifiers in identifying Xanthomonas pathovars. Chapter 1 LITERATURE REVIEW 1.1 Bacterial Taxonomy and Classification Systems Taxonomy is the Study of the classification of living things (King and Stanfield, 1997). It involves the study of relationships among living things and their arrangement into categories. Bacterial classification is important for several reasons. Firstly, it serves as a means of summarizing and cataloguing information about an organism. Secondly, it forms the basis for the recognition of new isolates. Thirdly, it provides an insight into the origin and evolutionary pathways of bacteria and higher organisms (Priest and Austin, 1993). Fourthly, it is fundamental in understanding ecological processes in a biosphere with diverse species and complex biological interactions. Fifthly, it plays an important role in disease control and microbial screening for industrial applications (T owner and Cockayne, 1993). To effectively meet the classification objectives, a classification method must satisfy the following three criteria: high stability, good predictivity and high objectivity (Priest and Austin, 1993). While biological classifications are subject to change in time (this is unavoidable due to the discovery of Species or the collection of new data), there are still some measures that can be taken to enhance classification stability. Firstly, all relevant objects should be included before trying to form groups. Secondly, the significant characteristics of the objects and all their variations should be included in the study (Pankhurst, 1991). To enhance predictivity and to include more generalizations about the taxa involved, classifications should be based on high information content. To increase objectivity, the development of a classification Should be empirical, reproducible and scientifically based (Priest and Austin, 1993). While the development of a classification system is often data dependent, these criteria and measures provide some general guidelines. The development of bacterial taxonomy can be traced back to the seventeenth century in the work by Morison in 1672 and Ray in 1688 (Pankhurst, 1991). The early classifications were based on morphological information as obtained from light microscopy. Following the interest in morphology, pure culture techniques were developed to study bacterial metabolism and physiological information (Priest and Austin, 1993). The traditional bacterial classifications were mainly based on phenotypic properties such as biochemical, nutritional and physiological features. In the 19603 and early 19705, numerical taxonomy was developed to provide more objective analysis and to automate the process (Goodfellow et al., 1997). Some computer-assisted numerical classification methods include multi-access and hypertext key, expert systems (Edwards and Morse, 1995), matching and clustering using similarity coefficient, maximum likelihood classification, discriminant analysis (Pankhurst, 1991) and neural networks (Weeks and Gaston, 1995). With computer-assisted classification, a comparison of large numbers of features and strains is allowed (Vauterin etal., 1993). The numerical methods greatly enhance classification accuracy and efficiency. In a survey paper by Wilkins et al. (1996), two statistical methods, including parametric Gaussian multivariate statistical method (GAUSS) and non-parametric K- nearest neighbour method (KNN), and four artificial neural network paradigms, including multilayer perceptron networks (MLP), learning vector quantization networks (LVQ), radial basis function network (RBF) and asymmetric radial basis function network (ARBF), were compared for their ability to identify phytoplankton species from flow cytometry data. From the results of the studies, all classifiers had similar identification success, with recognition rate within 3.4% of each other. The relative merits of each classifier were compared. The MLP was found to be reliable, robust, sirnple-to-use and rapid in operation, but had a longer training time in comparison with other methods. Training is not required for the KNN classifier. However, it is slow in operation and requires storage of a large representative training set (Wilkins et al., 1996). The MLP is the most popular class of multilayer feed-forward network, which is also called backpropagation neural network (BPN) after its back-propagation learning algorithm (Jain et al., 1996). . Recent advances in molecular biology make possible the classification of bacteria using detailed genotypic properties. Some nucleic acid methods, such as DN A-DN A hybridization, DNA-RNA hybridization and nucleic acid sequencing, are widely used in microbial taxonomy (Macdonell and Colwell, 1985). In addition, rep-PCR technique was also found to be reliable, reproducible and rapid in generating highly discriminated fingerprints for bacterial identification and classification (Rademaker and de Bruijn, 1997). In contrast to the traditional classifications that are based mainly on a few phenotypic properties, the molecular identification and typing methods lead to more general and natural classifications of bacterial Species (Towner and Cockayne, 1993). 1.2 The Genus Xanthomonas Xanthomonas species are plant-associated bacteria. Xanthomonas infections occur on at least 124 monocotyledonous and 268 dicotyledonous plant species (Leyns et al., 1984). Some examples of the host plants are cotton (Zomorodian and Rudolph, 1993), rice (Mew, 1993), mung bean (Vidaver, 1993), soybean (Hokawat and Rudolph, 1993), citrus fruit (Stall and Civero, 1993), cabbage, broccoli, cauliflower (Schaad and alvarez, 1993), forage grasses (Leyns, 1993), tomato, pepper (Stall, 1993), apricot, cherry, almond, peach, nectarine, plum (Civerolo and Hattingh, 1993), poplar (Ride, 1993), strawberry (Rat, 1993), sugarcane (Rott, 1993), barley, rye, wheat (Duveiller and Maraite, 1993) and mango (Pruvost and Manicom, 1993). Most of these are important agricultural crops and fruits grown in different parts of the world. The disease symptoms include necrosis, gummosis and vascular or parenchymatous diseases on leaves, stems or fruits of the plants. The disease may cause significant loss in yield (Hayward, 1993). Xanthomonas cells are Gram-negative rods, measuring 0.2-0.6].tm by 0.8-2.9um (Swings et al., 1993). When cultivated on common growth media, most Xanthomonas strains form yellow, water-insoluble pigments called Xanthomonadins (Swings et al., 1993). Most Xanthomonas species have a polar flagellum and are strictly aerobic chemoorganotrophs (Leyns et al., 1984). The genus Xanthomonas was proposed by Dowson in 1939 and is classified under the superkingdom Prokaryota, kingdom Monera, subkingdom Eubacteria, division Gracilicutes, phylum Proteobacteria and subclass Gamma (Margulis and Schwartz, 1998; Stackebrandt et al., 1988). Seven species are recognized within the genus Xanthomonas, namely: X. albilineans, X. axonopodis, X. campestris, X. fragariae, X. populi, X. maltophilia (not a plant pathogenic species) and X. oryzae. Nearly every new species and subspecies or pathogenic variant (pathovar) of Xanthomonas was named after the host plant (Vauterin et al., 1993). As a result, some of classifications may not reflect genomic relationships. Xanthomonas was reclassified by Vauterin et a1. (1995) based on a comprehensive study of 183 strains using DNA-DNA hybridization technique. The genus was shown to comprise 20 genomic species. Both the genomic relationships and the needs of plant pathologists for a rational nomenclature are taken into account in the new classification (Vauterin et al., 1995). 1.3 rep-PCR Genomic Fingerprinting rep-PCR genomic fingerprinting is a DNA amplification-based technique for generating bacterial fingerprints that can be used for species identification. It consists of the following steps: (1) bacterial cells and DNA isolation, (2) repPCR amplification, (3) gel electrophoresis, (4) gel image capture and (5) image processing and analysis (Rademaker and de Bruijn, 1997). Polymerase chain reaction (PCR) was devised by Mullis in 1984 for amplifying specific DNA sequences (Watson et al., 1992). A PCR cycle consists of three stages, namely, strand separation, hybridization of primers, and DNA synthesis (Stryer, 1995). To prepare the mixture for PCR, two oligonucleotide primers, a DNA polymerase, and all four deoxyribonucleoside triphosphates (dNTPs) are added to the target DNA. When heated at a temperature of about 94°C for one minute, the double-stranded DNA molecules separate completely, forming single strands that become the templates for the primers and DNA polymerase. The solution is then cooled to about 50°C, a temperature that allows the primers to anneal to the complementary sequences in the DNA molecules. With the primed templates generated, the temperature is then raised to about 65°C, the optimal temperature for the DNA polymerase. 65°C is maintained for eight minutes for the DNA synthesis to proceed. These three steps are repeated for 30 cycles. All newly synthesized strands serve as templates in each successive cycle; DNA sequences are amplified as a result. The amplification is a billionfold after 30 cycles (Watson et al., 1992; Stryer, 1995; Rademaker and de Bruijn, 1997). In rep-PCR, DNA primers complementary to naturally occurring, highly conserved repetitive DNA sequences present in the genomes of most Gram-negative and several Gram-positive bacteria are used. They include three families of repetitive DNA sequences, namely the 35-40 base pairs (bp) repetitive extragenic palindromic (REP) sequence, the 124-127 bp enterobacterial repetitive intergenic consensus (ERIC) sequence, and the 154 bp BOX element. The use of these primers in PCR leads to the selective amplification of distinct genomic regions between REP, ERIC or BOX elements. The collective name for this protocol is calledrep-PCR, and individually can be referred to as REP-PCR, ERIC-PCR and BOX-PCR with respect to the different primers (Rademaker and de Bruijn, 1997). Since the amplified fragments of DNA are of different sizes, they can be resolved in a gel matrix using gel electrophoresis to generate fingerprints (Rademaker and de Bruijn, 1997). In gel electrophoresis, negatively charged DNA molecules move from the negative pole toward the positive pole of the gel. The separation depends on the molecular weight of the molecules. Smaller molecules move faster than larger molecules. At the end of the gel electrophoresis, which usually runs for 18 hours, distinct bands of DNA can be found (Hartl, 1994; Rademaker and de Bruijn, 1997). The rep-PCR genomic fingerprints generated through PCR and gel electrophoresis permit differentiation to the species, subspecies and strain level (Rademaker and de Bruijn, 1997). The last Step of rep-PCR genomic fingerprinting involves making a hard copy of the fingerprint. The gel is stained and a photograph of the gel is taken. The gel image is processed and analyzed using numerical imaging tools (Rademaker and de Bruijn, 1997). GelCompare (Applied Maths, Kortrijk, Belgium) is a software package designed for gel image processing and analysis. 1.4 Digital Image Processing Image digitization is the first step towards machine pattern recognition. Physical images are converted into numerical representations by digitization. In the process, an image is divided into small regions called pixels and at each pixel location, the brightness of the image is sampled and quantized into a numerical gray level (Castleman, 1996). A digital image can be seen as an array of different gray levels. Digital images are used as inputs for pattern recognition systems. The design of a pattern recognition system includes the following five steps, namely: image segmentation, feature selection, classifier design, classifier training and performance evaluation. Image segmentation involves the isolation of individual objects to be recognized from the rest of the scene, using thresholding and edge detection techniques. Feature selection involves choosing object properties that can best distinguish the various classes of objects among themselves (Castleman, 1996). Several techniques can be used to enhance image quality and reduce undesired noise. Mean filter is one of the Simplest linear filters. It is implemented by a local 10 averaging operation where the value of each pixel is replaced by the average of all the values in the local neighborhood of size N (Jain et al., 1995): i+M h(i)=,-'.- Xfac) (1.1) t=i-M where h is the filtered image, f is the original image, i is the pixel location and M = L121]. The size of the neighborhood N is directly proportional to the amount of filtering. However, there is a trade-off between noise reduction and image detail. Sharp detail in the image is often lost and steep changes will be blurred into gradual changes as a result of linear filtering (Jain et al., 1995). Wavelet analysis offers another technique for noise reduction. Wavelet multiresolution analysis divides the frequencies of an image into octave bands, and the image can be analyzed at different scales (Strang and Nguyen, 1996). The wavelet de- noising procedure consists of the following three steps: (1) multilevel wavelet decomposition of the image, (2) detail coefficients thresholding at each level, and (3) wavelet reconstruction based on the original approximation coefficients and modified detail coefficients (Donoho, 1995). Discrete wavelet transform involves the decomposition of an image space into orthogonal subspaces (Strang and Nguyen, 1996). Let V0 3 V1 3 V2 3 D V; be a sequence of J nested subspaces, from fine to coarse scales. At each level j, the low frequency approximation of an image is captured in the scaling subspace Vj, and the high frequency detail of an image is captured in the complement wavelet subspace Wj. Each subspace V,--1 is the direct sum of the coarser subspace VJ- and its orthogonal complement Wj. The multiresolution decomposition of V0, the image space, can be written as V0=W|$W2$...$W1®Vj (1.2) 11 (Liang, 1997). A wavelet decomposition tree is Shown in Figure 1.1, where S is the image, and Aj and Dj represent the approximations and details, respectively (Misiti et al., 1996). m. {“7 Da Figure 1.1 Wavelet Multilevel Decomposition Tree Mn} and {ll/jg} form the orthonormal bases for the subspaces Vj and W-, respectively. A family of scaling basis functions can be obtained through dilations and translations of a scaling function as shown in Equation 1.3: ¢,-,. = 2"”¢(2"' x-k) (1.3) where ¢ is the scaling function, x is the pixel location, and j and k are integers representing the scales and positions (Misiti et al., 1996). Similarly, a family of wavelet basis functions can be obtained as Shown in Equation 1.4: V0,. = 2"’2W(2"' x-k) (1.4) where III is the mother wavelet and x, j and k are as previously defined. Some examples of wavelet families are Haar, Daubechies, Biorthogonal, Coiflets, Symlets, Morlet, Mexican Hat and Meyer (Misiti et al., 1996). The wavelet basis functions are dually localized in both time and frequency domains (Barclay and Bonner, 1997). 12 The approximation wavelet transform coefficients a“ are given by the inner product and the detail wavelet transform coefficients c“ are given by the inner product