7' V4" . . I~Il \ h?‘ i: ‘1‘: V..;' “ 39:02“ . .n, ‘ ”fin.” .. u. n .-‘.‘ 8‘ . '93:! . 1 4w: .5“ '1 m..:u u;- = ~ ‘ _ It V > v! 1“ 1’ . ‘ ' ' Jeff“ .‘34. ‘51,. V and . 3. LI, ‘3» , . 1%.; ““5 p x .. . 4.1. .‘L J! , .1 , 11.2%: ' (I . v?. '. ,' ... , £111 (1‘ .. t I «a. I ‘ ~ 1.3: ‘ g "m, n t‘i: Q! .9 1 .. .,. MN... .. 1 “3'3 ‘ . ,‘r'figflz Jr , . ,. , m ‘ . ’ . . ‘ A [yum‘iu' -\"\:. If , ‘ . -- -. . _ .nH‘rIH‘ ' ' ’ ' ‘, 1‘ ' L ‘6‘?” ‘ ' v t \ .U‘u:,‘,’;‘4f“v\¥‘;® ‘ “A "“' > ‘- ‘~ ".|1¢ w ‘a. ‘ «L'r', . LR‘U‘ ‘ , . . . . a‘ . , {‘3 ,5 “a“. 3 1 . I 32'3“??? I. 3“ J, , ‘ .r‘ . ,, V .3 ‘ {Pp _‘ 39:51:; ‘ ,. H ‘.\. 1““ THESIS lmillililillimilllll Date 01409 6527 This is to certify that the thesis entitled AN APPROACH TO BAR CODE IDENTIFICATION ON A CUSTOM COMPUTING MACHINE presented by Sandeep Dhameja has been accepted towards fulfillment of the requirements for Master's degree in Electrical Engineering 730-» Major professor 9/4/er 0-7639 MS U is an Affirmative Action/Equal Opportunity Institution LIBRARY MIchlgan State Universlty PLAOE II RETURN BOXtonmovothh Mouton yowrocord. TO AVOID FINES mum on or More data duo. DATE DUE DATE DUE DATE DUE ES 0 5 mr - e ‘ l MSU I. An Mun-tin Adlai/Equal Opportunlty Institulon Wows-m AN APPROACH TO BAR CODE IDENTIFICATION ON A CUSTOM COMPUTING MACHINE BY Sandeep Dhameja A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Department of Electrical Engineering 1995 ABSTRACT AN APPROACH TO BAR CODE IDENTIFICATION ON A CUSTOM COMPUTING MACHINE By Sandeep Dhameja This thesis focuses on a detailed study of the custom computing machine Splash 2 and suggests an approach to identifying bar codes based on mapping an algorithm using the multi-channel filtering technique onto the reconfigurable architecture. First the Splash 2 architecture is discussed in detail. The reconfigurable architecture of Splash 2 is based on Xilinx field programmable gate arrays (FPGAs) that behave as the processing elements, allowing investigation of compute- and data—intensive algorithms. The programming environment of Splash 2 is discussed in detail. The suggested approach to identify bar codes is supplemented by VHDL code for mapping the design onto FPGAs. Finally, the basic communication and computation algorithms are discussed along with the VHDL simulation code and the partition, place and route results, which provide estimates of spwd and complexity of the design. The experimental results obtained for the basic algorithms show that the application of identification of bar codes is suited for implementation on a custom computing machine. Copyright by SANDEEP DHAMEJ A 1995 Acknowledgments I would like to acknowledge the contributions of the following to my academic and personal pursuits. Firstly, I would like to thank Dr. Diane T. Rover, advisor for my research work at Michigan State University. I have been motivated and inspired by her to pursue research of my own interests. This thesis could not have been organized so carefully without Dr. Diane T. Rover’s meticulous edits. I would also like to thank the other members of my Oral Defense Committee, for their willing discussion of my efforts and participation in my degree program: Dr. P. David Fisher and Dr. Michael A. Shanblatt. I would like to thank Duncan Buell, Jeff Arnold and Brian Schott for their invaluable guidance and support at the Supercomputing Research Center, Bowie, Maryland. I appreciate the assistance of Dr. Anil K. Jain for giving me the opportunity to work at the Pattern Recognition and Image Processing laboratory. Several other colleagues and friends have been patient to listen to my ideas and have provided constructive comments at several stages of the research. Part of the research was completed under a research contract from the Institute for Defense Analyses, Alexandria, Virginia. Also, thanks to my friend Rebecea Doma for her patient editing and valuable suggestions. 'Ib my parents for their enduring and endearing guidance; to my sister, for being a fiiend; to dear Anita for caring about me and my work. Finally, I thank The Almighty for giving me the strength and courage to complete this work. iv Table of Contents List of Tables .................................................................................................................... viii List of Figures .................................................................................................................. ix 1 Introduction 2 1.1 Thesis Motivation ........................................................................................ 2 1.2 Research Approach ...................................................................................... 3 1.3 Thesis Organization ..................................................................................... 3 2 Background S 2.0 Introduction to Bar Code Information Theory ............................................ .5 2.1 Bar Code Types ..... - - - .......................................... 6 2.1.1 Universal Product Code - - -6 2.1.2 Code 39 ......................................................................................... 9 2.2 Bar Code Scanning _______ - _ - _ -- ...11 2.3 Summary .................................................................................................... 16 3 Related Work 17 3.1 TIaditional Rapid Prototyping ................................................................... 17 3.1.1 Quicktum Systems, Inc - -_ _ _ _- - 18 3.1.2 T'InMANN .......... - - ................................... 19 3.1.3 The AnyBoard System ................................................................ 20 3.1.4 Summary .................................................................................... .20 4 The Splash 2 Programming Environment 21 4.0 Programming the Xilinx FPGAs ................................................................ 21 4.1 Splash 2 Simulator .................................................................................... .21 4.1.1 The Interface_board Entity ........................................................ .23 4.1.2 The Splash_boards Entity .......................................................... .23 4.1.3 The Splash2__board Entity.. -- ............................. .24 4.2 Simulator Files ........................................................................................... 24 4.3 The Xilinx Control Part .............................................................................. 28 4.4 Interface Board ........................................................................................... 29 4.4.1 XLeft Entity ................................................................................. 29 4.4.2 XRight Entity .............................................................................. 31 4.5 The Xilinx Processing Part ........................................................................ 32 4.6 Xilinx Control Part Entity .......................................................................... 41 4.7 Crossbar ..................................................................................................... 45 4.8 Memory ...................................................................................................... 50 4.9 Summary .................................................................................................... 51 Bar Code Identification On Splash 2 - - - - ‘2 5.0 Bar Code Identification--- -- - -------------- -- - .----52 5.1 Texture Analysis ........................................................................................ .52 5.2 Multi-channel Filtering .............................................................................. 53 5.3 Unsupervised Segmentation Method ........................................................ .56 5.3.1 Implementation Using Serial Code ............................................. 56 5.3.2 Unsupervised Segmentation. ....................................................... 57 5.3.3 Rotation-invariant Features ......................................................... 5 8 5.3.4 Supervised Segmentation Method .............................................. 63 5.4 A Suggested Approach to Bar Code Identification on Splash 2 ................ 64 5.5 A Suggested Mapping of Simple Perceptron onto Splash 2 ...................... 70 5.6 Summary .................................................................................................... 72 Mapping an Application Onto Splash 2 -- 73 6.0 Design Implementation .............................................................................. 73 6.1 Run-time Environment. .............................................................................. 73 6.1.1 Splash 2 Execution - -- - -- -- - - --...74 6.1.1.1 Initializing the Splash 2 Environment ................................ 74 6.1.1.2 Preparing the Splash 2 Application .................................... 74 6.1.1.3 The Ungroup_all and Boundary_optimization Options ..... 75 6.2 General Compiling Methodology - - - -- - -------------- 76 6.2.1 Compiling \Vrthout Constraints .................................................. 76 6.2.2 Setting Design Constraints .......................................................... 76 6.3 VI-IDL_to_EDIF ............. - - - - .................... 77 6.4 EDIF_to_XNF Conversion ........................................................................ 77 6.4.1 Environment. ............................................................................... 77 vi 6.4.2 Execution .................................................................................... 79 6.5 Xilinx Placement and Routing ................................................................... 79 6.5.1 Environment ................................................................................ 79 6.5.2 Execution .................................................................................... 80 6.6 The Xilinx Synopsys Interface (XSI) ....................................................... 82 6.6.1 Compiling the Design ................................................................. 83 6.6.2 Interpreting the Results ............................................................... 84 6.7 The Execution Process- -- - ......... - ........ 85 6.8 T2 Debugger ............................................................................................. 85 6.9 T2 Debugger Commands ---------------------- ------- 87 6.10 Results ................... . .................................................................................... 88 6.10.1 PPR Summary Explanatory Notes--- ..... - -....93 6.10.2 Chip Pinout Description for the XBar Algorithm ....................... 96 6.11 Additional Results of Simulations ------ - =....98 6.12 Summary .................................................................................................. 102 7 Conclusions and Future Research 103 Appendix ......................................................................................................................... 106 Bibliography .................................... -- - - -- - 119 Vii List of Tables 1 Specification of the Universal Product Code .................................................. 8 2 Part of the Specification of Code 39 ............................................................. ll 3 Chip Pinout Description ............................................................................... 96 viii List of Figures 2.1 Derivation of the UPC Scan Code Words .......................................................... 7 2.2 Illustration of the Specification of the UPC Symbol ......................................... 8 2.3 Illustration of the Code 39 Character .............................................................. 11 2.4 Illustration of the Invariance of the Edge-to-Edge Distance under Ink Spread .............................................................................................................. 13 2.5 Encoding of the Binary String 100111000 by the Delta Code (top) and Width Code Giottom) ......................................................................................... 15 2.6 Delta Code Module Representation ................................................................. 16 4.1 The Splash 2 System Architecture ................................................................... 22 4.2 Splash 2 Simulator ........................................................................................... 23 4.3 Interface_Board Entity Declaration........................... ...................................... 25 4.4 (a) Interface Library Declaration .......................................................................... 26 4.4 (b) Interface Library Declaration .......................................................................... 26 4.5 Splash2_Board Entity Declaration .................................................................. 27 4.6 Splash2_Board Structure Declaration ............................................................. 27 4.7 Xilinx_Control_Part Declaration .................................................................... 28 4.8 The MTest Architecture ................................................................................... 29 4.9 The MTest Declaration ................................................................................... 34 4.10(a) File "Parts.vhd" Processing Element Declaration .......................................... 37 4.10(b) File "Parts.vhd" Left_to_Right Architecture Declaration ............................... 38 4.10 (c)File "Parts.vhd" Left_to_XBar Architecture Declaration ............................... 39 4.11 File "Parts.vhd” XBar_to_Right Architecture Declaration ............................. 40 4.12 File "Parts.vhd" Control_Part Architecture Declaration ................................. 44 4.13 File ”Parts.vhd” Blank Architecture Declaration ............................................. 45 5.1 An Overview of the Bar Code Localization Algorithm ................................... 55 5.2 Bar Code Image ............................................................................................... 60 5.3 Filter Images Generated from the Bar Code Image as shown in Figure 5 .2 ..... 60 5 .4 Feature Images Generated from the Filtered Images as shown in Figure 5.3..61 5.5 Mean Feature Values for the Feature Images ................................................... 62 5.6 (a) Steps involved in Identifying Bar Code Images on Splash 2 ............................ 66 5.6 (b) Steps involved in Identifying Bar Code Images on Splash 2 ............................ 67 5.7 Table Lookup Architecture Declaration for Storing Data in the Local PE RAMS - -- .......... - - ............................................. 68 5.8 Exponential Function Architecture Declaration .............................................. 69 5.9 Local RAM LUT for Weight Values ............................................................... 71 6.1 Mapping of an Application onto Splash 2 ....................................................... 83 6.2 Data Transfers from the Processing Element .................................................. 89 6.3 PE to Crossbar Declaration... -- ...... - ......... - - - .................... 90 6.4 Crossbar Processing Element XP_XBAR_SLAVE Architecture .................... 91 6.5 Results for the XP_XBAR_SLAVE Algorithm Obtained from PPR .............. 92 6.6 Left_to_XBar Architecture Declaration .......................................................... 99 6.7 Processing Element to Crossbar based on SIMD Tag Declaration ............... 100 Chapter 1 Introduction Mind has transformed the world, and the world is repaying it with interest. It has led man where he had no idea how to go. Paul Valery Field-Programmable Gate arrays (FPGAs) are a new type of user-programmable inte- grated circuits providing a fast, fairly inexpensive solution to custom Very Large Scale Integration (VLSI). An FPGA consists of an array of logic cells that can be interconnected via programmable routing switches, where the general purpose routing switches allow the user to configure FPGA based multiple level logic cells. FPGAS combine the features of Mask Programmable Gate Arrays (MPGAs) and Programmable Logic Devices (PLDs). The two-dimension array feature is adopted from the MPGAs while the user-programma- bility is adopted fi'om the PLDs. The research reported in this thesis is focused on a sug- gested implementation of image analysis technique, Bar Code analysis, on an FPGA- based computing machine. Multilevel FPGAS were introduced in 1985 by Xilinx, Inc. [1]. Since then, multilevel FPGAS have evolved considerably as new devices [5]. They have gained widespread application, owing to their relatively low costs and quick production. As an implementa- FPGAS offer some unique advantages over the alternate technologies (MPGAs, standard cells, and full custom design) [2, 3, 4, 5, 6]: - FPGAS provide a reduction in manufacturing cost of custom VLSI from tens of thou- sands of dollars to approximately one hundred dollars. - FPGAS reduce the turnaround time of manufacturing from months to several minutes. These features can be attributed to the user-programmability of FPGAS. FPGA-based cus- tom computing machines provide the designers with a hardware-software based design platform, allowing multiple design iterations to be done both rapidly and cost-effectively. However, user-programmability does have its limitations: the logic density and the speed performance of an FPGA circuit are considerably lower than that of its counterparts [5]. Although developments over the last decade have resulted in a significant improvement in FPGAS, there are numerous opportunities for innovation before the best FPGA designs will evolve. 1.1 Thesis Motivation An FPGA-based custom computing machine allows the user to implement circuits by interconnecting its logic cells through the user-programmable logic or routing switches. The routing switches serve two distinct purposes: to connect the logic cells to the routing wires, and to connect one routing wire to another. Regardless of their implementation, routing switches consume a significant chip area and thus they involve a tradeoff among device flexibility, logic density, and speed performance. A high flexibilty yields a machine that can be configured easily, but if the flexibility is too high then there is an enormous waste of logic area because of unused switches. The primary focus of this thesis is the study of an implementation of bar code identifica- tion algorithm on a hardware-software co—designed, custom computing machine. The goal of this study is to determine the implementation of the algorithm and further, to determine the minimal flexibility that is necessary to provide sufficient interconnection to satisfy the requirements of real circuits. This research is part of a project sponsored by the Supercom- puting Research Center, Bowie, Maryland [7-10] that examines many aspects of imple- mentation of Image Analysis algorithms on an FPGA-based custom computing machine Splash 2 [39]. 1.2 Research Approach A digital document image usually consists of a mixture of text, graphics, and half-tone patterns. These regions of varied textures [13, 15] should be separable so as to allow sub- sequent processing of the region of interest : the bar code in this case. Several algorithms have been proposed for segmenting images in the literature [14-17]. The segmentation method applied to the bar codes as part of our research does not assume any a priori knowledge of the layout or the font styles of the document, and can be applied to input gray level scanned images. The segmentation method is based on the multi-channel gabor filtering technique which has been successfully used for texture segmentation [15, 16]. The areas containing the regions of interest are captured using gabor filters. The filters are applied directly to the input scanned images requiring no additional thresholding. 1.3 Thesis Organization This thesis is organized as follows: 0 Chapter 2 provides background information: a discussion of the definitions of bar codes and their design aspects, including the UPC Bar Code and Code 39. 0 Chapter 3 presents an overview of the background related research that has led to the development of custom computing machines. 0 Chapter 4 describes the Splash 2 system.The system architecture and the component building blocks are discussed. - Chapter 5 describes the Splash 2 programming environment in terms of the simulator and the files necessary to emulate designs based on the applications of interest. The sug- gested approach and the VHDL codes for this approach are also discussed. 0 Chapter 6 presents the bar code localization algorithm in addition to an overview of the Gabor filtering technique applied to texture segmentation. In additon, Chapter 6 pro- vides the details of several algorithms related to the application. These algorithms pro- vide an excellent groundwork essential for the implementation of the bar code localization algorithm and their mapping onto the Splash 2 system. The preliminary results of these algorithms are presented in this chapter. ° Chapter 7 summarizes the thesis by providing conclusions and recommendations for future research. Chapter 2 Bar Code Information Theory The eye sees only what the mind is prepared to comprehend Robertson Davies 2.0 Introduction to Bar Code Information Theory Bar codes were introduced almost 20 years ago. They are now being widely used in super- markets, department stores, factory floors, the health industry, insurance industry, and more. Early codes, in particular UPC (Universal Product Code), evolved from careful studies. These studies were done under the technological constraints of 20 years ago. The continu- ing improvements in computing hardware have resulted in a feasible use of powerful pro- cessors that would have been prohibitively expensive to use in the past. In recent years, this process has been applied to enhance the density of the information that is represented. This has created a portable data file, implemented using convenient bar code symbols as opposed to the license plate file [31]. The difference in the two types of implementations can be illustrated using the following example. The bar code of the supermarket item con- sists of 11 digits which represents the product identifying number but provides no infor- mation about the description. The price look-up must be accessed in a database keyed to the number in the bar code. This provides an alternative to a longer bar code with the rele- vant information, such as the price, name of the product, manufacturer, weight, inventory data, expiration date, etc. Thus providing a portable data file of the product wherein the relevant information can be retrieved without accessing the database. Pricing information, although useful and necessary for the retail stores, is not very useful information for the receiving and distribution centers. Thus, some inefficiencies are introduced into them deliberately to facilitate their robust reading for a variety of bar codes. 2.1 Bar Code Types 2.1.1 Universal Product Code (UPC) UPC is the symbology used in American supermarkets for more than 15 years. Symbology, as used in the bar code industry, represents the bar code scheme, while the symbol repre- sents the bar code itself. The code that has been described in this section is the version A, where each symbol encodes 12 digits. However, all versions of the UPC bar code use the same basic structure for the code words. Each UPC code word consists of two bars and two spaces with a total width of seven modules as shown in Figure 2.1. The seven mod- ules (a bit 0 or 1 as shown in the top line) have six boundaries labeled as 1...6 as shown in Figure 2.1. A partition of the modules into a set of two bars and two spaces requires three internal boundaries that can be placed in each of the six locations. The number of possible code words can be computed as follows: Two bars and two spaces require three dividers in one of the six positions (between modules) as shown in the Figure 2.2. Therefore, the total number of partitions will be the same as the number of ways to choose three objects out of six, or 20. Thus 40 code words can be constructed depending on whether the first element is a baror a space. However, code words that are mirror images of each other cannot be included owing to the need for bi-directional scanning.This results in 20 possible code words. While symbols should be readable by scanning in either direction, the direction of scanning is also important. UPC achieves this by assigning two code words for each digit: those used on the right part of the symbol start with a bar and those used on the left part start with a space. Figure 2.1. Derivation of the UPC Scan Code Words. The two versions can be distinguished easily because the ones starting with the bar have an even number of bar modules (even parity) and those starting with a space have an odd number of bar modules (odd parity). Table 1 shows the arrangement of the UPC code specification. Table 1: Specification of the Universal Product Code. Left half Right half Width (odd) (even) pattern 0 0001101 1110010 3, 2,1,1 1 0011001 1100110 2, 2, 2, 1 2 0010011 1101100 2,1,2,2 3 0111101 1000010 1,4,1,1 4 0100011 1011100 1,1, 3, 2 5 0110001 1001110 1, 2, 3, 1 6 0101111 1010000 1,1,1,4 7 0111011 1000100 1,3,1,2 8 0110111 1001000 1,2, 1,3 9 0001011 1110100 3,1,1, 2 Left guard bar (101) Center guard bar (01010) Right guard bar(101) Left five characters Right five characters of code of code I l i 45 67890 Figure 2.2. Illustration of the Specification of the UPC Symbol. (The diagram is not drawn to scale) The symbol shown in Figure 2.2 contains the following group of code words: 0 a left guard pattern 101 0 six digits of odd parity: - one digit denoting the industry type (such as a 0 for grocery, 3 for pharmaceutical, etc.) - five digits labeled as the Lefi five characters, denotethe manufacturer code 0 a center guard pattern 01010 consisting of five modules 0 six digits of even parity: - five digits labeled as the Right five characters, denote the item code - one check digit 0 a right guard pattern 101 2.1.2, Code 39 (three of nine) Code 39 is a width code with three wide elements out of a total of nine: five bars and four spaces between them. It has been the standard for the Department of Defense since 1980 [31]. The total number of code words that it can generate is the number of ways three items can be chosen out of nine. It has a code for the 10 digits, the 26 letters, and eight special symbols (hyphen, period, space, asterisk (*), $, /, +, and%) so that a total of only 44 code words are used. The asterisk is used only as the first and the last code word of a symbol. Table 1 shows some of the codes. A 1 means a wide element and a 0 means a narrow ele- ment. If an width of an element or bit is altered then an edge shift (the most common read— ing error) occurs. This shift, will change both, the bar and the space patterns. For example, 10 10001,0100 rrright become 11001,0000. The latter does not correspond to any Code 39 pattern. The patterns for both the bars and the spaces have been chosen in such a way that changing a single bit in either of them results in an illegal word. Spaces have only an odd number of wide elements and the bars have only an even number of elements. This allows for the detection of the single errors. For this reason code 39 is called self-locking for this 1' 638011. The number of bar patterns with two wide elements is 10 (the number of ways to choose two out of five), and there are four space patterns with one wide element. This yields 40 code words. Four additional code words are obtained by using the four patterns with three wide spaces and only narrow bars. In contrast to UPC, Code 39 has no specifications for the arrangement of code words on the symbol except that the asterisk must be the first and the last character and can appear only there. The special start-and-stop character, the asterisk makes it possible to determine the direction of scanning. As shown in Table 2, Code 39 contains code words that are rrrirror images of each other, such as the pairs (P,*) and (K, U). Therefore, the direction cannot be decided locally but must be determined from the start-and-stop code word. If the first code word is decoded as P, then we conclude that we are looking at the last code word of a symbol. The major drawback of Code 39 is that it does not allow local detection of the scan direction and can- not be decoded, even partially, without at least one of the start-and—stop code words. 11 Table 2: Part of the Specification of Code 39 [31]. Characters Bars Spaces 1 10001 0100 2 01001 0100 A 01001 0010 K 10001 0001 P 01100 0001 U 10001 1000 Z 01100 1000 * 00110 1000 $ 00000 1110 10001 0100 Figure 2.3. Illustration of the Code 39 Character. 2.2 Bar Code Scanning Bar codes encode information along one dimension with intervals of alternating black and white color. The arrangement of bars in an alternate fashion helps to diffuse reflectivity. The intervals are stored as rectangles whose vertical height carries no information but facilitates the scanning process. The term bars denotes the rectangles having the fore- ground (darker) color, while the term spaces denotes the intervals having the background 12 color between the bars. In general, the bars absorb light and the spaces reflect light, which is detected by the sensor. On soda cans, the bars have a metallic color that reflects more light than the background. However, the reflection from the bars is specular and directed away from the sensor. The spaces have a dull color with the diffuse reflection, part of which is always captured by the sensor. Similarly, on transparent bottles the bars are not colored; so light goes through them rather than reflecting to the sensor. Laser scanners in the supermarkets use essentially orthogonal patterns. The height of each half of the symbol is expected to exceed its width so that one of the two beams is guaran- teed to scan at least the symbol half. (It is a property of a block with height greater than width that an x pattern is guaranteed to read it by at least one line of the x in any orienta- tion of the symbol). The symbol can be assembled afterwards because each part is identi- fied by the parity pattern. UPC allows “scanning by halves” because the left and the right parts of the code are inherently identifiable from the local properties of each code word. One major source of distortions of bar code is the uniform ink spread, which causes bars to be comparatively wider than spaces. For this reason bar code readers measure not the width of the edge but the edge-to-edge distances (called t-distances) of successive bars as shown in Figure 2.4. 13 I I l I I I | Printing without I I ink spread I— 4—I Figure 2.4. Illustration of the Invariance of the Edge-to-Edge Distance under Ink Spread. Printing with ink spread There are two fundamentals ways of encoding binary information in such a one-dimen- h sional medium. In one we subdivide the available material into modules. Each of the mod- ules is assigned a ‘1’ or ‘0’ with the 0’s corresponding to the spaces. A single space or bar may contain many modules. Such schemes are called delta codes in communication the- ory. In another scheme of representation, bars or spaces alternate per bit, where an element is wide if the bit is 1 and narrow if the bit is 0. These codes are referred to as width codes. The common name used in the barcode industry is binary codes since only two widths are used. 14 The Figure 2.5 shows the two encodings of the binary string 100111000. In both cases, the minimum printed width (i.e., the width of a module or element) is the same. The delta code requires nine such widths (the number of bits), while the width code requires thirteen such widths if a wide element is twice as wide as a narrow element. Scanning speed is extremely and unpredictably variable in hand—held wands and laser scanners. This problem imposes the constraint that printed codes be self-clocking, which means that both the number of modules and the number of bars and spaces per code word must be fixed. The delta codes must have a fixed number of bars and spaces. A code with n modules and k pairs of bars and spaces is called an (n, k) code. Mdth codes are required to have not only a fixed number of bars and spaces but also a fixed number of wide elements to achieve a fixed width. Among the popular bar codes, the UPC is a (7,2) delta code with seven modules, two bars and two spaces). Code 39 is a width code with nine elements (five bars and four spaces), three of which are wide. 15 I - Module boundary .I/ |1|0|0|1|1l1|0l0| 1001 1 1000 Figure 2.5. Encoding of the Binary String 100111000 by a Delta Code (top) and Width Code (bottom). The self-clocking constraint for the codes is more stringent than that for optical and mag- netic recording, which share many of the features of bar codes. In such systems, informa- tion is encoded under the run-length-limited (RLL) constraint. This constraint is specified by a pair of constraints (d, k), where 0 _<. d < k , which means that consecutive 1’s are separated by at least d, but no more than k, 0’s. The constant d is used to control the inter- symbol interference effects. The constant k is used to provide self-clocking through a phase-locked loop (PLL) with windowing. This weaker constraint is possible because optical and magnetic recording systems are closed systems where the distance between the scanner and the symbol as well as the size of the symbol are fixed and known. Neither is known for bar codes, which are classified as open systems. 16 Besides self-clocking, a means to detect the direction of the scan is necessary. Practical requirements may demand that bar codes be scanned in either direction. (Many scanning devices, including those used in retailing, have repetitive scans in alternative directions.) There are two possible of solutions to this problem. One solution is to include a unique start/stop code word on each symbol. The other is to use only one of two code words if they are mirror images of each other. For example, three modules of the delta code with only one bar and one space yields the combinations shown in Figrne 2.6. In addition, the Figure 2.6 shows that the last two code words are mirror images of the first two and cannot be distinguished in case the scanning direction is not known. This results in taking only half of all possible code words. UPC uses only code words whose mirror image is not in the code, while the Code 39 uses a unique start/stop code word. II II_II_II -_II III Figure 2.6. Delta Code Module Representation. 2.3 Summary This chapter has provided an overview of bar code theory and the most commonly used codes. As new applications have deve10ped, there has been a demand for new types of codes. The UPC bar code versionA and the code 39 have been described in detail in this chapter. The UPC bar code, version A, has been used as an example to identify bar code images. Chapter 3 Related Work Intuition becomes increasingly valuable in the new information society precisely because there is so much data. John Naisbitt 3.1 'Itaditional Rapid Prototyping Over the past decade, prototyping of application specific designs has gained significant importance owing to several reasons. Complex projects cannot start with a complete set of final specifications at the beginning of the product’s life cycle. Furthermore, the product development cycle cannot proceed until the current version reaches completion and is evaluated. Hardware prototyping has been greatly influenced by software tools in recent years. Com- puter-aided design (CAD) tools have emerged, allowing the designer to integrate testabil- ity into designs ensuring proper operation of the final product. For example, the Test Engineer’s Assistant (TEA) software tool is one such prototyping tool which enhances the designer’s ability to design a robust system prototype by the use of test vectors [38]. Thus, the integration of hardware and software, also termed as hardware-software codesign, is aimed towards reducing the final system developmental costs while ensuring its reliability. TEA analyzes a design and indicates aspects that tend to inhibit the testability of the prod- uct. Several alternate approaches are suggested to provide for greater flexibility in the test and fault isolation. l7 18 3.1.1 Quickturn Systems, Inc. While Application Specific Integrated Circuits (ASICs) have become an important tech- nique of design implementation, one of their shortcomings has been the inability to test the designs at actual speeds before the product fabrication. Quickturn Systems, Inc. has developed a hardware-software codesign tool capable of performing the functions of the final desired product through the use of FPGAS [29]. Although similar designs can be implemented using other conventional logic devices (PALS and PLDs), designs have become so complex that such an implementation grows into an entirely separate design project. Often, such a task takes longer to complete than the actual production of the IC itself. Thus FPGAS provide a logical solution to the prototype design with the advantages of allowing for high density logic designs along with rapid reconfigurability for testing and modifications of the design. Quickturn systerrrs ease the implementation of prototypes by automating the FPGA proto- typing procedure. Normally, the designer must translate the design into a format accept- able to the FPGAS and then build the board supporting the design. Often the design must be partitioned into several FPGAS implying that proper commmunication amongst the FPGAS exists. Quickturn systems accept files from the ASIC suppliers called netlist files, and translates the descriptions into a FPGA compatible description. A process termed as Emulation Synthesis automatically connects the FPGAS [29]. Once the design is converted into an FPGA format, it is downloaded onto a generalized bank of FPGAS, thus complet- ing the prototyping phase. 19 3.1.2 TInMANN TInMANN, a massively-parallel, stochastic architecture for the Kohonen Neural Net- works. Previously, uniprocessor systems were used for neural network simulation pur- poses. In order to exploit the massive parallelism in the neural network algorithms, specialized hardware, using a high -speed arithmetic logic units (ALU) [39] or multiple digital signal processing (DSP) chips connected through communication busses [39], was implemented. These approaches could not realize the full benefits of the parallelism or did not approach the minimal requirements of the hardware implementation. Modern VLSI technology overcomes these drawbacks by providing a highly connected, dense logic on a single chip. However, large and complex VLSI designs require to simu- late, create a test plan, and fabricate a chip in a short time duration. This is possible by using a rapid prototyping system: field programmable gate arrays (FPGAS) provide the solution. The FPGAS allow the user to incorporate theoretical advances while the designs remain on the cutting edge of VLSI technology. With these concepts and ideas a custom computing machine using Xilinx FPGAS, TInMaNN was conceived [38]. An IBM-PC host was used to interface and configure the array board as well as provide the software design tools to implement the neural networks on FPGAS. Along with FPGAS Static Random Access Memories (SRAMs) were provided to allow for the storage of weight values. The FPGAS cormnunicate amongst themselves through local busses which provide communication between adjacent chips, or global busses allowing the signals to be broadcast globally to 20 the other array chips. While such a system of FPGAS cannot operate as fast as an ASIC, it does provide rapid realization of new theoretical knowledge at very low costs. A customized VLSI chip (ASIC) can be easily produced once the FPGA prototype has been evaluated [28]. 3.1.3 The AnyBoard System The Anyboard array card is based on Xilinx FPGAS and has evolved as a sucessor to the TInMANN sytem [38]. The AnyBoard consists of a field of programmable gate arrays (FPGAS), that can be modified by software. The FPGAS have both local busses between FPGAS and access to a global bus for interfacing the external hardware. Each FPGA has a local memory that can be mapped to the AnyBoard. A pre-programmed FPGA and several other buffer chips control the PC bus interface for the AnyBoard. The Anyboard array board resides in the IBM-PC host connected to a local network. Test vectors generated by the IBM—PC act as inputs to the board for diagnostics. 3.1.4 Summary This chapter provides an overview of several custom computing machines that have been developed as reconfigurable architectures to implement and verify designs. Chapter 4 The Splash 2 Programming Environment There is but one art, to omit Robert Louis Stephenson 4.0 Programming the Xilinx FPGAS The field programmable gate arrays (FPGAS) of the Splash 2 system, shown as His X0...Xl6 in Figure 4.1, are programmed in the VHSIC Hardware Description Language (VHDL). VHDL [34] is a hardware description language and a simulation language. In addition, VHDL behaves as an input language for logic synthesis. 4.1 Splash 2 Simulator The Splash 2 simulator [2, 4] is a collection of VHDL components that, together with the user applieation code, models the Splash 2 system behavior. The simulator structure as shown in Figure 4.2 is based on the Splash 2 architecture structure. The top level entity is the Splash_System, which contains the components for the Sparcstation Host, the Interface_Board, and the collection of Splash2_Boards. The Splash_System entity defines all of the top level signals, such as the host SBus and the Splash 2 SIMD Bus. 21 22 Splash Boards X1X2X3X4X5X6 Crossbar X16 X15 X14 X13 X12 X11 X10 Interface Board X1X2X3X4X5X6 XL Crossbar X16 X15 X14 X13 X12 X11 .0. 1"" I § u ”.9. ) ; .___. X1 X3 X4 X5 X6 X7 X8 I I I I I I l I _. xo Crossbar I I L I I I I xrs x15 X14 x13 x12 X11 x10 xs Figure 4.1. The Splash 2 System Architecture. 23 Sparcstation Host Interface Interface Board Splash 2 Array Board Figure 4.2. Splash 2 Simulator. 4.1.1 The Interface_Board Entity The interface_board entity contains components for the input and output FIFOs and the XLeft and XRight ICs, and is responsible for generating the system clock. The FIFO com- ponents read and write ASCII data files. 4.1.2 The Splash_Boards Entity The Splash_boards entity contains one or more component instantiations for the Splash 2 array boards. This entity declares the set of signals used to interconnect the Xilinx compo- 24 nents of each of the array boards and also the previous/next interboard connections. The slot ID for each board is declared at this level and is then passed on to each of the array boards. 4.1.3 The Splash2_Board Entity The Splash2_board entity contains component instantiations for the programmable Xilinx components, their associated RAM memories, and the crossbar switch. This entity also performs a variety of control functions, including selecting the source of the left input to X1, generating the memory write enable pulse, and checking the timing restrictions on memory accesses (e. g. write after a read constraint). 4.2 Simulator Files The system file, system. vhd contains the architecture and the entity for the top level Splash_System component. The Splash 2 simulator requires that this file be present in the current working directory, however, there is nothing within the file that the user needs to modify. The Splash 2 simulator (see Figure 4.2) reads a configuration file by the file, config.vhd. This section describes the components of the Splash 2 simulator configuration. This con- figuration file must identfy the libraries of the Splash 2_Board which contains the default architectures of the Splash 2 array board parts, and the Interface, which contains the 25 XLeft and XRight architectures. The first part of the file configures the interface board, in which the component identifier within the Splash_System is IFACE, the entity name is Interface_Board and the archi- tecture name is Structure as shown in Figure 4.3. for [F ace: Interface_board use entity Interface: Interface_board( Structure ) Figure 4.3. Interface_Board Entity Declaration. The generic parameters of the Interface_Board entity are: Input_Filel: A data string representing the name of the file to be read by the input FIFO 1. The default value string is a null string. Input_File2: A data string representing the name of the file to be read by the input FIFO 2. The default value is a null string. ,13_Filel: A data string representing the name of the file to be read by the external input. The default value is a null string. Output_Filel: A data string representing the name of the file to be written by the output FIFO 1. The default value is a null string. Output File 2: A data string representing the name of the file to be written by the output FIFO 2. The default value is a null string. File_Type: Format of the output files, i.e., either BIN/ HEX. The default value is BIN. Clock_Freq: The frequency of the system clock. The default value is 20 MHz. 26 The structural components of the interface board includes the XLeft and XRight FPGA parts. The entity names are also XLeft and XRight; the default architectures of both parts are named Valid and are found in the Interface library declaration in Figure 4.4 (a). for Structure for all: XLeft use entity Interface.XLeft( Valid); end for; for all: XRight use entity Interface.XRight( Valid ),' end for; end for; Figure 4.4 (a). Interface Library Declaration. The component identifier for the Splash 2 array board component within the Splash2_System is named as SPLASH, the entity name is Splash2_Boards and the archi- tecture name is Structure as shown in Figure 4.4 (b). for Splash: Splash2_Boards use entity S2b0ard. Splash2_Boards( Structure ) Figure 4.4 (b). Interface Library Declaration. The only generic parameter of the Splash2_Board is Number_of_Boards. This parameter defines the total number of Splash 2 boards in the system. The default value is 1. 27 The array of Splash boards is generated with the identifier SBOARDS. Within the gener- ate statement, the Splash2_Board component is instantiated as BD. The architecture of the Splash2_Board as shown in Figure 4.5, may be a user defined behavioral model from the current working library: for Sboards for Ed: Splash2_Board use entity work. Splash2_Board(Behavior); end for; Figure 4.5. Splash2_Board Entity Declaration. or the default structural model which instantiates individual FPGA components (architec- ture Structure in library SZBoard) as shown in Figure 4.6: for Sboards for Ed: Splash2_Board use entity SZBoard. Splash2_Board( Structure ); Figure 4.6. Splash2_Board Structure Declaration. The Structure architecture of Splash2_Board instantiates the following components: Splash_Crossbar a model of the central crossbar ControLPart the FPGA control part, X0 Processing_Part the FPGA processing parts, X1 through X16 Memory_Part the external memories associated with X0 - X16 28 The Splash_Crossbar component has a single generic parameter: Config_File: A data string specifying the name of the file containing up to eight crossbar configurations. The default value is a null string, indicating that the crossbar is disabled. The Memory_Part component has a single generic parameter: Load_File: A string specifying the name of the file containing directives for pre-loading the memory contents. The default value is a null string, indicating that the ini- tial value of the memory contents are unknown. 4.3 The Xilinx Control Part The Xilinx_Control_Part component (X0) is configured as shown in Figure 4.7. for all: Control _part use entity Board. Control_Part( Blank); end for; Figure 4.7. Xilinx_Control_Part Declaration. The Xilinx_Procsssing_Part components and their associated memories are instantiated with a VHDL generate statement with the identifier XPARTS. In the following example all the 16 processing elements are configured using the “MTest” architecture such that: 29 for Xparts(1 to 16) for all: Xilinx_Processing_part use entity work.X ilinx_Pr0cessing_Part( M Test); end for; for all: Memory_Part use entity SZBoard.Memory_Part( Dynamic) generic map (Load_File => “test.mem”); end for; Figure 4.8. The MTest Architecture. 4.4 Interface Board This section describes the interface board and its programmable components. 4.4.1 XLeft Entity A detailed summary of each of the ports of the XLeft entity is given below. XLeft_FIFO_IN : The 36-bit bi-directional DataPath from the input FIFOs. XLeft_FIFO_Sel : A bi-directional signal that selects either of the two input FIFOs to read from. When low (logic ‘0’) FIFO #0 is selected, when high (logic’ 1’) FIFO #1 is selected. XLeft_FIFO_EnabIe_L : A bi-directional signal that is an active low signal which enables either FIFO#1 or FIFO#2 (depending on the value of XLeft_FIFO_Sel) to drive the XLeft_FIFO_IN data bus. XLeft_FIFO3_OE : An active low signal (bi-directional) which enables the external input FIFO to drive the XLeft_FIFO_lN data bus. XLeft_FIFO3_OE and XLeft_FIFO_Enable_L must not be asserted simultaneously. 3O XLeft_FIFOl_RE : A bi-directional signal is the read signal of the input FIFO #1. When asserted (logic ‘1’) and the FIFO has valid data (XLeft_FIFOl_EF is ‘0’), data are clocked out of FIFO #2. This signal is an output from XLeft. XLeft_FIF02_RE : A bi-directional signal is the read enable signal of the input FIFO #2. When asserted (logic ‘1’) and the FIFO has valid data (XLeft_FIFO2_EF is ‘0’), data are clocked out of FIFO #2. This signal is an output from XLeft. XLeft__FIF03_RE : A bi-directional signal is the read enable signal to the external input. When asserted (logic ‘1’) and the external input has valid data (XLeft_FIFO3_EF is ‘0’), data are clocked out of FIFO #3. The signal is an output from XLeft. XLeft_FIFOl_EF : A bi-directional signal is the empty flag from the FIFO #1. When low (logic ‘0’), it indicates the FIFO has valid data available to be clocked. This signal is an input to XLeft. XLeft_FIF02_EF : A bi-directional signal is the empty flag from the FIFO #2. When low (logic ‘0’), it indicates the FIFO has valid data available to be clocked. This signal is an input to XLeft. XLeft_FIF03_EF : A bi-directional signal is the empty flag from the FIFO #3. When low (logic ‘0’), it indicates the FIFO has valid data available to be clocked. This signal is an input to XLeft. XLeft_BUF_Clk : An input bit signal that represents the free running version of the sys- tem clock and can be used to control the state machine which starts 31 and stops the system clock. XLeft_Enable_Clk : An out bit signal which may be used to start and stop the system clock. When XLeft_Enable_Clk is high (logic ‘1’), the system clock is enabled; when low (logic ‘0’), the system clock is dis- abled. XLeft_SYS_CLK: An in bit that is a wire-ORed global OR result signal from the Splash array board. XLeft_GOR_Result : An input bit that is a wire-ORed global OR result signal from the Splash array board. XLeft_To_XRight : A bi-directional datapath that is a 36-bit wide to chip XRight. XLeft_SIMD : A bi-directional datapath that is a 36—bit wide output to the global SIMD Bus. XLeft_Int : A bi-directional signal that is an interrupt to the host. 4.4.2 XRight Entity A detailed summary of the XRight Entity is given below. XRight_FIFO_IN : A bi-directional datapath that is 36-bit wide data to the output FIFOs. XRight_FIFOl_WE : A bi-directional data path that is a write enable output to the out- put FIFO #1. When high (logic ‘1’) and the XRight FIFO#1 is low, data are clocked from XRight_FIFO_IN into FIFO #1. XRight_FIF02_WE : A bi-directional data path that is a write enable output to the out- put FIFO #2. When high (logic ‘1’) and the XRight FIFO#2 is 32 low, data are clocked from XRight_FIFOJN into FIFO #2. XRight_FIFOl_FF : A flag input from the output FIFO #1. When high (logic ‘1’) and the XRight FIFO] is low, data are clocked from XRight_FIFO_IN into FIFO #1. XRight_FIF02_FF : A flag input from the output FIFO #2. When high (logic ‘1’) and the XRight FIF02 is low, data are clocked from XRight_FIFOJN into FIFO #2. XRight_SYS_Clk : The system clock for the splash 2 system. XRight_To_XLeft : A bi-directional datapath is a 36-bit wide bi-directional data path to the chip XLeft. XRight_R : A 36-bit wide bi-directional data path to the chip XLeft. XRight_Readir : An output bit that controls the direction of the RBus. When the signal is low, the data are driven from the last Splash 2 array board onto the RBus. When the signal is high, data on the XRight_R will be driven onto the RBus. XRight_Int: A bi-directional signal is an interrupt to the host. 4.5 The Xilinx Processing Part Xilinx_Processing_Part is the name of the basic processing element of the Splash 2 sys- tem, Xilinx processing elements X1 through X16 as shown in Figure 4.1. Two generic parameters are provided to allow the user to customize a processing element according to its location in the system. During simulation the Splash 2 simulator automatically assigns the values to these generics, but when computing for synthesis the user is responsible for 33 specifying the range of values. The Xilinx_Processing_Part generics are: BD_ID : An integer used to specify the board position within the system (0 to 12). The ini- tial value of this is specified to 0. PE_ID : An integer specifying the processing element number (1 to 16) within the board. The initial value of this is specified to 0. A detailed summary of each of the ports of the Xilinx_Processing_Part is given below. XP_Left : A 36-bit wide bi-directional data path to the left (lower numbered) neighbor chip. XP_Left of chip X1 is either connected to the SIMD Bus if it’s on board number 1, or to the previous board if it is on board 2 or greater. XP_Left and XP_Right together form the links in the linear data path. XP_Right : A 36—bit wide bi-directional datapath to the right (higher numbered process- ing element) neighbor chip. XP_Right of the processing element X16 is connected to the R-Bus if the board number matches the value of the Size bus. In case there is another Splash 2 board as part of the configuration, XP_Right of the processing element X16 is connected to XI of the next board. XP_XBar : A 36-bit wide bi-directional datapath between the Xilinx PE and the central crossbar. XP_XBar_Enable_L : An output Bit_Vector (4 downto 0), is a collection of the output enables of the five crossbar ports. When the enable bit is low (Logic ‘0’), the crossbar drives the corresponding byt e of the 34 data path. When the bit is high (Logic ‘1’), the crossbar may receive data on the corresponding byte. The correspondence between the enable bits and the data path is as follows: XP_XBar_Enable_L(0) XP_XBar (7 downto 0) XP_XBar_Enable_L(1) XP_XBar (15 downto 7) XP_XBar_Enable_L( 2 ) XP_XBar (23 downto I6) XP_XBar_EnabIe__L( 3 ) XP_XBar (31 downto 24) XP_XBar_Enable_L( 4 ) XP_XBar (35 downto 32) Figure 4.9. The MTest Declaration. XP_Clk : The global Splash 2 system clock’s duty cycle is not guaranteed, so all the application programs should only be implemented on the rising edge. XP_Int : An input bit that is an interrupt output signal. The interrupt signals from all the seventeen Xilinx PEs are combined with a 17-bit mask value to form the board interrupt signal to the Interface Board. XP_Memory_Addrsss : A bi-directional 18-bit address to the external memory. XP_Memory_Data : A bi-directional data bus to and from the external memory. XP_Memory_Read_L : A bi-directional bit is an external memory read signal. It is A asserted (active low) by the Xilinx PE synchronous with the address to initiate a memory read operation. Valid data are available on thr clock edge following the assertion ‘of XP_Memory_Read_L. XP_Memory_Write_L : A bi-directional bit is the external memory write signal. It is asserted (active low) by the Xilinx PE synchronous with the data and address to perform a memory write operation. 35 XP_Memory_Disable : An input bit is a system level input to the Xilinx PE, which when asserted, causes the Xilinx chip to tristate all of its 110 pads. This port should not be used by the application programs XP_Broadcast : An input bit that is a common input signal to all the Xilinx Processing Parts. It is driven by the control PE X0. XP_Reset : An input bit that is a system signal to all of the Xilinx PE which, when asserted, causes the Xilinx chip to reset the state of all the internal flip flops. This port should not be used by the application programs. XP_HSO, XP_HSl : The bi-directional signals that are signals of two handshake regis- ters. Signal XP_HSl is a common input to all 17 Xilinx PEs, while XP_HSO is an independent bi-directional signal to each chip. XP_GOR_Result : A bi-directional signal connected to chip X0 conventioanally used to perform a global OR among all of the Xilinx PEs. XP_GOR_Valid : A bi-directional signal connected to the control PE X0. It is conven- tionally used to perform a “global OR” among all of the Xilinx PEs. XP_LED : An output bit that is an active high. An output which controls the correspond- ing LED on the array board. There are a number of pre-defined architectures of the Xilinx_Processing_Part available in the Splash 2 library, including a null design for the applications that use less than all 16 PBS. The set of the pre-defined architectures are as follows: Blank : This architecture is completely passive, and should be inserted for Xilinx PEs not 36 being used in the application. Left_to_Right : This architecture moves the 35 bits of data from the left side (XP_Left) to the right side (XP_Right) with a two cycle pipeline delay. All the other ports are passive. Left_to_XBar : This architecture moves the 36-bit data from the left side (XP_Left) to the crossbar (XP_XBar) with a two cycle pipeline delay. All the other ports are passive. XBar_to_Right : This architecture moves the 36-bit data fiorn the crossbar (XP_XBar) to the right side (XP_Right) with a two cycle pipeline delay. All the other ports are passive. 37 The architectures for the above mentioned routines and the processing elements X0...X16 are shown in Figures 4.10 (a), (b), and (c). --Architecture Blank is a completely passive Xilinx part architecture Blank of X ilinx_Processing_Part is begin XP_Left <= TriState (XP_Left); XP_Right <= TriState (XP_Right); XP_Xbar <= TriState (XP_Xbar); XP_Memory_Address < = TriState (XP_Memory_Address); XP_Memory_Data <= TriState (XP_Memory_Data); XP_Memory_Read < = '0'; XP_Memory_Write < = '0'; XP_HSI <= 'Z'; XP_HSZ <= ‘Z'; XP_GOR_Result <= 'Z'; XP_GOR_Valid <= 'Z'; XP_Int <= '0’; XP_Xbar_Enable_L <= "11111"; end Blank; Figure 4.10 (a). File “Parts.vhd” Processing Element Declaration. 38 —- Architecture Left_to_Right feeds all 36 bits of data -- from the Left port: to the Right port.Data are latched -- in the Left port on the first cycle, and latched inthe -- Right port the next cycle, for a total of 2 cycles -- delay. architecture Left_to_Right of X ilinx_Pr0cessing_Part is signal PassThru: Bit_Vect0r(DATAPATH_WIDTH-I downto 0); signal Left: Bit__Vect0r( DATAPATH_WIDTH-1 downto 0); signal Right: Bit_Vect0r(DATAPATH_WIDTH-I downto 0); begin Pad_Input (XP_Left, Left); Pad__0utput (XP_Right, Right); process begin wait until XP_Clk'Event and XP_Clk = '1'; Right <= Passlhru after 2 ns; PassThru <= Left after 2 ns; end process; XP_Xbar <= TriState (XP_Xbar); XP_Memory_Address < = TriState (XP_Memory_Address); XP_Memory_Data < =TriState (XP_Memory_Data); XP_Memory_Read <= ’0'; XP_Memory_Write < = '0’; XP_HSI <= 'Z'; XP_HSZ <= 'Z'; XP_GOR_Result < = 'Z'; XP_GOR__Valid < = 'Z'; XP_Int <= '0'; XP_Xbar_Enable_L <= "11111"; end Left_to_Right; Figure 4.10 (b). File “Parts.vhd” Left_to_Right Architecture Declaration. 39 —-— Architecture Left_to_XBar feeds all 36 bits of data -— from the Left port to the Crossbar port.Data are -- latched in the Left port on the first cycle,and -- latched in the Crossbar port the next cycle, for a -- total of 2 cycles delay. architecture Left_to_XBar of X ilinx_Processing_Part is signal PassThru: Bit_VectorfDATAPATH_WIDTH-1 downto 0); signal Left: Bit_VectodDATAPATH_WIDTH-I downto 0); signal Right: Bit__VectodDATAPATI-LWIDTH-1 downto 0); begin Pad_Input (XP_Left, Left); Pad_Output (XP_XBar; Right); process begin wait until XP_Clk'Event and XP_Clk = '1 '; Right <= PassThru after 2 ns; Passlhru <= Left after 2 ns; end process; XP_Right <= TriState (XP_Right); XP_Memory_Address < = TriState(XP_Memory_Address); XP_Memory_Data < = TriState (XP_Memory_Data); XP_Memory_Read <= '0'; XP_Memory_Write < = '0'; XP_HSI < = 'Z'; XP_HSZ <= 'Z'; XP_GOR_Result < = 'Z'; XP_GOR_Valid <= 'Z'; XP_Int < = '0'; XP_Xbar_Enable_L <= "11111"; end Left_to_XBar; Figure 4.10 (c). File “Parts.vhd” Left_to_XBar Architecture Declaration. 4O architecture XBar_to_Right of Processing_Part is signal PassThru: Bit_Vector(DATAPATH_WIDTH-I downto 0); signal Left: Bit_Vector(DATAPATH_WIDTH-1 downto 0); signal Right: Bit_Vector(DATAPATH_WIDTY-I—I downto 0); begin Pad_Input (XP_XBar, Left); Pad_Output (XP_Right, Right); process begin wait until XP_Clk'Event and XP_Clk = ’1 '; Right <= PassThru after 2 ns; PassThru <= Left after 2 ns; end process; XP_Left <= TriState (XP_Left); XP_Memory_Address <= TriState (XP_Memory_Address); XP_Memory_Data <= TriState (XP_Memory_Data); XP_Memory_Read < = '0'; XP_Memory_Write < = '0'; XP_HSI <= 'Z'; XP_HS2 <= 'Z'; XP_GOR_Result <= 'Z'; XP_GOR_Valid <= 'Z'; XP_Int <= '0'; XP_Xbar_Enable_L < = "00000"; end XBar_to_Right; Figure 4.11. File “Parts.vhd” XBar_to_Right Architecture Declaration. 41 Architecture XBar_to_Right as shown in Figure 4.10, feeds all 36 bits of data from the Crossbar port to the Right port. Data are latched in the Crossbar port on the first cycle, and latched in the Right port the next cycle, for a total of 2 cycles delay. 4.6 Xilinx Control Part Entity The Xilinx_Control_Part entity also supports the same two generics to specify the physi- cal location, although PE_ID is always bound to O. BD_ID : An integer specifying the board position within the system (0 to 12). The initial value of the signal is set to O. PE_ID : An integer specifying the computing element number (always 0) within the board. The detailed summary of each of the ports are as follows: XO_SHV[D : A bi-directional datapath that is the input from the common SIMD bus. XO_XB_Data : A bi-directional datapath that is the output to the crossbar. This path is shared with the XP_XBar data path of the PE X16. X0_Memory_Address : A bi-directional memaddress that is the 18-bit address signal to the external memory. X0_Memory_Data : A bi-directional datapath that is the l6-bit bi-directional data bus to and from the external memory. X0_Memory_Read_L : A bi—directional bit that is the external memory read signal. It is asserted (active low) by the Xilinx PE synchronous with the 42 address to initiate the memory read operation. Valid data are available on the clock edge following the assertion of X0_Memory_Read_L. XO_Memory_Write_L : A bi—directional bit that is the external memory write signal. It is asserted (active low) by the Xilinx PE synchronous with the address and the data to perform a memory write operation. X0_GOR_Result_In : A bi-directional vector that is the array of XP_GOR_Result sig- nals from each of the Xilinx_Processing_Parts on the Splash 2 board. X0_GOR_Valid_In : A vector that is the bi—directional array of XP_GOR_Valid signals from each of the Xilinx_Processing_Parts on the Splash 2 board. X0_GOR_Result : An output bit that is the output from X0 to the wire-ORed GOR_Result signal of the backplane. X0_GOR_Valid : An output bit that is the output from X0 to the wire-ORed GOR_Result signal of the backplane. X0_Clk : An input bit that is the global Splash 2 system clock. X0_XBar_Set : An output Bit_Vector (0 to 2) that is the three bit crossbar configuration selector. When X0 drives the value of “000”, the crossbar selects the pre-loaded configuration 0, “001” selects the configuration 1 and “002” selects configuration 2 and so on. X0_X16_Disable : An output bit that when asserted, causes Xilinx PE X16 to be isolated from the crossbar and allows X0 to send or receive data from the crossbar. 43 XO_XBar__Send : An output bit that when asserted causes the Xilinx PE X16 to be iso- lated from the crossbar and allows X0 to send or receive data from the crossbar. X0_Int : An output bit that is an interrupt output signal. The interrupt signals from all the 17 Xilinx PEs are combined with the 17th bit mask value to form the board interrupt signal to the interface board. X0_Broadcast_ln : An input bit that is the broadcast signal input from the Interface Board. X0_Broadcast_0ut : An output bit that is the common broadcast signal to each of the 16 Xilinx PEs. X0_Reset : An input bit that is a system level input to the Xilinx PEs, which when asserted, causes the Xilinx ICs to reset the state of all internal flip fiops.This port should not be used by the application programs. X0_HSO, X0_HSl : The signals of the two handshake registers are bi-directional signals. Signal X0_HSl is a common input to all 17 Xilinx PEs. X0_HSO is an independent bi-directional signal to each chip. X0_LED : an out Bit that is an active high output which controls the X0 LED on the back of the board. There is a pre-defined architecture for the Xilinx_Control_Part in the Splash 2 library. Blank : Completely passive architecture that should be inserted whenever the control X0 is not used by an application. The VHDL code for the Control_Part in the “parts.vhd” is shown in Figure 4.12 as fol- lows: library SPLASHZ; use SPLASH2.TYPES.all; use SPLASH2.SPLASH2.all; use SPLASHZ. C 0MPONENT S.all; entity Xilinx_Control_Part is Part (X0_SIMD) X 0_XB_Dataata X0_Memory_Address X0_Memory_Data X0_Memory_Read X0_Memory_ Write X0_Memory_Disable X0_GOR_Result_In X0_GOR_Valid_In X0_GOR_Result X0_GOR_Val id X0_Clk X0_XBar_Set X0_XI 6_Disable X0_Int X0_Broadcast_ln X0_Broadcast_0ut X0_Reset X0_HSI, X0_I-IS2 ),. end Xilinx_Control_Part; : bi-directional datapath; : bi-directional datapath; : bi-directional memaddr; : bi-directional memdata; : bi-directional bit; : bi-directional bit; : in bit; : bi-directional_vector; : bi-directional _vector; : out bit; .- out bit; : in bit; : out bit_vector(0 to 2); .- out bit; : out bit; : in bit; : out bit; : in bit; : bi-directional bit Figure 4.12. File “Parts.vhd” Control_Part Architecture Declaration. Architecture Blank is a completely passive control part and is shown in Figure 4.13. 45 architecture Blank of Xilinx_Control_Part is begin X0_SIMD X0_XB_Dataata X0_Memory_Address X0_Memory_Data X0_Memory_Read X0_Memory_ Write X0_GOR_Result_in X0_GOR_Valid_in X0_GOR__Result <= TriState(X0_SIMD); < = TriState(X0_XB_Dataata); < = TriState(X0_Memory_Address); < = TriState(X0_Memory_Data); <= ’0'; <= '0'; < = TriS tate( X 0_GOR_Result_In ); < = TriS tate( X0_GOR_ Val id_ln ); <= '0'; X0_GOR_Valid <= '0'; X0_XBar_Set <= "000"; X0_X16_Disable < = '0'; X0_Int <= '0'; X0_Broadcast_0ut <= '0'; X0_HSI <= 'Z'; X0_HSZ <= ‘Z'; end Blank; Figure 4.13. File “Parts.vhd” Blank Architecture Declaration. 4.7 Crossbar The crossbar connects a 36—bit bidirectional port of each of the 16 Xilinx processing part (X1 - X16). The control IC, X0, shares the crossbar port of X16. The crossbar can store upto eight configurations, with X0 selecting between them. The simulator model for the crossbar reads the control file which describes up to eight crossbar settings. The Xilinx_Processing_Part entity contains a 5-bit port to control the active-low output enables of its corresponding crossbar port. The control port is declared as : XP_XBar_Enable_L : out Bit_Vector (4 downto 0); 46 The least significant bit (XP_XBar_Enable_L(0)) controls the least significant byte of the crossbar (XP_XBar (7 downto 0)) while the most significant (bit 4) controls the most sig- nificant byte (XP_XBar (35 downto 32)). Thus, since the crossbar output enables are active-low, a ‘1’ on an enable signal implies the corresponding crossbar byte is receiving from the Xilinx IC. Similarily, a ‘0’ implies the crossbar is sending to Xilinx chip. The user must be careful not to allow both the crossbar and the Xilinx chip to drive these sig- nals simultaneously. The default state of XP_XBar_Enable_L therefore should be ‘1’. The Xilinx_Control_Part entity contains two ports, X0_X16_Disable and X0_XBar_Send, which control the shared crossbar connection. When X0_X16_Disable is high (logic ‘1’), the output enables associated with X16 are disabled, independent of the state of X16’s XP_XBar_Enable_L port, and the direction of the crossbar is controlled by the X0_XBar_Send: a logic ‘1’ causes X0 to drive into the crossbar, a logic ‘0’ enables the crossbar to drive to X0. When X0_X16_Disab1e is low, X0 is isolated from the crossbar. However, X0 cannot selectively control individual bytes of the crossbar, when X0_X16_Disable is asserted, X0 must drive or receive all 36 bits. The default architecture for X0 (Blank) sets both the outputs to ‘0’, isolating X0 from the crossbar. The crossbar model in the simulator is passed a generic parameter that contains the name of the file, which consists of up to eight settings. The first line of the setting consists of the key word “configuration” followed by the integer value 0 to 7 (both inclusive). The remaining lines of the file are of the form: output-port-number input-port-specifier 47 where the “output-port-number” ranges from 1 to 16 and corresponds to the Xilinx PE number. The “input-port-specifier” is either a single integer (0 to 16), or 5 integers (0 to 16 each). If it is a single integer in the range (1 to 16), it specifies the source port of all 36 bits of the output. A value of zero is used to indicate port “output-port-number”. This value is the input to the crossbar switch. If the “input-port-specifier” consists of S integers, each of the integers specifies the source for one byte of the output, from the most significant to the least significant bit. If an output port number is missing from the configuration, it is assumed to be set to zero. Also, in order to disable the crossbar port, both the “input-port-specifier” and the XP_XBar_Enable_L signals must be set to ‘0’ and ‘1’ respectively. Since the crossbar configurations are defined from 0 through 7 (both inclusive), they can be defined as follows : dd 50 $1 32 s3 54 where dd represents the destination PE identity, 50 represents the source (processing ele- ment) for the bits (0...7), 31 represents the source for the bits (8...15), 52 represents the source for the bits (16.23), 83 represents the source for the bits (24...31), and s4 repre- sents the source for the bits (32.35) respectively. 48 For example configuration 0 may be defined as: 16 16 l6 16 16 16 l6 l6 16 16 16 16 16 16 16 16 l6 16 16 16 l6 l6 l6 l6 l6 16 16 16 16 16 16 16 l6 l6 16 16 16 16 16 l6 l6 l6 l6 16 16 10 16 16 16 16 16 ll 16 16 16 16 16 12 16 16 16 l6 16 10 16 l6 16 16 16 13 l6 l6 16 16 16 14 16 16 l6 l6 16 15 l6 16 16 16 16 16 16 16 l6 l6 l6. \OOO\)O\UIJ>U)Nv—‘ In this example, the X16 represents the X0 data since X0 and X16 both share a common port on the crossbar. Thus 1 16 16 16 l6 16 represents “PE X1 gets data from PE X16 (X0)”. Also whenever the ConfigXBar file is invoked by the T2 debugger, multiple con- figuration files, as shown, program the crossbar with the appropriate configuration files (maximum of eight configurations). 49 Configuration 1 may be defined as : l 1 l 1 1 1 PE X1 gets no data from the data bus 2 l6 l6 l6 16 16 PEXZ getsdatafromPEXO 3 3 3 3 3 3 PE X3 gets no data from the data bus 4 16 16 l6 l6 l6 PEX4getsdatafrom PEXO 5 5 5 5 5 5 PE X5 gets no data from the data bus 6 16 16 16 16 16 PEX6getsdatafrom PEXO 7 7 7 7 7 7 PE X7 gets no data from the data bus 8 16 16 l6 l6 l6 PEX8getsdatafrom PEXO 9 9 9 9 9 9 PE X9 gets no data from the data bus 10 16 16 16 l6 l6 PEXIOgetsdatafromPEXO 11 11 11 ll 11 ll PEXIIgetsnodatafromthedatabus 12 16 l6 16 16 16 PEXIZ getsdatafrom PEXO 13 13 13 13 13 13 PEX13getsnodatafrom thedata bus 14 l6 16 16 16 16 PEXI4getsdatafrom PEXO 15 15 15 15 15 15 PEXI5getsnodatafrom thedata bus 16 16 l6 l6 16 16 PEX16getsdatafromPEXO. In this example, 1 l l 1 1 1 represents “PE X1 gets no data from the data bus” however, 2 16 16 16 16 16 represents “PE X2 gets data from PE X0". Thus only the even number of PBS receive data from PE X0. Also, whenever the ConfigXBar file is invoked by the T2 debugger, multiple configuration files, as shown above, program the crossbar with the appropriate configuration files (maximum of eight configurations). The different configu- rations are selectable from the control PE X0. A port, X0_XBAR_SET, tells the crossbar which of the configurations is to be used. In case multiple configurations (greater than six- teen) are to be selected, multiple ConfigXBar calls can be made. In the following example, the crossbar settings (0 and 1) are defined. In the first setting, the odd number of the Xilinx PEs drive the crossbar, and the even numbered Xilinx PEs receive the data from the crossbar. In the second setting, only PE X2 receives, but it is receiving only one byte from each of the PEs X9, X7, X5, X3, and XI. The crossbar ports 50 of PBS X3 and X16 are explicitly disabled. The file containing the configuration of the crossbar is passed to the crossbar model in the Splash 2 configuration file “config.vhd” using the following lines of code to the Splash 2 structure: for Xbar : Splash_Crossbar use entity SZBoard.Splash_Crossbar (Behavior) generic map (Config_File => “test.xbar”); end for;. 4.8 Memory The memory model used by Splash 2 simulator allows the user to specify a set of initial contents for the on-board memory from an ASCII input file. The name of the file is passed in as a generic parameter to the entity Memory_Part. The default value (a null string) implies that no initialization of memory is performed. During simulation, a read operation from an uninitialized and unwritten memory location will lead to an error. Thus, in order to specify the initialization of the memory, the file name must be passed on to the appro- priate Memory_Part through the Load_File generic in the “configvhd” file. The format of this initialization file is fairly straightforward. For each set of the contiguous blocks of data, the base address of the block is given followed by one or more data values. The base address specification is the line containing the address followed by the decimal, unsigned integer. The data values are specified beginning on a new line, each line may specify any number of values. The data values are also interpreted as unsigned decimal 51 values. For example, to load the first eight locations of the memory with consecutive inte- gers, the following file may be specified: address 0 1 2 3 4 5 6 7 8 9 10. Similarily, multiple blocks of the data may be specified within a single file separated by an address specification. For example : --First block of data address 0 1 2 3 4 5 7 8 9 10 11. «Second block of data address 255 2 3 5 7 ll 12 13 17 «End of file. Alternately, in case the entire memory has to be initialized to zero with a single line in the file consisting of the key word “clear”: clear --End of file. 4.9 Summary This chapter provides a description of the programming environment of Splash 2 and the files that are required to emulate designs on field-programmable gate arrays. The files described allow configuration of the crossbar and the processing elements based on the application. Chapter 5 Bar Code Identification On Splash 2 If a conceptual description is to be made, the machinery for making it ought to show itself in language. If a distinction cannot be made in language, it can not be made concep- tually. N. R Hansen 5.0 Bar Code Identification As per the earlier discussion on bar codes and their ever increasing applications, it is highly desirable to design a bar code recognition system which can automatically scan the bar code and decode the relevant information from it. In such a system, it is necessary to first obtain an estimate of the location of the bar code. Locating or localizing a bar code from an image can be a highly complex task, as the image may consist of text and graphics in addition to the bar code. 5.1 Texture Analysis A bar code represents a homogeneous region in an image that can be distinguished in terms of texture. Conventional segmentation techniques cannot be applied directly to a textured image as the gray scale levels of individual pixels do not provide adequate infor- mation. Contextual information describing the spatial relationship between the gray levels 52 53 of the pixels is necessary. 5.2 Multi-channel Filtering Jain and Farrokhnia [11] have developed a texture segmentation system based upon a multi-channel filtering technique using gabor filters. This system has been applied to doc- ument image processing applications such as text/graphics separation and address block identification on envelope and magazine covers [14]. The algorithm used for the texture segmentation based on gabor filters is shown in the Figure 5.1. This algorithm can be decomposed into five modules: (i) filtering, (ii) nonlinear transformation, (iii) feature computation, (iv) clustering, and (v) classification. The filtered images are obtained by applying a bank of gabor filters with different fre- quency and orientations to the input image. The filtering technique in the spatial domain can be successfully applied on the image using Splash 2. 0 By using a k x k mask, g(m,n) for 1 S m, n S k , each filtered output pixel is computed usrngbi’j= 2 2 gm.n.ai-m-i-n n=l,km=l,k ° The nonlinear transformation, being a point operation can be performed using a look-up table technique. This is a processing module that can reside on the PBS of Splash 2. Thus, based on the nonlinear function, f, each transformation pixel is defined as ci’j = f(bi’j). 54 0 The feature images define the “local energy” measure in a small window around each image pixel in the response image. Computation of the feature image is a neighborhood operation and is efficiently mapped onto Splash 2. 0 The square—error clustering algorithm is performed on a set of pixels which are randomly sampled. 0 The final segmentation is implemented by applying the minimum distance classifier to every pixel using the cluster center information generated by the square error clustering algorithm. Input Image l l | l I Gabor Filter Masks l l H} l l I ' * Filtered Images lit e l l l M—H Feature Images 1 (local energy) l l Classification Fig. 5.1 An Overview of the Bar Code Localization Algorithm. 56 5.3 Unsupervised Segmentation Method 5.3.1 Implementation Using Serial Code The pixels in the bar code region display rapid tonal variations along one direction. Thus, high radial frequencies are useful to identify these variations. In case of the unsupervised method, the pixels of the image are not classified into classes. The two highest radial fre- quencies 32./i and 64/2 c.p.i. (cycles/image) for a 256 x 256 image and four orientations (0, 45, 90, and 135 degrees) are of interest. The eight filtered images as shown in Figure 5.3. The 256 x 256 image is scanned at a resolution of 100 dpi. The input image is passed through each of the eight filters to generate eight images. The filtered images further gen- erate feature images. Figure 5.5 gives the mean values of the eight gabor filter features shown in Figure 5.4. There are three different lines representing three different texture regions present in the input image. The regions are bar code (solid line), text (dashed line), and picture/half tone (dotted line) regions. The difference in the responses to the filters of the bar code and the non-bar code regions is apparent in Figure 5.4. When 9 = 90° (filters: 3 and 7), the feature corresponding to the bar code has the lowest response, while when 0 = 0° (filter nos: 1 and 5), it has the highest response. The difference between the mean value of the adjacent filters for the bar code region is larger than the one for other regions. This characteristic of the bar code dis- tinguishes it from other non-bar code regions in the input image. 57 The unsupervised segmentation algorithm consists of the following steps: 0 Input the bar code image through a bank of gabor filters to obtain eight features for every pixel. Normalize these features to zero mean and unit standard deviation in order to avoid some features dominating the others. ° Cluster the normalized pattern vectors using a squared-error clustering algorithm [11]. In order to reduce the computation overhead, the algorithm is applied to 6% of the ran- domly chosen patterns to obtain the centroid of each cluster. 0 Assign all the patterns (pixels) into k clusters using the minimum Euclidean distance classifier; k is predefined. ' Identify the bar code region in the segmented image using the following heuristic: per- form the connected component analysis, and remove all “small” or “large” connected components. Another heuristic, based on the aspect ratio, can be used if more than one connected component survives the size heuristic. 0 Finally, an identifying or bounding rectangle is computed for the identified bar code region. 5.3.2 Unsupervised Segmentation Since we specifically need to localize the bar code from the background, a priori knowl- edge can be used to design a two—class supervised classifier. This not only results in a bet- ter classification accuracy, but requires less computation time. However, the eight gabor filter features used in the unsupervised segmentation method are not inherently rotation- invariant. Although this is not a problem for the unsupervised segmentation, using these 58 features in the supervised method does limit the applicability of this method to bar code images with a specific orientation rotation. Thus, it becomes important to find features with rotation-invariance property. 5.3.3 Rotation-invariant Features An important characteristic of bar code is that the filter with its orientation perpendicular to the bar code always has the highest response (see Figures 5.3), and the filter with orien- tation along the bar code lines always has the lowest response (see Figure 5.3). The filters with orientations in the middle of the above two filters (45, and 135 degrees) have almost the same response (see Figure 5.3). Suppose that we have gabor filter for every orienta- tion. After taking the difference between pairs of orthogonal features, one of the differ- ences should give the highest mean value for the bar code region, while another one should give the lowest mean value. These two differences are rotation-invariant. In order to obtain more accurate results, more filters with additional orientation selection property are needed. However, a large number of filters will increase the computation time. Thus, a trade-off between the number of filters and computation time. For each image, the new feature images at eight different orientations are generated. The orientations are 0°, 225°, 45°, 675°, 90°, 112.5°, 135°, 157.5 °, respectively. The algo- rithm for generating the rotation-invariant features for each radial frequency has the basic three steps: 0 Generate eight gabor features, f(0), corresponding to the eight orientations. ' Compute the difference between the pairs of orthogonal features: 59 Af0=|f(0i)-f(9i+90°)|,i=1,2,3,4 where 91 = 0°,92 = 22.5°,93 = 45°, 94 = 675°. 0 For each i, i = 1, ..., 4 , compute the mean value over a window centered at each pixel, then select v9. which has the maximum value: 1(x0, yo) = Z (Afep (x+x0, y +y0 ) —Afe_ 59 (x, y) ),where W is a window (x,))) E W I 11:4 of 50 x 50 pixels (less than the size of the bar code). Using A f 9* + 45° for i = l, 2 and A f 9* _ 45., for i = 3, 4. Then, select 0k , which has the maximum value of v among the values {v9 , ..., ve } . The new rotation-invariant features are: 1 4 F1 = A f 9* and F 2 = A f e+45°' Using two highest radial frequencies, a total of four rotation invariant features are available for the supervised segmentation method of analy- sis. .ll lllllllllllllll. Figure 5.2. Bar Code Image. hill!!!) 1 3I til. 1.... u .s: \ : “If“ T?” ‘ (i) (ii) (iii) (iv) (v) 9 (vi) (vii) (viii) Figure 5.3. Filter Images Generated from the Bar Code Image as shown in Figure 5.2. 61 $.7m-m"-B (V) (Vi) (vii) (viii) Figure 5.4. Feature Images Generated from the Filtered Images as shown in Figure 5.3. 62 9 Q 1 Q) 2 <0 - (U > C 3 v 2 N '1 o I T r I 2 4 6 8 FilterNo. Figure 5.5. Mean Feature Values for the Feature Images. (Each line represents a different texture region for the input image.) 63 5.3.4 Supervised Segmentation Method The supervised segmentation method uses the two highest radial frequencies with eight orientations. Each of the frequencies has four features calculated by taking the difference of pairs of orthogonal features. Applying window averaging operation on these different images enables selection of the difference image with the maximum mean value and its corresponding difference image with the minimum value. Four features, two for each radial frequency, are obtained. A one-layer feedforward neural network (4 input units, 1 output unit) performs the simple two-class classification (bar code versus non-bar code). The training patterns for the neu- ral network are picked randomly from a set of rotated versions of the same bar code image. The number of training pixels for each image is 3000. The network is trained to perform 2000 iterations of the weight vectors. After the training is completed, the same network is capable of classifying other bar code images. Automatic bar code localization based on multi-channel gabor filtering technique is applied to the bar codes successfully. Both supervised and unsupervised methods are robust to variations in the orientation, scale, and shift of bar codes. The filtering and the feature extraction operations in the algorithm account for most of the required computations. However, regardless of the number of filters used, these algo- rithms can be implemented to perform operations in parallel. The use of non-linear trans- 64 formation following the linear filtering operations has been suggested as one way to account for the inherent nonlinear nature of the biological visual systems [12]. The local- ized filtering operation using gabor filters followed by the “squashing” nonlinear transfor- mation can be interpreted as a multi-scale blob detector [1 l]. The limitation of the texture segmentation algorithm is that it lacks a criterion for choos- ing the value of Cl in the nonlinear transformation. The value of or = 1/ 4 is used for the bar code localization. Also, the algorithm assumes that different channels are independent from each other. However, there is a psychophysical and physiological evidence indicat- ing inhibitory interactions between different spatial frequency channels [11]. Some of the research incorporates such interactions in their experiments [11]. Allowing inhibitory interactions among the channels is shown to have the capability of reducing the effective dimensionality of the feature space [12]. 5.4 A Suggested Approach to Bar Code Identification On Splash 2 The previous section describes the serial implementation of bar code analysis. Extending the concept to field programmable devices, the suggested approach to bar code analysis on Splash 2 can be sub-divided into the following basic steps: - Step 1: Define lockup tables in the local RAMs of multiple PEs. Load the Gabor filter mask data as the entries to the tables (see Figure 5.6 (a) Step 1). 0 Step 2: Apply convolution to the image data using the filter masks data stored in the lookup tables to obtain a set of filtered images. The convolution of the bar code image can be performed in the PEs whose local RAMs store the mask data (see Figure 5.6 (a) 65 Step 2). ' Step 3: Apply a non-linearity to the set of filtered images to obtain the feature images of the bar code images. This can be achieved by a table lookup performing convolution of the filter image data with a hyperbolic tangent function (see Figure 5.6 (b) Step 3). The trigonometric function data for the convolution resides in the local RAMs as lookup table entries. The resulting image data corresponds to the feature data for each of the filtered bar code images. The hyperbolic tangent function can be expressed in terms of an exponential function (see Figure 5 .8). 0 Step 4: Classify the image feature data using a simple perceptron implementation in VHDL (see Figure 5.6 (b) Step 4). Altemately, the feature data can be returned back to the Sun Sparcstation to perform the classification of bar code data. Each of the above mentioned steps have been illustrated in Figures 5.6 (a) and (b) and described using VHDL code in Figures 5.7, 5.8, and 5.9. RAM 66 RAM Gabor Mask LUT Gabor Mask LUT RAM ,l 3 PE 3 Gabor Mask LUT L l H . H . Step 1: Load Gabor mask data into the local RAMs of the PEs RAM RAM Gabor Mask LUT Step 2: Load bar code data into the PBS and perform convolution with Gabor masks i Gabor Mask LUT RAM l H FE i Crossbar Gabor Mask LUT - l H mi l I Figure 5.6 (a). Steps Involved in Indentifying Bar Code Images on Splash 2. 67 RAM RAM RAM tanh () tanh () tanh () LUT LUT LUT +1 H m H . l I Crossbar Step 3: Perform convolution of the bar code filter images with tanh () LUT stored in RAM to obtain feature images RAM Weights LUT PE Simple Perceptron Step 4: Perform classification of the feature images using a neural network classifier Figure 5.6 (b). Steps Involved in Identifying Bar Code Images on Splash 2. 68 The lookup tables illustrated in Figures 5.6 (a) and (b) can be implemented efficiently as shown in Figure 5.7. subtype Byte_Int is Integer range -128 to 127; subtype Index_1 is Integer range 0 to 12; subtype Index_2 is Integer range 0 to 12; type 7Table is array (Index_1, Index_2 ) of Byte_Int; ( VHDL arrays are assumed to be allocated in row-major, last-index varying fastest, order.) --Assuming that Index;2 is ascending type Tfable is array (Index_1, Index_2 ’Low to Index_2 ’High/Z ) of Integer; - -Rou tine to access the array fitnction Tablelookup (TableflTable; I_I : Index_1; I_2: Index_2) return Byte_Int ts variable Temp: Integer := TI‘able (I_I, I_2/2); variable Bias: Integer := I_2 mod 2; begin if Bias = 0 then - - Need lower half of temp: return Temp mod 256; else - -Need upper half of temp; return Ramp/256; end if end Table Lookup;. Figure 5.7. Table Lookup Architecture Declaration for Storing Data in the Local PE RAMs. 69 The VHDL code for the exponential function is shown in Figure 5 .8. function EXP (X: real) return real is --returns e**X; where e =MATH_E -- This function computes the exponential using the following -- series:exp (x) = 1+ x + x**2/2! + x**3/3! + ...; x > 0 constant MATH_E; real := 2.71828_18284_59045_23536; constant Eps : real := 0. 000001; -- precision criteria variable Reciprocal: boolean := x < 0.0; -- check the sign of the argument variable Xlocal : real := abs (x); -- use positive value variable Oldval : real; -- variables for series evaluation variable Num : real; variable Count: integer; variable Denom : real; variable Newval : real; begin -— compute value for special cases If X = 0.0 then return 1.0; else if X = 1.0 then return MATH_E; end if; -- compute value for general cases Oldval := 1.0; Num := Xlocal; Count := I; Denom := 1.0; Newval := Oldval + Nam/Denom; while ( abs(NEWVal - Oldval) > eps ) loop Oldval := Newval; Num := Num * Xlocal; Count := Count + I; Denom := Denom * (real ( Count) ); Newval := Oldval + Num/Denom; end loop; if Reciprocal then Newval := 1.0/Newval; end if; return Newval; end EXP;. Figure 5.8. Exponential Function Architecture Declaration. 70 5.5 A Suggested Mapping of Simple Perceptron onto Splash 2 Lately, artificial neural networks have been extensively simulated on conventional com- puters. They are capable of providing solutions to complicated problems. However, highly parallel architectures like Splash 2 can be optimized to accommodate these neural models. The simple perceptron neural network model has been used for pattern recognition and robotics [15]. The architecture suggested for implementation on Splash 2 is designed to exploit the concurrency of the neural networks [41]. The processing elements on Splash 2 are capable of performing multiplication (as a table lookup), addition, and function evalu- ation, while the local RAMs are capable of storing I/O data. The host supplies the data and receives trained data results from Splash 2. The three major entities required are the controller configuration, the neuron configura- tion, and its associated RAM based LUT for storing the weight values (see Figure 5 .9). In order to perform a simple classification, the Splash 2 is configured as a SIMD machine. The PEs are configured to allow a single layer of the algorithm to run on Splash 2. The PEs local RAMs contain the initial weights and thresholds. The contents of the local RAMs are duplicated to allow broadcast to the left and right processing elements. Finally, the neural network based classification has the PEs perform the sum of products using the broadcast values and their corresponding local weights. Once all the input pat- terns of the bar codes are presented, the final sum of products plus the threshold is com- 71 puted and the sigmoid function is performed on all the PBS. The number of physical neurons may be limited by the number of FPGAS available on the Splash 2 array board. Sum of products operation takes place in parallel as the required weight values are avail- able locally. entity Local_ram is port (XP_Data: inout Datapath; XP_Address: in Address; XP_Clk : in Bit; XP_Control : in Bit; ),' end Local_ram; architecture Behav of Local_rarn is type Ram_array is array (1 to memsize) of float_num; signal Count: integer; begin process file In_datajile: TEXT IN “./data_file_1 ”,- file 0ut_data_file: TEXT OUT “./data_file_2 variable In_line, 0ut_line: line; variable Weight: float__nurn; variable V_address: addr; variable L_rarn: Ram_array; begin wait on control, address, clock; if control = load then «load data to memory while not (endfile (In_datafle» loop readline (In_datajile, In_line); read (In_line, V_address); read (In_line, Weight); L_ram (intval(V_address)) := Wright; end loop; Figure 5.9. local RAM LUT for Weight Values. 72 control = read then -- read from memory data <= L_ram ( intval( address) ); else -- write to memory L_ram ( v_intval( address ) ) : = data; write (0ut_line, Weight); write (0ut_line, V_address); writeline (0ut_data_file, 0ut_line); endif; end process; end behav; Figure 5.9 (continued). 5.6 Summary This chapter describes the approach to bar code identification on Splash 2 based on the multi-channel filtering technique of image analysis. The VHDL synthesis code for this approach suggests that the technique can be extended to implement bar code identification on Splash 2. Chapter 6 Mapping the Application onto Splash 2 A hard beginning maketh a good ending. John Heywood 6.0 Design Implementation This chapter provides a detailed study of the programming environment of Splash 2 and the steps involced in mapping and application onto PFs. Using the Synopsys Design 'lbols, the VHDL code is combined into one interface file, the EDIF file. Next the EDIF file is converted to a Xilinx Netlist Format (XNF) file. The circuit is partitioned into LCA blocks with consideration given to the user-supplied constraints. The fourth phase consists of placing and routing these partitioned LCA blocks. The run—time environment for the design implementation is described in the next section. 6.1 Run-time Environment The run-time environment of Splash 2 is composed of the Splash 2 environment variables, the C runtime libraries, and the symbolic debugger T2. This environment provides the user interface for executing and debugging Splash 2 applications. 73 74 6.1.1 Splash 2 Execution 6.1.1.1 Initializing the Splash 2 Environment The environment for Splash 2 is established by sourcing the setup script, namely using the setenv and set path commands. The environment variables defining the default bit- stream file for the Control element, the default bitstream file for all the other processing elements, and the binary symbol table may be set optionally. 6.1.1.2 Preparing the Splash 2 Application An application that is implemented on Splash 2 uses the scripts from the defined directory in the following sequence is: vhdl_to_ediffilename.vhd edif_to_xnffilename.edif xnf_to_bit filename.xnf The script vhdl_to_edif command invokes the shell script of the Synopsys synthesis tool. The user defined VHDL application program goes through the three processes: reading the source code (“analyzing” and “elaborating” in VHDL terminology), optimizing the source code (or “synthesizing” the logic), and writing the source synthesis to an output file (in the EDIF 2 0 0 netlist format [1]). The FPGA architecture imposes a different kind of constraint on synthesis tools in com- parison to ASICs. The XC4000 devices consist of an array of CLBs. Each CLB has func- tion generators (i. e., lookup tables) and registers. By default, the Design Compiler and 75 FPGA Compiler do not flatten out the design, and thus designs should be flattened to elim- inate all of the existing logic structure. F lattening is recommended when the number of CLBs needed to implement a boolean function seems too high or there are too many logic levels. However, highly regular structured designs (such as adders and ALUs) that have been designed with an explicit structure should not be flattened. Flattening is especially useful for FPGA CLB structures. The FPGA Compiler has a built in optimizer for boolean logic. In order to use the algorithm effectively, the structure must be decomposed so that the boolean logic can be mapped into the CLB function generators. The portions of the design that are not going to change are recommended to be flattened. Any changes made using an optimization tool to a module can change the entire module. Thus the old module will no longer serve as a guide to the changed modules partitioning, placement and routing. 6.1.1.3 The Ungroup_all and Boundary_optimization Options The CLBs contain both boolean logic implementation in Function Generators and regis- ters. When a hierarchial design which uses a DesignWare module is compiled, the logic is NOT optimized across the hierarchial boundary. This means that some of the combinato- rial logic will end up with unused flip-flops while the CLBs that implement the registers will have unused function generators. Also the boolean logic from one hierarchy to another will not be optimized to reduce the CLB area or logic levels. 76 The choice of hierarchial boundaries impact the quality (i. e., area. or speed) of the synthe- sized design. By using the Design Compiler, designs can be optimized while preserving the hierarchial boundaries. However these optimizers help to improve designs only par- tially. Complete optimization of the design is possible across parts of the design hierarchy that are flattened in the Design Compiler. A quick check to find out if the design has been partitioned suboptimally is to first opti- mize the design with hierarchy and then compare the results with those obtained by opti- mizing the design without hierarchy. 6.2 General Compiling Methodology 6.2.1 Compiling Without Constraints The wireload model of the application may not represent the application accurately. Thus it becomes difficult to set timing constraints in Design Compiler or FPGA Compiler that correlate to the actual end results (after the PPR tools are executed). Thus in order to set realistic constraints, the design should be compiled without any constraints. 6.2.2 Setting Design Constraints Using the max _period, max_delay or max_area commands the timing of the design or the area constraints are set up. Thus the typical design will include the following commands: max _period 50.0 CLK 77 set _flatten true -efiort medium compile -map-efiort me -ungroup_all. 6.3 VHDL_to_EDIF The command syntax for vhdl_to_edif command is specified using the parameters top level design entity, the specified output file, bounds for the board, bounds for the process- ing elements. The specified output file is created with a “.edify” extension, and the synthe- sis tools are invoked separately for each unique pair of the two bounds. 6.4 EDIF_to_XNF Conversion The EDIF netlist produced by the Synopsys synthesis tools is translated into XNF by the program edif_to_xnf. In the process of translating the edif file to the netlist format, the VHDL port names get assigned to physical pins, IOB logic gets synthesized along with the global clock buffers, hard macros are identified and flip flop enable inputs are exploited, and the STARTUP block is inserted to connect the global tristate and the global reset signals. 6.4.1 Environment The design environment for the mapping of the applications includes the following files: 78 edif_to_xnf : The command loads three files (subroutines) from the library. Each of these subroutines is in a Common Lisp file. xilinx.lisp : The file contains the cell name and pin name translations and the pin directions for the set of Xilinx XC4000 logic primitives. The table is represented as a Lisp association list (alist). Each entry in the alist is the list that represents the primitive component. The first element of an entry is the name of a primitive in the Synopsys technology library; the sec- ond element is the name of the corresponding Xilinx primitive; and the rest of the entry is a list of the pin identifiers. Each pin identifier is a list with three parameters: the first ele- ment is the pin generated by the Synopsys tools; the second element is the corresponding Xilinx pin name; and the third element is the direction of the pin. xhard.lisp : The file contains a list of the Xilinx 4000 hard macros. xpins.lisp : The file contains tables of design port name to physical pin name translations for each of the Splash 2 computing elements. These tables are represented by the Lisp association lists (alists). The CAR of each element is a port name as it appears in the EDIF netlist; the CDR is a list of the physical pin number or number, optionally followed by a set of properties use to define the global signals. 79 6.4.2 Execution edif_to_xnf may be invoked from the Unix command line using the command with the parameters output (.xnf) file, input (.edif) file. The.edif extension on the input file is optional, the XNF output is written to the standard output. In addition, the Xilinx part name (4010), is specified as used in the Splash 2 system, the library directory in which all the related files are defined, the start-up block definition option. 6.5 Xilinx Placement and Routing The FPGA vendor, Xilinx Inc., has provided the software for the placement and routing, ppr; makebits, the software that converts the placed and routed netlist into the binary FPGA configuration format; and xact, the graphical tool for manual editing of the FPGA configuration. 6.5.1 Environment The environment for the Xilinx tools is established by same setup script as that used by the Synopsys software. This script allows for the definition of the several environmental vari- ables that identify the locations of the executable and library files and also modifies the user’s search path to include the Xilinx executable directory. 80 The script in addition binds the environment variable XACI‘ to point to the directory con- taining the latest release of the Xilinx software, invokes a Xilinx shell script and deter- mines the amount of physical memory available for the tool. 6.5.2 Execution A single shell script, xnf_to_bit, is available to facilitate the conversion from Xilinx Netlist Format (XNF) to Xilinx bitstream format (.bit). The script is invoked using this command with the parameters timing optimization, PPR/LCA option for makebits, and XNF basename design file. If PPR is executed, xnf_to_bit makes the bit stream by invok- ing makebits, and symbol_to_symbol to produce the bitstream and symbol tables. The Xilinx PPR, may be invoked from the command line using the parameters defining the base name of the file containing parameter/value pairs. PPR reads the file design.xnf and if it is defined in the current directory, the constraint file, and produces the following set of the files: ppr.log : A transcript file of the output screen that is produced by PPR. designJca : The file that defines the placed and routed LCA file generated by PPR. designxpt : A report file that includes design statistics, pin locations, and the information related to optimization. 81 In case the timing parameter is specified as ignored, the timing optimization phase of the routing is ignored, and this helps producing inefficient results quickly. The program xdelay performs the static timing analysis of a placed and routed LCA file. Xdelay produces a detailed list of all the critical path delays from one clock edge to the other, from the pin to the clock, and clock to the pin. Xdelay is invoked by defining the parameters base name of the file produced by PPR, redirect option for the output report to the file, short format timing option, or long format timing option. The program makebits is used to create configuration bitstreams from the placed and routed LCA files produced by PPR. Makebits may be invoked from the command line by defining the parameters for the base name of the LCA file produced by PPR, the design bit file, enable internal logic used by the readback option, tie unused internal interconnect so as to reduce noise and the power consumption, logical pin map file to produce a T2 sym- bol table. The output generated by PPR discusses the design implementation of the broadcast of the pattern found in the memory to the crossbar switch by a processing element based on the handshake signal.The PPR Report file contains important information in the following cat- egories. 0 Partition, Place and Route Summary 0 Chip Pinout Description 82 0 Critical Nets 0 Feedthrough Split Nets 0 Deletion Traceback The Partition, Place and Route Summary includes the number of occupied CLBs that cor- responds roughly to the total area number provided as part of the Synopsys report. An example of a PPR report file is shown in Figure 6.5. 6.6 The Xilinx Synopsys Interface (XSI) The XSI design toolkit allows Xilinx FPGA designs to be implemented using either the Synopsys FPGA Compiler or Design Compiler software. Synopsys HDL synthesis soft- ware creates and optimizes circuit designs from hardware description languages such as VHDL. The FPGA Compiler uses libraries supplied by Xilinx, and VHDL that you create and produces Xilinx, and VHDL files that you create and produces Xilinx Netlist Format (XNF) files for the XC4000 family. The FPGA Compiler is automatically invoked if the FPGA Compiler-specific libraries are specified as the target and link libraries. The Design Compiler uses libraries supplied by Xilinx for the XC4000 families to produce EDIF files. These EDIF files are translated to XNF files which are used to configure FPGAS. The design flow of VHDL-to—XC4000 as shown in Figure 6.1 comprises of the following steps: 0 design analysis of the VHDL file using the Design Analyzer 0 conversion of .vhd file to a .sxnf file 83 - conversion of the .sxnf file to a .xnf file - conversion of the .xnf file to ppr file 0 conversion of the ppr file to a .lca file 0 conversion of the .lca file to a .bit file to download to the LCAs on FPGAS. .vhd file l Design Analer l .sxnf file 1 .xnf file 1 ppr file I .lca file I makebits file—h XChecker ,___> _report.out Figure 6.1. Mapping of an Application onto Splash 2. 6.6.1 Compiling the Design The optimization process is part of the Compile command. Optimization is guided by the 84 series of constraints specified. Mapping of the technology transforms the Boolean logic network to a set of interconnected gates that are selected form the technology library. 6.6.2 Interpreting the Results The timing results specified by Synopsys are in nanoseconds and are estimates determined prior to the placement and routing. Accurate timing is not possible until after PPR is run on the design. Synopsys estimates timing delays to include the wire-load delays in addi- tion to the gate delays. The results obtained using PPR are better than the pre-placement and routing delays estimated by Synopsys. The area reported by Synopsys is for three parts: combinatorial area, non-combinatorial area, and total area. The area reported is in terms of the number of Xilinx Combinatorial Logic Blocks (CLBs) used. The CLB consists of four input function generators and two flip-flops. The flip-flops and the 2-,3-, and 4-input Boolean functions are weighted for area (assigned a value) at 0.5 CLBs. The 5-input primitives are weighted for area at 1 CLB. Register intensive designs require the number of CLBs to be equal to the non-combinato- rial area reported. If the design is highly combinatorial, then the number of CLBs required is roughly equal to the combinatorial area reported. The CLBs required by the design can be accurately computed only by the PPR software. 85 6.7 The Execution Process Each processing element of each array board is associated with a bitstream file to be exe- cuted and a symbol table for the evaluation of the associated symbols. If previously pro- duced merged raw load files for the application exist then the assignments will be automatically made when the specified raw files are loaded. In case new raw load files are defined then default bitstream files will be assigned for the PEs not associated by the user based on the environment variables. Also, in case the user has not provided a bitstream assignment or a default for a PE, then the default system files are used. The bitstream files are merged into a raw load file via a “comerturn” operation. Once the “rawfile” is saved to disk, an additional file that records the association of the PE—bitstream-symbol table files. Each array board may load a unique raw load file. Thus an array board that is being used for an application must be configured prior to exe- cution. The sequence of operations related to the process are namely, download the raw load arrays, download the crossbar configuration files if used. 6.8 T2 Debugger The Splash 2 debugger performs a large number of tasks including initialization, associa- tion and merger of the bitstream data and configuring of the Splash 2 array boards. The initialization of Splash 2 includes opening, resetting, and setting the information lev- els of the Splash 2 array boards. Association of the Splash 2 includes assigning the bit- 86 stream to the processing elements, reading and writing the Splash 2 merged bitstream to a file. Configuring the Splash 2 includes downloading the merged bitstream, downloading the crossbar configuration to the Splash 2 array, and clearing the PE local memory by ini- tializing it with a zero fill. The interface board registers are also accessible using the T2 Debugger. The registers are addressed using an identifier and an appropriate read/write command for the related “class” of the register. Typically, the read and write commands may be specified as fol- lows, RD identifier WR identifier, value. A typical T2 Debugger session includes T2 initialization (start-up), Splash 2 unit initial- ization and execution of the Splash 2 array system. Execution of the Splash 2 array sys- tem involves stepping through the program, debugging the program and finally shutting the Splash 2 system down. 87 6.9 T2 Debugger Commands The T2 debugger command consists of the syntax, command unit-list, board-list, pe-list, arg-list. The unit-list and the board-listt specify the Splash 2 array unit and the board address of the corresponding Splash 2 array system, respectively. The pe-listt specifies the processing element related to the Splash 2 array system. In order to execute an application using T2, the process involves creation of the related T2 customizing file, creation of the application start—up file, and initialization of the application using the “source” files. The T2 Debugger supports commands categorized as miscellaneous functions, readback functions, message functions, file and data stream functions, including: Readback unit-list, board-list The status of the Splash 2 boards are readback using this command in order to determine the current state of the hardware. This command allows the symbol table readback for each of the PEs on the specified boards. Ascii_to_Bin file The binary data file is converted to an ascii file. Clear_Memory unit-list, board-list, pe-list The memory contents are cleared and the addressed memory is initialized to contain zeros. 88 Conflgure_Array unit-list, board-list The selected Splash 2 array board is configured using the downloaded bitstream data. Configure_CrossBar unit-list, board-list The crossbar of the selected Splash 2 array is configured based on the configuration file. Fill_Memory unit-list, board-list, pe-list The memory associated with the specified PEs are filled with the values specified. List_PE unit -list List the current symbol table associations for all the PEs specified in the unit list. 6.10 Results Based on the suggested approach discussed in chapter 5, several communication and com- putation algorithms are discussed in this section. These algorithms are essential for the implementation of the bar code identification using the multi-channel filtering technique. The PE reads the crossbar configuration data and stores the value in the local RAMs. The data are passed onto the right neighbor for additional computation. This algorithm allows debugging via the RBus from X16 (see Figure 4.1). The data tranfer paths and VHDL code for the crossbar algorithm are shown in Figures 6.2, 6.3, 6.4 and 6.5. This algorithm provides an estimate of the execution speed for lookup table based calculations. In addi- tion, this algorithm can be used for transferring data among PEs via the crossbar configu- ration during the lookup table calculation for classification of image pixels using a 89 perceptron as described in Chapter 5. S‘BUS Read RD —’ 256 x16 RD . Memory WR S-Bus Wnte rWR S B 32 ' '16 - us Address -/ 2', D... S-Bus 32 I Data 'l‘ J? 4 Processor Inhibit ’ Processin . To Left 36 Element g To l-hght [ 3. 36 'Ib Crossbar Figure 6.2 Data Transfers from the Processing Element. (Dotted lines indicate the data transfer paths.) 90 type States is (Init, One, Two, Three); signal State signal Data signal Address signal Xbar_hold signal Xbar_in : STATES := INIT} : Bit_Vector (15 downto 0); : Bit_Vector (I5 downto 0); : Bit_Vector (35 downto 0); : Bit_Vector (35 downto 0); begin XP_Left <= Tristate (XP_Left); XP_Right <= Tristate (XP_Right); XP_Int <= ‘0’, XP_GOR_Result <= ‘0’; XP_GOR_Valid <= ‘0’; process variable Counter: Integer begin WAIT until XP_Clk’event AND XP_Clk = ‘1 ’ Pad_Output (XP_Mem_D, Data); Pad_Output (XP_Mem_A, Address); XP_Mem_RD_L < = ’1 ’,' XP_Mem_WR_L <= ’0 ’; case State is When Init => Xbar_hold <= Xbar_in; Address (I downto 0) <= “”;00 Adress (1 7 downto 2) <= itobv (counter; 16); if (Xbar_in /= “000000000000000000000000000000000000”) then Counter := Counter + I; if (XP_HSI = ‘0’) then State <= One; else Counter := 0; end if; end if; Figure 6.3. PE to Crossbar Declaration. 91 when One => Address (1 downto 0) <= “00”; Data (15 downto 0) <= Xbar_hold (15 downto 0); State <= Two; when Two => Address (I downto 0) <= “01”; Data (15 downto 0) <= Xbar_hold (31 downto 16); State <= Three; When Three => Address (1 downto 0) <= “10”; Data (15 downto 0) <= “ 000000000000” & Xbar_hold (35 downto 32); State <= Init; end case; ——connect the crossbar registers Pad_Input (XP_Xbar; Xbar_In); --enab1e crossbar output if HSO is low if (XP_HSO = ‘1 ' then XP_Xbar_En_L <= “11111 ”; XP_LED <= ‘1 ’; else XP_Xbar_En_L <= “00000”; XP_LED <= ‘0 ’; end if; end process; end Xbar_to_mem; Figure 6.4. Crossbar Processing Element XP_XBAR_SLAVE Architecture. 92 Input XNF Design Statistics (refer Note 1) Number of Logic Symbols: 189 Number of Flip Flops: 167 Number of 3-State Buffers: 0 Number of IO Pads: 87 Number of Hard Macros: 0 Number of Nets: 465 Number of Pins: 1448 Equivalent “Gate Array” Gates: 1344 (refer Note 2) - From Logic: 1344 - From Random Access Memories: 0 - From Read Only Memories: 0 Partitioned Design Utilization Using 4010 (refer Note 3) Occupied CLBs (refer Note 4): Used=79 Max:400 Packed CLBs (refer Note 4): Used=45 Max:400 Package Pins (refer Note 5): Used=85 Max:160 FG Function Generators: Used=91 Max:800 H Function Generators: Used=28 Max:400 Flip Flops (refer Note 5): Used=l65 Max=1120 Memory Write Controls: Used=0 Max:400 3-State Buffers: Used=0 Max:960 3-State Buffer Output Lines: Used=0 Max=80 Address Decoders: Used=0 Max=32 Routing Summary Number of unrouted connections: 0 Number of pips used: 1563 Number of local lines used: 505 Number of double lines used: 289 Number of long lines used: 172 Number of global lines used: 21 Number of decoder lines used: 0 Util=l9% Util=1 1 % Util=5 3% Util=1 1 % Util=7% Util= 14% Util=0% Util=0% Util=0% Util=0% Figure 6.5. Results for the XP_XBAR_SLAVE Obtained from PPR. 93 6.10.1 PPR Summary Explanatory Notes This section contains the explanatory notes for the entries and values which are presented on the previous page (see Figure 6.5). Note 1 : The first section gives statistics about the design which served as input to PPR. The data file used to compute these statistics is PPR’s internal net list file which is derived from (and closely related to) the input XNF file. There is one case where the internal file may be very different from the input XNF file: if the XNF file contains references to other XNF files, PPR automatically merges the lower level net list into the top level net list. The internal net list file therefore represents a FLAT version of the design. Note 2 : The value listed for the number of “gate array” gates is an estimate which is provided as a reference. It is based on the symbols (AND, OR, FDRD, etc.) in the internal net list file. The value is computed by adding the equiv- alent gate counts of each symbol. The symbol gate counts are derived from gate array data book [1]. Since the number of “gate arrays” is based on the internal net list file. This includes the contents of any hard macros which are used in the design. As shown in Figure 6.5, there are no hard macros used for the implementation. 94 Note 3 : The second section gives statistics related to the design after it has been partitioned. The utilization of the LCA by the design is dependent on the LCA resource of interest. Note 4 : The CLB utilization is the most common and general utilization statistic specified. The “Occupied CLB” utilization is obtained by counting the CLBs that are not utilized (empty CLBs). The value includes the feedthroughs that are inserted into the design by the router as being part of the design.The “packed CLB” utilization is computed by counting the number of function generators and flip flops in the design, dividing each by two, and then taking the maximum of the two results. The packed CLB value represents the LCA utilization for the case in which the design has been packed into as small an area as possible. PPR does not pack the design CLBs tightly but instead spreads them out over all available CLBs. This may however result in the con- dition in which a CLB may not be fully utilized. Further packing of the CLBs is possible by manual partition, place and route of subsections of the design. The packed CLB utilization is a measure of how much more logic can be included in the design and can be fitted on an LCA. 95 Note 5 : Logic trimming that occurs before the partitioning process results in the removal of pads if they connect to a loadless or sourceless net. Flip Flops are removed if one of the following conditions are valid: (1) The output pin (Q) has no load connected. (2) The data pin has no load connected. (3) The data pin is attached to GND. (4) The data pin is attached to Vcc. (5) The clk pin is connected to no source, or it is attached tchc or GND. (6) The reset pin is attached to Vcc. (7) The set pin is attached to Vcc. Flip Flops may also be deleted if they are part of the sequential logic loop which feeds itself and no other symbols. Based on the results obtained from PPR only 11% of the CLBs are packed for the imple- mentation of the XBar algorithm. By including hard macros into the algorithm, it will be possible to pack more CLBs and also free the total number of pins utilized for the design. Thus a PE (XC4010), can accommodate additional logic configurations to perform data handling operations. The wireload model results obtained from PPR indicate that the algorithms execute at speeds up to 12 MHz using soft macros. A hard macro for the building blocks, namely, the integer counter and ripple-carry adder; would improve the execution speeds significantly. 6.10.2 Chip Pinout Description for the XBar Algorithm This section describes the location of the design’s pins in terms of the package pins. The 96 list is sorted by the Xilinx PPR based on the location of the package pin. Table 3: Chip Pinout Description. Rage Pin Name Pamge Pin Name A 16 XP_XBAR_EN_L_4_PAD F1 XP_MEM_A_5_PAD A17 XP_XBAR_EN_L_3_PAD F2 XP_MEM_A_6_PAD B1 XP_XBAR_EN_L_2_PAD F3 XP_MEM_A_7_PAD 132 XP_HSl_PAD F16 XP_XBAR_29_PAD B 16 XP_HSO_PAD F17 XP_XBAR_28_PAD B 18 XP_XBAR_EN_L_1_PAD F18 XP_XBAR_27_PAD C1 XP_MEM_WR_L_PAD Gl XP_MEM_A_8_PAD c2 XP_MEM_RD_L_PAD 62 XP_MEM_A_9_PAD c 17 XP_XBAR_EN_L_0_PAD G 17 XP_XBAR_26_PAD C18 XP_XBAR_35_PAD 618 XP_XBAR_25_PAD D1 XP_MEM_A_0_PAD H1 XP_MEM_A_10_PAD D2 XP_MEM_A_1_PAD H2 XP_MEM_A_11_PAD D17 XP_XBAR_34_PAD H3 XP_MEM_A_12_PAD D18 XP_XBAR_33_PAD H 16 XP_XBAR_24_PAD E1 XP_MEM_A_2_PAD H 17 XP_XBAR_23_PAD 132 XP_MEM_A_3_PAD H18 XP_XBAR_22_PAD E3 ' XP_MEM_A_4_PAD 11 XP_MEM_A_13__PAD E16 XP_MEM_A_32_PAD 12 XP_MEM__A_14_PAD 1317 XP_MEM_A_31_PAD J3 XP_MEM_A_15_PAD E18 XP_MEM_A_30_PAD J 16 XP_XBAR_21_PAD 97 Table 3: (continued) Pacplifige Pin Name Pagiige Pin Name J 17 XP_XB AR_20_PAD N 18 XP_XBAR J 18 XP_XBAR_19_PAD P l XP_MEM_D_8_PAD Kl XP_MEM_A_16_PAD P2 XP_MEM_D_9_PAD K2 XP_MEM_A_I 7_PAD P3 XP_MEM_D_10_PAD K3 XP_LED_PAD P16 XP_XBAR_7_PAD K16 XP_XBAR_18__PAD P17 XP_XBAR_6_PAD K17 XP_XBAR_17_PAD P18 XP_XBAR_5_PAD K18 XP_XBAR_16_PAD R l XP_MEM_D_] l__PAD Ll XP_MEM_D_0_PAD R2 XP_MEM_D_12__PAD L2 XP_MEM_D_1_PAD R17 XP_XBAR_4_PAD L3 XP_MEM_D_2_PAD R18 XP_XBAR_3_PAD L16 XP_XBAR_15_PAD Tl XP_MEM_D_13_PAD L17 XP_XBAR_14_PAD T2 XP_MEM_D_14_PAD L18 . XP_)GBAR_13_PAD T3 XP_MEM_D_15_PAD Ml XP_MEM_D_3_PAD T4 XP_MEM_DIS_PAD M2 XP_MEM_D_4_PAD T16 XP_GOR_VALID_PAD M17 XP_XBAR_12_PAD T17 XP_XBAR_2_PAD M18 XP_XB AR_l I_PAD T18 XP_XBAR_1_PAD N1 XP_MEM_D_5_PAD U1 XP_CLK_PAD N2 XP_MEM_D_6_PAD U 16 XP_GOR_RES_PAD N3 XP_MEM_D_7_PAD U18 XP_XBAR_0_PAD N16 XP_XBAR_10_PAD V2 XP_INT_PAD N17 XP_XB AR_9_PAD 98 6.11 Additional Results of Simulations The following are basic algorithms for data handling, namely: ° Selecting the crossbar configuration based upon the SIMD tag as shown in Figure 6.7. ° Feeding bits of data from the left port of Splash 2 to the Crossbar Port (left_to_Xbar). Data are latched into the left port on the first clock cycle. In the next clock cycle the data are latched into the crossbar port. The architecture declaration is shown in Figure 6.6. These algorithms are an essential part of the suggested approach discussed in Chapter 5 since crossbar configurations and the data paths are changed based on the data obtained from the PBS for computation of intermediate steps (see Figures 5.1, 5.6 (a) and 5.6(b)). architecture Left_to_XBar of Processing_Element is signal PassThru: Bit_ Vector (Datapath_width -I downto 0); signal Lefi: Bit_ Vector (Datapath_ width -1 downto 0); signal Right: Bit_ lector (Datapath_ width -1 downto 0); signal Madr, Nadr - .Bit_ Vector (MAR_ RANGE -I downto 0); signal Mdata_out : Bit_ Vector (MEM_WIDTH— -1 downto 0); signal Mdata_in : Bit_ Vector (MEM_WID7H -I downto 0); signal erite : Bit; signal Mem_Data : Bit; begin Mem_Access: process begin WAIT until XP_Clk ’Event and XP_Clk = ‘1 Pad_Output (XP_Mem_A, Madr); Pad_Input (XP_Mem_D, Mdata_out, Mdata_in, Mem_Data); Pad_Output (XP_Mem_WR_L, erite); erite < = ‘1 Mem_Data < = ‘0’; if (lefi (35) = ‘1'):hen erite < = ‘0’; Mem_Data < = ‘1 ’; Madr < = Nadr; Nadr <= Nadr + I; Mdata_out < = Lefi (15 downto 0); end if end process; process begin WAIT until XP_CLK ’Event and XP_Clk = Pad_Output (XP_XBar; Lefi); Pad_Input (XP_Left. Lefl); end process; XP_Right < = Tristate (XP_Right); XP_Mem_RD_L < = ‘1 ’; XP_HSO < = ‘2 ’,° XP_HSI < = ‘Z XP_GOR_ResuIt < = ‘Z ’; XP_Int < = ‘0 ’; XP_fiar_EN_L < = “1111 ”; end Lefi_to_XBar; Figure 6.6. Left_to_XBar Architecture Declaration. 100 type States is (Init, One, TWO, Three); signal State signal Data signal Address signal xbar_hold signal Xbar_in begin XP_Lefl XP_Right XP_Jnt XP_GOR_ResuIt XP_GOR_ lblid process : States := Init; : Bit_ vector (15 downto 0); : Bit_ Wetor (15 downto 0); : Bit_ lecmr (35 downto 0); : Bit_ “ram (35 downto 0); < = Tristate (XP_Lefi); < = Innate (XP_Right); <= ‘0’; <= (0”, <= «0:; variable Counter : Integer begin case State is when Init = > WAIT until XP_Clk ’event AND XP_Clk = ‘1 ’ Pad_Output (XP_Mem_D, Data); Pad_Output (XP_Mem_A, Address); XP_Mem_RD_L < = ’1 XP_Mem_WR_L <= ’0’; Xbar_hold < = main; Address (I downto 0) < = “00 ”,' Adress (I 7 downto 2) < = itobv (counter; 16); If (Xbar_in /= “W ”) THEN Counter := Counter + I; if (XP_HSI = ‘0’) then state < = One; else Counter := 0; end if end if- Figure 6.7. Processing Element to Crossbar based on SIMD Tag Declaration. when One => Address (1 downto 0) Data (15 downto 0) State when Two => Address (I downto 0) Data (15 downto 0) State when Three => Address (1 downto 0) < = Data (15 downto 0) < = State < = end Case; X0_XB_Data <= X0_Mem_A <= X0_Mem_D <= X0_Mem_RD_L <= X0_Mem_WR_L <= X0_GOR_Result_In <= X0_GOR_Valid_In <= X0_GOR_Result <= X0_GOR_Valid <= X0_Int <= X0_Broadcast_0ut end X 0_XBAR_C ONT R OL; 101 < = “00 n; <= xbar_hold (15 downto 0); <= Two; < = “01 n; < = xbar_hold (31 downto 16); <= Three; u 10 n; “ 0000000000 Init; ” & xbar_hold (35 downto 32 ); Tristate (X0_SIMD); Tristate (X0_XB_Data); Tristate (X0_Mem_A); «I r; :1 r; “ZZZZZZZZZZZZZZZZ ”; “ZZZZZZZZZZZZZZZZ ”; ‘0 r; :0 9,, (0 r; 40 9,, Figure 6.7. (continued) 102 6.12 Summary This chapter provides a detailed explanation as to how an application can be mapped onto the reconfigurable architecture, Splash 2. In addition results of basic communication and computation algorithms are provided. Chapter 7 Conclusions and Future Research Every man with an idea has at least two or three followers Brooks Atkinson Bar code identification has been an active research topic in computer vision for almost two decades and has been successfully implemented on application specific hardware. How— ever, identification of the bar codes fiom the image that includes graphics and text can be time consuming and complex task. The multi-channel filtering technique applied to identi- fication of bar code images is a more general approach in comparison to the other approaches and is applicable to a diverse class of images. This approach is general owing to the fact that the technique relies on frequency (size) and the orientation, two basic attributes of any image. In this thesis, the major contributions are: 1. A detailed study of the architecture of a custom computing machine Splash 2. 2. A detailed study of the programming environment of Splash 2 and the steps involved in programming an application on Splash 2. 3. A suggested approach to mapping the multi-channel filtering technique for bar code identification onto Splash 2 PBS. 4. A detailed investigation of some of the fundamental communication and computation algorithms that are essential for implementation of the suggested approach, including performance results and partitioning results obtained on Splash 2 architecture. 103 104 One limitation of the suggested approach is the inability to perform all the steps involved in multi—channel filtering on Splash 2. Data from the feature images can be classified more effeciently (at a faster rate) when intermediate results are computed on the host Sparcsta- tion. The multi-channel filtering technique can be applied to textured images only [15]. Uni- form (almost) gray level regions cannot be discriminated using this technique. Images, as they occur in the real world, contain both textured regions as well as regions with nearly uniform gray levels. In Chapter 5, an approach to implementing bar code identification is recommended. This approach does not suggest routing intermediate data, between steps, to the host Sparcsta- tion for computation. Computation of intermediate data on the host is suggested to obtain a comparative study in terms of execution speed. Verification of the correctness of the Splash 2 implementation can be performed by com- paring it with a sequential implementation on a Sparcstation. Timing results will be gener- ated as part of completing the suggested implementation. Some bar code image boundaries are enhanced to a greater extent in comparison to others by the process of adding magnitude responses. It is suggested that an adaptive threshold- ing technique be investigated instead of the current universal or global thresholding tech- nique applied during multi-channel filtering. 105 The implementation of the dense field programmable gate arrays, replacing the the exist- ing Xilinx XC4010 series, would enhance Splash 2 performance. In addition, hardware components like rapid programmable interconnects should be investigated and incorpo- rated into the next version of Splash 2 to improve both the connectivity among PBS and reduce the capacitative delays that affect the rate of data transfer from and to the PEs. 106 Appendix Xilinx’s FPGAS, also known as Logic Cell Arrays (LCAs) [39], are high-density user-pro- grammable gate arrays. The major configurable elements of the FPGAS are: configurable logic blocks (CLBs), input/output blocks (IOBs), and interconnections.The CLBs provide the functional elements to construct the logic based on the user’s application.The IOBs provide the interface between the package pins and the internal signal lines. The program- mable interconnects provide a routing path to connect the inputs and outputs of the CLBs and IOBs into desired configurations. Internal static memory cells are programmed based on the application. These cells determine the logic functions and interconnections imple- mented on the LCAs. Configurable Logic Blocks The configurable logic block of the Xilinx 4000 family contains three combinatorial func- tion generators, two flip-flops, and their associated control logic. The powerful and flexi- ble CLBs are surrounded by a dense set of routing resources as shown in Figure A.1. The function generators and flip-flops are accessed by 13 CLB inputs and 4 CLB outputs. These inputs and outputs can also connect to programmable interconnect resources out- side the CLB. Four independent inputs are provided to each of the two function generators (Fl-F4 and Gl-G4). The function generators F and G, are capable of implementing any Boolean logic function with 4 inputs. The function generators are implemented as look-up 107 tables, allowing for the generation of any arbitrary function. The propagation delay through the function generator is independent of the function being implemented. A third function generator, labeled H’ as shown in Figure A.l, can implement any Boolean func- tion with three inputs: F’ and G’ and a third input (H1) from outside the block. Signals from the function generators can exit the CLB at two outputs: F or G. The output F is either F’ or H’ and the output G is either G’ or H’. Thus the CLB can be used to imple- ment either two independent functions of upto 4 variables, any single function of 5 vari- ables, or some single functions upto 9 variables. Thus, implementing a large number of functions in a single block, reduces the total block utilization and the length of the signal path. These parameters contribute towards increased logic density and speed. T i T Dill 80! EC ] .._ 1 L T .._ —l~ J_. 0t-1 ‘ > up?) > Figure A. 1. Simplified Block Diagram of XC 4000 Configurable Logic Block. 108 The two storage elements in the CLB are edge-triggered D-type flip-flops with common clock (K) and clock enable (EC) inputs. A third common input (SIR) can be programmed as either an asynchronous set or reset signal independently for each of the two registers. The S/R input can also be disabled for either flip-flop. A separate global set/reset line (not shown in Figure A.1), sets or clears each register during power-up, reconfiguration, or when a dedicated RESET net is driven active. (The RESET net can be connected to any package pin, providing a global reset input). Either polarity of the clock signal can be selected, allowing each flip-flop to be rising —edge or falling -edge triggered. The source of a flip-flop’s data input also is programmable: the functions F’, G’, and H’, and the Direct In (DIN) block input are the options. The flip-flops’ outputs can exit the block via the Q1 and Q2 CLB outputs. In addition, each CLB contains dedicated arithmetic logic for the fast generation of carry and borrow signals, greatly increasing the efficiency and performance of applications such as comparators and ALUs. This logic is used in conjunction with the F’ and/or G’ function generators to implement high-performance arithmetic functions (see Figure A2). The CLB’s two carry-in inputs (Cin) and a carry-out (Cout) are propagated on dedicated rout- ing lines that connect neighboring CLBs. The carry-out of one CLB connects to the carry- in of the adjacent CLBs above and below it in that column. At the top and bottom rows of the chip, the carry-out signal propagates to the neighboring CLB to the right. Thus, chains signals can wind up and down columns of CLBs in a serpentine manner to efficiently implement very wide adders or similar functions. 109 Multiplexers in the CLB allow its four control inputs, labeled C1 through C4 in Figure A.l, to be mapped one-to-one into the CLBs four internal control signals (H1, DIN, S/R, and EC) in any arbitrary manner. The flexibility and symmetry of the CLB architecture facilitates the placement and routing of a given application. Since the function generators and registers have independent inputs and outputs, each can be treated as a separate entity during placement to achieve high packing densities. Inputs, outputs, and the functional themselves can freely swap positions within a CLB during placement and routing operations. CLBs and On-Chip Memory In addition to providing unprecedented speed and performance, the XC4000 family LCAs are the first programmable logic devices to include on-chip static memory resources, fur- ther increasing system integration levels. An optional mode for each CLB allows the memory look-up tables in the F’ and G’ func- tion generators to be used as either a 16 x 2 or 32 x 1 bit array of read/write memory cells ( see Figure A.3). The F1-F4 and Gl-G4 inputs to the function generators act as address lines in this configuration, selecting a particular memory cell in the function generator’s look-up table. The functionality of the CLBs control signals change in this configuration; the H1, DIN, and SIR lines become the D1, D0 and Write Enable (WE) inputs, respec- tively. When the 32 x 1 configuration is selected, Fl-F4, G1-G4, and D1 act as address bits 110 and D0 is the data input during writes. In the 16 x 2 orientation, F1-F4 and G1-G4 are the address bits, and D0 and D1 are data inputs during write operations. The contents of the memory cell(s) being addressed are available at the F’ and G’ function generator outputs, and can exit the CLB through its F and G outputs. c1 (2 a l l l l [we or no a: l l L. I -_D .._. .. or". a 03—1 0' mm m ' , a. Let I u I! can :: RIG?!" 1— m..— __w -I Figure A.3. CLB’s Function Generators being used as Read/Write Memory Cells. 111 Configuring the CLB’s function generators as read/write memory does not affect the func- tionality of the other portions of the CLB, with the exception of redefinition of the control signals. The H’ function generator can be used to implement Boolean functions of F’, G’, and D1, and the D flip-flops can latch the F’, G’, H’, or D0 signals. Groups of CLBs can be used together to form a memory array of the desired length and width. This capability is useful for a large range of applications, including FIFO buffers and large state machines. Input/Output Block User-configurable I/O Blocks (IOBs) provide the interface between external package pins and the internal logic (see Figure A.4). Each IOB controls one package pin and can be defined for input, output, or bidirectional signals. Two paths, labeled 11 and 12, bring input signals into the array. Inputs are routed to an input register that can be programmed as either an edge-triggered flip-flop or a level-sensi- tive transparent latch. Optionally, the data input to the register can be delayed to compen- sate for the delay on a clock signal that first passes through a global buffer before arriving at the IOB, thereby eliminating any hold time requirement on the data at the external pin. The 11 and 12 signals that exit the block can each carry either the direct or registered input signal. Output signals can pass directly to the pin or be stored in an edge-triggered flip-flop. 112 in 137 t lt_.§§ _ "ulcer" t fl- 3: 1f --------------——-———---—— Figure A.4. Block Diagram of the XC 4000 Input/Output Block. Figure A.5. Typical CLB Connections to Adjacent Single-length Lines. 113 Optionally, an output enable signal can be used to place the output buffer in a high-imped- ance state, allowing three-state outputs or bi-directional pins. The output (OUT) and out- put enable (OE) signals can be inverted, and the slew rate of the output buffer can be controlled to minimize power bus transients when switching non-critical signals. The out- put buffers are capable of sinking upto 12 mA. Two adjacent output buffers can be “wire_ANDed” externally to sink upto 24 mA of current. There are a number of programmable options in the IOB. Programmable pull-up and pull- down resistors are useful for tying unused pins to Vcc or ground to prevent unnecessary power consumption. Separate clock signals are provided for the input and output registers. These clocks can be inverted, allowing either falling-edge or rising-edge triggered flip- flops. As with the CLB’s registers, a global set/reset signal can be used to set or clear the input and output registers whenever the RESET net is active. Embedded logic attached to the IOBs contains test structures compatible with IEEE Stan- dard 1149.1 for Boundary scan testing, permitting easy chip and board-level testing of LCA-based applications. Programmable Interconnect All internal connections are composed of metal segments with programmable switching points provided to implement the desired routing. Routing is scaled to the size of the array, i.e., the number of routing channels varies among the devices in the family, increasing with the array size. 114 In previous generations of LCAs, the logic inputs were located on the top, left, and bottom of the block, and outputs exited the block on the right, favoring left-to-right data flow through the device. For the third generation, XC 4000 family, the CLBs inputs and outputs are distributed on all four sides of the block, providing additional routing flexibility (see Figure A.5). In general, the entire architecture is more symmetric and regular than the pre- vious generations, and is more suited to well-established placement and routing algo- rithms developed for conventional mask-programmed gate array design. There are three main types of interconnects. characterized by the relative length of their segments: single-length lines, double-length lines, and long lines. The routing scheme var- ies with the size of the array and is designed to minimize the resistance and capacitance of the average routing path, resulting in significant performance improvements. The single-length lines are a grid of horizontal and vertical lines that intersect at a “Switch Matrix” between each block. Figure A.5 illustrates the single-length interconnect lines surrounding one CLB in the array. Each Switch Matrix consists of programmable n-chan- nel pass transistors used to establish the desired connections between the single-length lines; for example, a signal entering on the right side of the Switch Matrix can be routed to a single-length line on the top, left, or bottom sides, or, any combination thereof if multiple branches are required (see Figure A.6). Single-length lines are normally used to conduct signals within a localized area and to provide the branching for nets with fanout greater than one. 115 Figure A.6. Single-length Line Switch Matrix Connections. Jr. fit .r. _ L‘ 3‘ qua} CLB CLB njxr 9% if: ..r L Lt- {J 3 ft; >4 3 rJ—x x ‘v-l L— 131 T13 u I? . it: Klmij Figure A.7. Double-length lines in CLB Array 116 The number of possible connections through the switch matrix has been streamlined com- pared to the previous generations of the LCA architectures, in order to reduce capacitative loading to minimize routing delays and increase performance. However, a much richer set of connections between the single-length lines and the CLB’s inputs and outputs more than compensate for this reduction in switch matrix options, resulting in overall increased “routability”. The function generators and control inputs to the CLB (Fl-F4, Gl-G4, and C1-C4) can be driven from any adjacent single-length line segment (Figure A5). The CLB’s clock (K) input can be driven from one-half of the adjacent single-length lines. Each CLB output can drive several of the single-length lines, with connections to both the horizontal and vertical planes. As implied by their name, the double-length lines consist of a grid of metal segments twice as long as the single-length lines. As shown in Figure A.7, a double-length line runs past two CLBs before entering a switch matrix. These lines are grouped in pairs, with their switch matrices staggered so that each member of the pair has a switch matrix nearby alternating CLBs in that row or column of the array. As with single-length lines, all the CLB inputs except K can be driven from any adjacent double-length line, and each CLB output can drive nearby double-length lines in both the vertical and horizontal planes. Double-length lines provide the most-effective implementation of intermediate length, point-to-point interconnections. Long lines are a grid of metal interconnect segments that run the entire length or width of the array as shown in Figure A.8. Additional long lines in the vertical plane can be driven 117 only by special global buffers designed to distribute clocks and other high fanout control signals throughout the array with minimal skew. Long lines are intended for high fan-out, time-critical signal nets. Each vertical long line has a programmable “splitter switch” at its center, allowing the long line to be used as two independent routing channels that each run half the height of the array. CLB inputs can be driven from a subset of the adjacent long lines; CLB outputs are routed to the long lines via three-state buffers or the single-length interconnect lines. Programmable interconnect points allow signals to pass between the long lines and some single-length lines at their intersections. However, there are no routing corrections between the double-length lines and the other types of interconnect. Three-state Buffers Associated with each CLB in the array are a pair of three-state buffers that can be used to drive signals onto the nearest horizontal long lines above or below the block. The three- state buffer’s input can be driven from nearby single-length lines; the buffer’s enable can come from nearby vertical single-length lines or long lines. A three-state buffer with simi- lar connectivity is located near each I/O block along the right and left edges of the array. These buffers can be used to implement multiplexed or bi-directional busses on the hori- zontal long lines. Programmable pull-up resistors attached to these long lines facilitate their use for wide wired-AND logic functions Special long lines running along the perimeter of the array can be used to “wire-AND” 118 signals input through nearby 1085, for applications such as the fast decoding of wide address busses entering the array. Figure A.8. Long Line Routing Resources with Typical CLB Connections. [1] [21 [3] [41 [5] [6] [7] [8] [9] [10] [11] [12] 119 Bibliography The Programmable Gate Array Data Book, Xilinx Inc., 1992 XAC T Development System Reference Guide, Xilinx Inc., January 1993. Xilinx Synopsys Interface User Guide, Xilinx Inc., July 1993. Xilinx Synopsys Interface, Preliminary notes, Xilinx Inc., December 1993. Freeman, Ross, “User Programmable Gate Arrays”, IEEE Spectrum, pp. 32-35, December 1988. RPM Emulation System, Quickturn Systems, Mountain View, CA 1990. Walters, S., “Prototyping ASICs in reprogrammable hardware meets system require- ments”, Computer Design, pp. 76, May 1990. Gokhale, M., Holmes, W., Kopser, A., Lucas, 8., and Minnich, R., Sweely,D.,‘ and Lopresti, D., Building and Using a Highly Programmable Logic Array, IEEE Com- puter, 24, pp. 81 - 89, January 1991. Arnold, J. M., Buell, D. A., and Davis, E. G., Splash 2, Proc. 4th Annual ACM Symp. on Parallel Algorithms and Architectures, pp. 316 -322, 1992 Arnold, J. M., The Splash 2 Programming Environment, Proc. of IEEE Workshop on FPGAs as Custom Computing Machines, April 1993. Arnold, J. M., and McGarry, Margaret A., Splash 2 Programmer’s Manual, SRC Technical Report SRC-TR-93-107, September 1993. Gokhale, M., Kopser, A., Lucas, 8., and Minnich, R., The Logic Descriptor Genera- tor, SRC Technical Report SRC-TR-90-Oll, 1990. [13] [14] [151 [16] [17] [18] [19] [20] [211 [22] 120 Erdogan, S. S., VHDL Modeling and Simulation of the Back-Propagation Algorithm and its mapping to the RM, IEEE 1993 Custom Integrated Circuits Conference, 1993. Pavlidis, Theo, Swartz, J ., et. al., “Fundamentals of Bar Code Information Theory”, IEEE Compter, pp. 74-86, 1990. Farrokhnia, F., Multi-channel Filtering Techniques For Texture Segmentation and Surface Quality Inspection, Ph. D. Thesis, Dept.of Electrical Engg., Michigan State University, 1990. Wong, K. Y., Casey, R. G., Wahl, F. M., Document analysis system, IBM J. Research and Development, 6:642 - 656, November 1982. Jain, A. K., Bhattacharjee, S. K., and Chen, Yao, Text Segmentation Using Gabor Filters for Automatic Document Processing, Machine Vision and Applications, 5:169 - 184, 1992. Jain, A. K., Chen, Yao, Bar Code Localization using texture analysis, Proc. 2nd Int. Conf on Document Analysis and Recognition, pages 41 - 44, Ibaraki, Japan, October 1993. Gabor, D., Theory of communication, J. Inst. Elect. Engr, 93:429 - 457, 1946. Pryor, D., Thistle, M., and Shirazi, N., Text Searching on Splash 2, Proc. of IEEE Workshop on FPGAs as Custom Computing Machines, April 1993. Gupta, R. K., and Micheli, Giovanni De, Hardware - Software Cosynthesis for Digi- tal System, IEEE Design and Test of Computers, pp. 29 - 41, September 1993. Berthet, C., Rampon, J., and Sponga, L., Synthesis of VHDL Arrays on RAM Cells, IEEE Transactions, pp. 726 - 731. 121 [23] Lis, Joseph S., Gajki, Daniel D., Synthesis from VHDL, IEEE Transactions, pp. 378 - 381, November. [24] Wang, Kuochen, and Kuo, Sy-Yen, Computer - Aided Modeling and Evaluation of Reconfigurable VLSI Processor Arrays with VHDL, IEEE Transactions on Com- puter-Aided Design, no. 2, pp. 185- 197, February 1992. [25] Nagaswamy, Vijay, Berry, Neerav, and Dangelo, Carlos, Specification, Planning, and Synthesis in a VHDL Design Environment, IEEE Design and Test of Computers, pp. 58 - 68, June 1992. [26] Brown, Stephen, Dean, Routing Algorithms and Architectures for F held-Programma- ble Gate Arrays, Ph.D. dissertation, University Of Toronto, Canada, January 1992. [27] Brown, Stephen, Rose, J, Vranesic, 2., G., A Detailed Router for Field-Programmable Gate Arrays”, Proc. IEEE International Conference on Computer Aided Design, pp. 382-385, November 1990. [28] Francis, R., J., Rose, J ., and Vranesic, Z., “Chortle: A Technology Mapping Program for Lookup Table-Based Field-Programmable Gate Arrays”, Proc. 27th Design Automation Conference, pp. 613-619, June 1990. [29] Rose, J., 8., Francis, R., J., Chow, P., and Lewis, D., “The Effect of Logic Block Complexity on Area of Programmable Gate Arrays”, Proc. 1989 Custom Integrated Circuits Conference, pp. 5.3.1-5.3.5, May 1989. [30] Rose, J ., 8., Francis, R., J., and Chow, P., “Architecture of Programmable Gate Arrays: The Effect of Logic Block Functionality on Area Efficiency”, IEEE Journal of Solid State Circuits, Vol. 25, No. 5, pp. 1217-1225, October 1990. [31] Singh, 8., Rose, J ., Lewis, D., Chung, K., and Chow, P., “Optimization of Field Pro- [32] [33] [34] [35] [36] [37] [38] 122 grammable Gate Array Logic Block Architecture for Speed”, Custom Integrated Circuits Conference, 1991. Perry, Douglas, VHDL, McGraw Hill, Inc., 1991. Harr, E. Harr, Alec G. Stanculescu, Applications of VHDL to Circuit Design, Kluwer Academic Publishers, 1991. Mazor, Stanley, and Patricia Langstraat, VHDL, Synopsys, Inc., Kluwer Academic Publishers, 1992. Navabi, Zainalabedin, VHDL - Analysis & Modeling of Digital Systems, McGraw Hill, Inc., 1993. J ayaraman, Bhasker, A VHDL Primer, Prentice Hall Series in Innovative Technol- ogy,1992. Kung, S. Y., Digital Neural Networks, Prentice Hall, 1993. Dhameja, S. and Rover, D. T., Mapping Compute- and Data-Intensive Algorithms onto a FPGA Based Custom Computing Machine, Canadian Microelectronics Cor- poration Conference on Field Programmable Devices, June 1994. HICHIGQN STATE U 1 [11111111] 1111111 rrrrr II 3014