This is to certify that the thesis entitled

Computational Design Alternatives with Microprocessor-Based Systems

presented by

Sigurd Leland Lillevik

has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering.


COMPUTATIONAL DESIGN ALTERNATIVES WITH MICROPROCESSOR-BASED SYSTEMS

By

Sigurd Leland Lillevik

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering and Systems Science

1978


ABSTRACT

COMPUTATIONAL DESIGN ALTERNATIVES WITH MICROPROCESSOR-BASED SYSTEMS

By Sigurd Leland Lillevik

For a microprocessor-based system, the computational section reduces to combinations of four elemental computational design alternatives (CDA's). The described research examines and characterizes these CDA's and develops a structured approach to computational section design which incorporates a rigorous, theoretic foundation.

The DIRECT CDA incorporates a single microprocessor (uP) and memory, so it executes adds and multiplies in software. Next, the second CDA contains a uP, memory, and arithmetic unit (AU). This AU CDA, then, performs multiplies in hardware. For the third, a uP, memory, and calculator chip (CALC) comprise the CALC CDA, and it executes adds and multiplies with the CALC chip. Finally, several uP's and memories in a Master/Slave arrangement implement the multiple-uP (muP) CDA. It applies the concept of parallel execution, but performs all adds and multiplies in software (like the DIRECT).

A common set of attributes facilitates comparison of CDA's. Using such attributes, Multiattribute Utility Theory (MUT) assigns a numeric quantity, the utility, to represent each CDA's usefulness.
Thus, design involves selecting the CDA with the greatest utility. When attributes satisfy eight axioms plus utility and preferential independence definitions, utility assessment embodies an additive utility function; i.e.,

    u(x) = Σ(i=1,N) k_i u_i(x_i) = utility
    x = (x_1, x_2, x_3, ..., x_N) = consequence
    N = number of attributes
    u_i(x_i) = marginal utility of attribute x_i
    k_i = scaling constants, where Σ(i=1,N) k_i = 1.

Application of MUT to three examples--linear regression, matrix inversion, and fast Fourier transform computations--illustrated the use of the additive utility function and led to CDA characterization. Here, a consequence consists of an ordered triplet whose components correspond to the precision, execution-time, and cost of a CDA; thus, N = 3. For instance, the first example produced the following marginal utilities: DIRECT = (60, 00, 100), AU = (00, 100, 53), CALC = (100, 19, 00), and muP = (33, 97, 62). The above function with weight vector k = (1/3, 1/3, 1/3) finds a weighted average of marginal utilities: DIRECT = 55, AU = 51, CALC = 40, and muP = 64; hence, the "best" design consists of the muP CDA.

In characterizing the CDA's, the examples revealed that the AU exhibited the fastest execution-time. Also, the muP's parallel execution surpassed the DIRECT's speed even though they both implement computations entirely in software. But this increase varies significantly with the type of example and the severity of the Master's overhead program. The CALC executes slower than the DIRECT or muP for single/double-precision, but faster for triple/quadruple-precision. All costs climb with greater problem dimensions due to increased data (not program) memory, yet the muP costs grow more rapidly because of repetitive hardware and software. Finally, costs vary little with changes in precision.

So designers will be able to synthesize more advanced uP-based systems through judicious use of these elemental CDA's, their characteristics, and MUT. Also, these characteristics indicate that new LSI devices should include enhanced computational features. With MUT methodology, developers can now examine and characterize future uP-based systems as technology advances.

ACKNOWLEDGMENTS

No single individual has influenced my professional growth and development more than Dr. P. David Fisher, my major professor. For the past several years he has guided and motivated my career through his exemplary enthusiasm, hard work, aggressiveness, and superior knowledge of the field. Mere words cannot express my appreciation. Perhaps I can best thank him by extending this type of friendship to others throughout my professional career. Also, I wish to express my gratitude to my committee members for their guidance of my doctoral program: Dr. S. R. Crouch, Dr. J. J. Forsyth, Dr. J. B. Kreer, and Dr. D. K. Reinhard. And to my wife, Sandi, I extend my appreciation for her typing of the dissertation.

TABLE OF CONTENTS

Chapter                                                           Page
I.   INTRODUCTION .................................................. 1
II.  CHARACTERISTICS OF COMPUTATIONAL DEVICES ...................... 4
     2.1 Microprocessors ........................................... 4
     2.2 Solid-state memories ...................................... 6
     2.3 LSI support chips ......................................... 8
     2.4 Summary ................................................... 9
III. COMPUTATIONAL DESIGN ALTERNATIVES ............................ 11
     3.1 Elemental CDA's .......................................... 11
     3.2 Attributes ............................................... 16
     3.3 Precision ................................................ 17
     3.4 Execution-time ........................................... 17
     3.5 Cost ..................................................... 28
     3.6 Summary .................................................. 31
IV.  MULTIATTRIBUTE UTILITY THEORY ................................ 33
     4.1 Multiattribute utility theory definitions ................ 33
     4.2 Utility existence axioms ................................. 35
     4.3 Additive utility functions ............................... 37
     4.4 Marginal utility functions ............................... 41
     4.5 Summary .................................................. 43
V.   APPLICATIONS ................................................. 44
     5.1 General approach ......................................... 44
     5.2 Linear regression example ................................ 48
     5.3 Matrix inversion example ................................. 67
     5.4 Fast Fourier transform example ........................... 76
     5.5 Additional design considerations ........................ 107
     5.6 Summary ................................................. 108
VI.  CONCLUSIONS ................................................. 114
REFERENCES ....................................................... 121

LIST OF TABLES

Table                                                             Page
2.1  Representative characteristics of computational devices ..... 10
3.1  Add and multiply execution-times for various CDA's .......... 32
3.2  Estimates of C_ei for various CDA's ......................... 32
5.1  Estimates of A_i, M_i (i ≠ 4) for Example-1 ................. 50
5.2  Estimates of A_4, M_4 for Example-1 ......................... 50
5.3  Estimates of P_i, D_i (i ≠ 4) for Example-1 ................. 56
5.4  Estimates of P_4, D_4 for Example-1 ......................... 56
5.5  Typical consequence set and utilities for Example-1
     (10^1 samples) .............................................. 65
5.6  Typical consequence set and utilities for Example-1
     (10^5 samples) .............................................. 66
5.7  Estimates of A_i, M_i (i ≠ 4) for Example-2 ................. 70
5.8  Estimates of A_4, M_4 for Example-2 ......................... 70
5.9  Estimates of P_i, D_i (i ≠ 4) for Example-2 ................. 75
5.10 Estimates of P_4, D_4 for Example-2 ......................... 75
5.11 Typical consequence set and utilities for Example-2
     (2nd order) ................................................. 85
5.12 Typical consequence set and utilities for Example-2
     (5th order) ................................................. 86
5.13 Estimates of A_i, M_i (i = 1, 2) for Example-3 .............. 90
5.14 Estimates of A_3, M_3 for Example-3 ......................... 90
5.15 Estimates of A_4, M_4 for Example-3 ......................... 90
5.16 Estimates of P_i, D_i (i ≠ 4) for Example-3 ................. 95
5.17 Estimates of P_4, D_4 for Example-3 ......................... 95
5.18 Typical consequence set and utilities for Example-3
     (2^3 samples) .............................................. 105
5.19 Typical consequence set and utilities for Example-3
     (2^7 samples) .............................................. 106
5.20 Typical consequence set and utilities for Example-2
     (2nd order, improved multiply algorithm) ................... 111

LIST OF FIGURES

Figure                                                            Page
3.1  Block diagram of a generalized microprocessor-based
     system ...................................................... 12
3.2  Block diagram of the DIRECT computational design
     alternative ................................................. 14
3.3  Block diagram of the AU computational design
     alternative ................................................. 14
3.4  Block diagram of the CALC computational design
     alternative ................................................. 14
3.5  Block diagram of the multiple-microprocessor
     computational design alternative ............................ 15
3.6  Multiple-precision addition flowchart ....................... 19
3.7  "Operation" flowchart for CALC CDA .......................... 20
3.8  "Number-entry" flowchart for CALC CDA ....................... 21
3.9  Multiple-precision addition time for various CDA's .......... 23
3.10 Multiple-precision multiply flowchart for DIRECT CDA ........ 24
3.11 Multiple-precision multiply flowchart for AU CDA ............ 26
3.12 Multiple-precision multiply time for various CDA's .......... 27
4.1  Indifference curves for computational design alternatives ... 40
5.1  General approach to selecting a computational design
     alternative ................................................. 45
5.2  Linear regression series flowchart .......................... 49
5.3  Relation between Master and Slaves for Example-1 ............ 52
5.4  Linear regression parallel flowchart: Master ................ 53
5.5  Linear regression parallel flowchart: Slave-2 ............... 54
5.6  Single-precision execution-time versus sample size
     for Example-1 ............................................... 57
5.7  Double-precision execution-time versus sample size
     for Example-1 ............................................... 58
5.8  Triple-precision execution-time versus sample size
     for Example-1 ............................................... 59
5.9  Quadruple-precision execution-time versus sample size
     for Example-1 ............................................... 60
5.10 Single-precision costs versus sample size for Example-1 ..... 61
5.11 Double-precision costs versus sample size for Example-1 ..... 62
5.12 Triple-precision costs versus sample size for Example-1 ..... 63
5.13 Quadruple-precision costs versus sample size for Example-1 .. 64
5.14 Matrix inversion series flowchart ........................... 68
5.15 Relation between Master and Slaves for Example-2 ............ 71
5.16 Matrix inversion parallel flowchart: Master ................. 72
5.17 Matrix inversion parallel flowchart: Slave-J ................ 73
5.18 Single-precision execution-time versus matrix order
     for Example-2 ............................................... 77
5.19 Double-precision execution-time versus matrix order
     for Example-2 ............................................... 78
5.20 Triple-precision execution-time versus matrix order
     for Example-2 ............................................... 79
5.21 Quadruple-precision execution-time versus matrix order
     for Example-2 ............................................... 80
5.22 Single-precision costs versus matrix order for Example-2 .... 81
5.23 Double-precision costs versus matrix order for Example-2 .... 82
5.24 Triple-precision costs versus matrix order for Example-2 .... 83
5.25 Quadruple-precision costs versus matrix order for
     Example-2 ................................................... 84
5.26 FFT series flowchart ........................................ 88
5.27 Relation between Master and Slaves for Example-3 ............ 91
5.28 FFT parallel flowchart: Master .............................. 92
5.29 FFT parallel flowchart: Slave-M ............................. 93
5.30 Single-precision execution-time versus sample size
     for Example-3 ...............................................
                                                                   97
5.31 Double-precision execution-time versus sample size
     for Example-3 ............................................... 98
5.32 Triple-precision execution-time versus sample size
     for Example-3 ............................................... 99
5.33 Quadruple-precision execution-time versus sample size
     for Example-3 .............................................. 100
5.34 Single-precision costs versus sample size for Example-3 .... 101
5.35 Double-precision costs versus sample size for Example-3 .... 102
5.36 Triple-precision costs versus sample size for Example-3 .... 103
5.37 Quadruple-precision costs versus sample size for
     Example-3 .................................................. 104
5.38 Multiple-precision multiply time for various CDA's ......... 109
5.39 Triple-precision execution-time versus matrix order
     for Example-2 .............................................. 110

CHAPTER I

INTRODUCTION

Technological advances made within the electronics industry during the early and mid 1970's stimulated the rapid development and broad acceptance of microprocessor-based systems. Today, these systems span a wide range of applications with considerable variation in computational requirements. To meet these requirements, the electronics industry introduced a host of large-scale integrated (LSI) devices that perform diverse computations; for example, linear regression, matrix inversion, and fast Fourier transform computations. Even though a plethora of hardware/software computational design alternatives confront today's system designer, no structured approach exists for synthesizing optimal computational sections. So the purpose of the research reported here is to investigate and characterize computational design alternatives (CDA's), and to develop a more rigorous and structured approach to design which incorporates a firm, theoretic foundation.

Using the results of this research, designers of microprocessor-based (uP-based) systems will be able to synthesize advanced systems with improved performance. Second, these results will benefit LSI chip designers by influencing what properties they will embody in the next generation of devices. Since some features assist computations while others detract, these new LSI devices will contain the desirable characteristics and minimize the action of the undesirable ones. Third, the results of this study will assist developers of uP-based systems by providing them with the methodology to investigate and characterize future systems as technology advances. Because the theoretic development begins at the axiom level, they may easily modify the methodology to fit their particular goals.

Throughout the research reported here the study achieves several objectives as delineated below:

1) survey the present semiconductor market for LSI devices that facilitate computations,
2) generalize the properties of such devices,
3) identify a basic set of elemental computational design alternatives,
4) determine common attributes for each member in the set,
5) find techniques to evaluate such attributes,
6) develop a decision mechanism for comparing CDA's which uses these attributes,
7) illustrate the use of this decision mechanism with representative and contemporary engineering applications, and
8) use the examples to characterize the properties of each CDA in the elemental set.
To meet these objectives, the study only considers elementary hardware/software tradeoffs because of their straightforward and tractable analysis. But since the overall methodology remains the same with optimal, or advanced, hardware devices and software algorithms, no loss in generality occurs.

In Chapter II a survey of today's semiconductor market identifies those devices and their characteristics that facilitate computations. Specifically, the discussion focuses on uP's, solid-state memories, and LSI support chips. These results establish the motivation and rationale for the entire investigation. With these devices, Chapter III defines a basic set of elemental CDA's which when combined may realize the computational section of a generalized uP-based system. To facilitate comparison of each member in this set, it determines attributes common to the members that reflect the important features, qualities, and characteristics of each CDA, and it describes techniques to evaluate them. Next, Chapter IV presents a decision mechanism for selecting the "best" CDA which uses the attributes found in the previous chapter. And this involves eight axioms upon which the theoretic foundation rests, several independence definitions, and a central theorem which delineates a specific decision function. Finally, the three examples in Chapter V--linear regression, matrix inversion, and fast Fourier transform computations--use the results of the previous two chapters to clarify the techniques used to evaluate attributes, and to elucidate the specific decision function in selecting the "best" CDA. Additionally, this chapter explores the effect of alternative algorithms. Not only do these examples typify representative and contemporary engineering problems, but they vary widely in their mathematical complexity. During this process the applications exemplify the overall procedure of this investigation and, moreover, they lead to conclusions which characterize each CDA.

CHAPTER II

CHARACTERISTICS OF COMPUTATIONAL DEVICES

Recent advances in processing technology led to LSI devices that exhibit enhanced features at low cost. Consequently, uP-based systems now function in a myriad of applications which range from medicine(1) to communications(2) to agriculture(3-5) to business.(6) And other examples abound. Many uP-based systems require advanced mathematical capabilities, and a multitude of computational design alternatives may accomplish these calculations. In this chapter a survey of today's semiconductor market identifies those LSI devices and their characteristics that facilitate computations. Specifically, the discussion focuses on the performance trends of uP's, solid-state memories, and LSI support chips in the light of present technology. These results establish the rationale and motivation for the investigation reported in the remaining chapters.

2.1 Microprocessors

In 1972, the Intel Corp. introduced a 4-bit word length, 10 usec. cycle-time, p-channel uP to the electronics industry.(7) Other firms soon followed suit with uP's of their own. This original device required three distinct power supplies and a two-phase clock to execute its 45 instructions, and a complete microcomputer entailed about 4 chips. Soon, a second-generation device entered the semiconductor market which exhibited nearly twice the performance of its predecessor.(8) By using the then recently developed n-channel technology, which achieved higher density and increased speed, it displayed a 5 usec.
cycle-time, an 8-bit word length, and an instruction set of 80. Further advantages included single power-supply operation and only a single-phase clock. And a mere 2 chips configured a complete microcomputer. Presently, the semiconductor market maintains a third-generation uP which, again, doubles its forerunner's performance.(9, 10) This circuit uses advanced n-channel or integrated-injection logic (I2L) technology, attains a 1 usec. cycle-time for an 8 or 16-bit word length, and sells for about $10.00; it includes an instruction repertoire of a minicomputer (well over 100). Yet a more important addition involves on-chip memory and input/output (I/O) logic which significantly reduces package count and, thus, costs. Consequently, today's semiconductor market truly supports a "one-chip" microcomputer with many advanced features.

An analogous trend which parallels these metal-oxide semiconductor (MOS) advances saw bipolar "bit-slice" uP's exploit the speed-power characteristics of low-power Schottky transistor-transistor logic (TTL). Unlike MOS uP's these devices do not contain a microprogram programmable logic array (PLA) and, hence, the user provides the microprogram;(11-13) this allows the use of any desired instruction set. With pipelined architectures they produce 125 nsec. cycle-times, and many small computer firms now offer bit-slice, uP-based minicomputers as either new products or upgraded replacements.(14)

From this discussion it becomes clear that the uP business is extremely competitive, which generally forces product lines to expand rapidly. Often, by the time a uP-based system reaches the marketing phase it contains obsolete parts and components.

2.2 Solid-State Memories

As with uP's and bit-slice devices, technological advances have remarkably altered both the price and performance of solid-state memories. Over the past several years average prices (millicents-per-bit) dropped at an annual rate of 35% and densities (bits-per-chip) increased 60% annually.(15, 16) For example, memories cost about $0.01 per bit and a typical chip contains from 1,000 to 8,000 bits. Consequently, memory system prices plunged while speed increased (e.g., 1 usec. access times).

Memories fall into two main divisions depending on the "accessibility" of an arbitrary location. Random access (or parallel) implies that any memory address may follow any other, but for sequential (or series) memories each address occurs in a specific order.(17, 18) For sequential memories, the CCD and bubble show much promise and may eventually replace the electromechanical serial memories such as disks, tapes, drums, etc. Although they now cost slightly more than the others, they contain no moving parts to wear out and exhibit faster access times.(19)
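The compound trends quoted above are easy to project forward. The following Python sketch is an illustration only (it is not from the dissertation); the year-zero values are the representative 1978 figures cited in the text.

    # Illustrative projection of the quoted memory trends.
    price_per_bit = 0.01    # $/bit, quoted 1978 figure
    bits_per_chip = 8000.0  # upper end of the 1-8k bits/chip range

    for year in range(6):
        print(f"year {year}: ${price_per_bit:.5f}/bit, "
              f"{bits_per_chip:,.0f} bits/chip")
        price_per_bit *= 1.0 - 0.35   # prices drop 35% per year
        bits_per_chip *= 1.0 + 0.60   # densities grow 60% per year

Under these rates the cost of a fixed amount of storage falls by roughly an order of magnitude about every five years, consistent with the price plunge the survey describes.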
With field programmable ROMs the end user applies specific voltages and currents to the chip that define its contents. In general, this process purposely destroys part of the circuit, but ROMs with a new kind of FAMOS memory cell can be reprogrammed.(21) They contain a quartz window which when exposed to the correct ultra Violet (uV) light resets the memory for the next programming cycle. Second, with static memories information remains in bistable flip-flops, but with dynamic memories information becomes the charge, or lack of charge, on a semiconductor capacitor and, hence, needs continued refresh due to nonideal leakage resistance. Although dynamic memories operate at higher speeds and consume less power than statics, still they may necessitate more costly auxiliary components such as multiple power supplies or critical clock circuits.(22) Third, the MOS/bipolar class- ification refers to the process technology used to manufacture the meme ory. With p-channel MOS memories their characteristics include low cost, excellent packing density, but slow speed; n-channel MOS properties basic- ally differ in speed; i.e., slightly faster.(23) Another MOS technology, complementarydmetal-oxide semiconductor (CMOS), employs both p and n-chan- nel transistors and exhibits reduced power consumption with an increased speed-power product, but at the expense of less density and greater costs. As with all bipolar processes, Schottky TTL enjoys fast circuit operation and plays a dominant role in the digital industry.(24) But it dissipates considerable power and occupies much silicon area for decreased densities and high cost. Perhaps an excellent compromise technology, bipolar 12L reveals good speed and density at less power dissipation and cost than Schottky TTL.(25’ Therefore, many types of solid-state memories exist with widely differing characteristics, and to select one for an applic- ation involves careful consideration of several tradeoffs. 2.3 LSI Support Chips The uP's described in Section 2.1 offer only general purpose in- struction sets that often contain few sophisticated (or specialized) op— erations: particularly, I/O and arithmetic. To incorporate these auxil- iary functions would require additional pins on the chip, and most manu- facturers limit themselves to 40 (for economical yields) which eliminates such special operations (time-multiplexing the pins causes reduced speed and complex timing). Yet some firms concentrate on standard "support ’chips" that facilitate these external functions; e.g., data communica- tion, data conversion, operator interaction, computation, etc. Specifically, computational support chips deal with the arith- metic and trigonometric functions not found in uP's today. For example, (26, 27) new and creative algorithms helped semiconductor firms produce multiply circuits with 8-bit, 100 nsec., 1W operation that sell for about (28,.29) $100.00. And with an eye toward floating-point formats and oper- ations, a 24-bit, 200 nsec., 5W multiply chip obviates the need for mult- (30) iple-precision algorithms. A remarkably versatile support chip con- tains both 16 and 32-bit, fixed and floating-point, arithmetic and trig- onometric functions where all transfers occur over an 8-bit, bidirection- _ (31) a1 bus and in one 24-pin package. The ubiquitous calculator chip, when interfaced to a uP, pro- vides extremely powerful processing power with literally no software dev- (32’ 33) Additionally, they incorporate both fixed and floating- elopment. 
point formats, but function slowly (typically 50 msec. for an add). Because they often require multiple power supplies, logic-level shifting circuits, and special encoding/decoding hardware, implementation costs involve around $200.00.(34)

The support chips for computations just described only illustrate the wave of new LSI function modules created by technology advances. Furthermore, many people predict the concept of LSI software where ROMs contain complete program packages tailored for some specific goal.(35, 36) Here, program ROMs matched with unique hardware arrangements drastically reduce software development time.

2.4 Summary

From the preceding survey of today's semiconductor market it becomes clear that many LSI devices facilitate computations. Representative characteristics of uP's, solid-state memories, arithmetic units, and calculator chips (see Table 2.1) illustrate the enhanced features and low cost of present devices. So with these characteristics a designer of uP-based systems may synthesize CDA's to perform advanced mathematical computations. Thus, this table of specific values defines the characteristics of LSI devices used throughout the remaining chapters.

Table 2.1 Representative characteristics of computational devices.

    a) uP's                            b) Solid-state memories
       8-bit word length                  1-8k bits/chip
       8 usec. cycle-time                 1 usec. access-time
       $10.00 cost                        $0.01/bit cost

    c) Arithmetic units                d) Calculator devices
       8-bit word length                  multiple-precision word-length
       100 nsec. execution-time           50 msec. execution-time
       $100.00 cost                       $200.00 cost

CHAPTER III

COMPUTATIONAL DESIGN ALTERNATIVES

With the numerous LSI devices available to designers today, synthesis of advanced CDA's involves careful and tedious evaluation of several tradeoffs. This chapter defines a basic set of elemental CDA's which use the representative devices delineated in Chapter II. To facilitate comparison of each member in this set, it determines attributes common to each member that reflect the important features, qualities, and characteristics of each CDA, and it develops techniques to evaluate them. So synthesis of advanced CDA's now reduces to judicious combinations of these basic members.

3.1 Elemental CDA's

A generalized uP-based system may consist of several sections, each with specific and unique responsibilities (see Fig. 3.1). In such a system each section communicates with the others over a common system bus. The "computation" section functions to accomplish some distinct arithmetic calculation which the particular application requires. Because of technological progress, various combinations of hardware and software may implement the needed computation. Thus, a computational design alternative (CDA) represents the specific combination of hardware and software required to accomplish a particular computation.

All "computational" sections of a uP-based system can be decomposed into four "elemental" CDA's, each with their own different mixture of hardware and software.

[Figure 3.1 Block diagram of a generalized microprocessor-based system.]
The above muP CDA Master/Slave arrangement does not necessarily produce optimal results; an architecture designed for a unique problem, or class of problems, can potentially decrease execution-time and reduce costs. But the proposed muP CDA does present a simple, general-purpose architecture that solves many problems well and, additionally, lends it- self to straightforward analysis. For each elemental CDA a unique mixture of hardware and soft- ware accomplish computations. Thus, the properties of each CDA vary con- siderably. So judicious combinations of the four CDA's (DIRECT, AU, CALC, and muP) will realize the computation section of a uP-based system. 3.2 Attributes Comparison of a group of objects occurs through evaluation of a common subset of characteristics, or properties. Moreover, the select- ed subset of attributes must reflect the important features and qualities of each member. When comparing a set of CDA's, several attributes sat- isfy these requirements: precision, execution-time, cost, power dissi- pation, circuit complexity, programming language, circuit reliability, packaging demands, maintenance schedule, etc. Too few attributes res- ults in incomplete examination of the objects while too many attributes may cause unnecessary confusion. For the above list, all but the first three attributes depend strongly on the remainder of the uP-based sys- tem and, thus, fail to depict attributes indigenous to the CDA's. But precision, execution-time, and cost represent convenient and illustra- tive attributes for demonstrating the principal characteristics of CDA's. 17 3.3 Precision Two common number representations, fixed-point and floating- point, both identify discrete values on the real-number line. For the fixed-point format an implied binary-point always lies between two spe- cific bits in the word, and with floating-point numbers the binary-point varies (this requires storage of the position). In the research reported here, all numbers conform to the fixed-point format. Thus, precision in- volves the quantity of 8-bit memory words used to form a particular nume ber; e.g., p = 1 for single-precision, p - 2 for double precision, etc. Since the nP's perform two's complement arithmetic, all numbers corre- spond to multiple-precision, fixed-point, two's complement quantities. 3.4 Execution—Times Execution-time determines the number of seconds needed to come plate a specific computation. Since addition/subtraction and multipli- cation/division both correspond to complementary arithmetic and digital logic functions, the execution-time of an addition roughly equals a sub- traction and a multiplication roughly equals a division. These assump- tions greatly facilitate analysis of execution-times. Thus, if the numr ber of adds and multiplies can approximate the execution-time of an ap- plication program, then these values multiplied by the time to perform an add and multiply sum to give the execution-time. Symbolically, let A - number of adds 3 I number of multiplies add time m I multiply time B I Ti - Aia1 +Mi'mi I execution-time (3.1) 18 where, l, DIRECT CDA 2, AU CDA 3, CALC CDA, and 4, muP CDA. In Equation 3.1 the specific CDA architecture dictates expressions for a1 and mi, while the algorithm of the application program configures con- stants A1 and M1. Multiple-precision arithmetic operates on each word of the num- ber individually to produce the final result; e.g., consider the mult- iple-precision add flowchart in Figure 3.6. 
Here, corresponding words of the two operands add to produce the partial sum and the carry ripples through from word to word. Since each pass through the loop involves about 10 instructions, then for

    t_o = cycle-time (8 usec.)
    p = precision,

    a_1 = 10 p t_o.                                               (3.2)

Also, the AU and muP CDA's employ identical add algorithms, so

    a_2 = 10 p t_o, and                                           (3.3)
    a_4 = 10 p t_o.                                               (3.4)

[Figure 3.6 Multiple-precision addition flowchart: initialize CARRY = 0, I = 0, L = -p, OV = 0; then for each word, R = (A+I); R = R + (B+I) (single-precision addition); R = R + CARRY (add previous carry); (C+I) = R; I = I + 1, L = L + 1; repeat until L = 0, then exit.]

But with the CALC CDA each operand requires decoding and encoding between binary and BCD formats, and each arithmetic operation demands "digit entry" (simulation of key depressions) and "function" time (see Figs. 3.7 and 3.8). Decoding/encoding tasks and the answer reads occur in usecs., but when compared to the digit entry and function delays (msecs.) they contribute no significant time to the add, or multiply, execution-times.

[Figure 3.7 "Operation" flowchart for CALC CDA: write the operands, write the function code, test for completion, and read the result C.]

[Figure 3.8 "Number-entry" flowchart for CALC CDA: write the sign, then write the digits one at a time until the digit count L reaches zero.]

To determine the approximate number of digits sent to the calculator chip, note that each two's complement number roughly ranges in magnitude ± 2^(8p-1), and on the average the number decoded falls in the middle, 2^(8p-2). For 1 ≤ p ≤ 4 these numbers contain around 2p decimal digits; e.g., if p = 1 and 2^(8p-2) = 2^6 = 64, then the uP sends 2 decimal digits to the calculator chip. With these ideas the execution-time of a multiple-precision add instruction for the CALC CDA becomes

    t_e = digit entry time (40 msec.)
    t_a = "add" function time (90 msec.), and

    a_3 = 2(2p + 1) t_e + t_a.                                    (3.5)

Equations 3.2 through 3.5 yield estimates for the multiple-precision add times of all CDA's, and Figure 3.9 illustrates these values for various precisions. In this figure the most noticeable observation concerns the difference in add-time "magnitudes": nearly three orders of magnitude.

[Figure 3.9 Multiple-precision addition time (sec.) versus precision for various CDA's: x = DIRECT, AU, muP; o = CALC.]

By reducing the mathematical operation of multiplication to repetitive adds, the product becomes the multiplicand added to itself the multiplier number of times. For a p-precision, two's complement, fixed-point binary number multiplied by another, the result yields a 2p-precision product. Thus, the DIRECT CDA (without a hardware multiply) can use the algorithm depicted in Figure 3.10. The 2p-precision adds contribute, using Equation 3.2, 20 p t_o time each, and on the average (see the development of Eq. 3.5) this occurs 2^(4p+1) times each multiplication; i.e., the multiple-precision multiply

    m_1 = (20 p t_o)(2^(4p+1)) = 80 p 2^(4p-1) t_o.               (3.6)

And since the muP CDA uses the same software,

    m_4 = 80 p 2^(4p-1) t_o.                                      (3.7)

[Figure 3.10 Multiple-precision multiply flowchart for DIRECT CDA: L = multiplier; product = 0; then product = product + multiplicand, with L decremented, until L = 0.]
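The two flowcharts above translate almost directly into code. The following Python sketch is an illustration, not the dissertation's implementation; the little-endian list-of-bytes representation and the function names are assumptions. It mirrors the word-by-word ripple-carry add of Figure 3.6 and the repeated-addition multiply of Figure 3.10, and it makes plain why the multiply time grows so much faster than the add time: the add loop runs p times, while the multiply loop runs once per unit of the multiplier's value.

    # A minimal sketch: numbers are lists of 8-bit words, least-
    # significant word first (an assumed representation).
    def mp_add(a, b):
        # Word-by-word add with ripple carry, as in Figure 3.6; one
        # pass (about 10 uP instructions) per word gives a_i = 10*p*t_o.
        result, carry = [], 0
        for x, y in zip(a, b):
            s = x + y + carry
            result.append(s & 0xFF)   # store the partial-sum word
            carry = s >> 8            # carry ripples to the next word
        return result                 # overflow handling omitted

    def mp_multiply_direct(multiplicand, multiplier, p):
        # DIRECT CDA multiply (Figure 3.10): add the multiplicand to
        # the product "multiplier" times, each pass a 2p-precision add.
        product = [0] * (2 * p)
        wide = multiplicand + [0] * p    # widen operand to 2p words
        for _ in range(multiplier):
            product = mp_add(product, wide)
        return product

    # 300 * 5 in double precision (p = 2): 1500 = 0x05DC.
    print(mp_multiply_direct([0x2C, 0x01], 5, 2))  # -> [220, 5, 0, 0]

The result is exact, but the running time is proportional to the multiplier's value, which is the source of the exponential term in Equation 3.6.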
But for multiple-precision numbers the task becomes much more difficult; consider the notation: A I (ap +'a + ... + al) I multiplier p-l B I (bp + bp_1 + ... + bl) I multiplicand, and C I AB I (a b + a‘b P + ... + a b p p p 1) p-l + (a bp + ap-lbp-l +,... + ap-lbl) + ... + ... + albl), p-l + (albp + albp-l I product. Thus, multiple-precision multiplies for AD CDA involve p2 single-preci- sion multiplies (partial products) and p2 2p~precision adds (see Fig. 3.11); combining, 2 2 1:12 I p (5:0) + p (20pto) - 5p2(1 + 4p)to. (3.8) For the CALC CDA the analysis of Equation 3.5 directly applies to find- ing the multiple-precision multiply time; only the "multiply" function replaces the "add" function time; i.e., te I digit entry time (40 msec.) tm I "multiply" function time (120 msec.), and m3 I 2(2p + 1)te + tm. (3.9) So Equations 3.6 through 3.9 give estimates for the multiple- precision multiply times of all CDA's, and Figure 3.12 illustrates these values for various precisions. From this figure the most strik- ing feature pertains to the rapid rise of the DIRECT multiply time; 26 Cm) i I IAIO, IBIO, LlI-p, L2I-p, PIO, PPI0 I Single- I PPI(A+IA)*(B+IB) I- -- --precision , ‘multiply. H P. H 1' 2p-precision I 1,-pr I _ _ _ addition. I IAIIA+1, IBIIB+1 I NO “P 1.1-0? YES Lia-p, IAIO, IBIIB+1, L2IL2+1 L2 I 0? YES (: EXIT #:> Figure 3.11 Multiple-precision multiply flowchart for AU CDA. 27 10“ 103 102 ’r 101 U) 0 III 8 g 100 E4 u A z A a A X S -1 (J 10 ill E5 > 10"2 )t < (D -3 1° 0 10"“ (f 1 2 3 5 PRECISION X - DIRECT, muP A‘ CALC o - AU Figure 3.12 Multiple-precision multiply time for various CDA's. 28 for p I 1 it lies below the CALC, by p I 2 it equals the CALC, and for p 3_3 it exceeds the CALC considerably. Such action results from the repetitive adds which comprise the DIRECT multiply time. Also, the AU multiply time remains the fastest because of its hardware multiply, but it increases quicker than the CALC. This event occurs due to the larger number of AD multiplies required with greater precision, than the incre- ment in digit entries for the CALC CDA. Now, once the constants A and i M have been found for an application, Equation 3.1 with Table 3.1 can i give an estimate of the execution-time for each CDA. In this Section the flowcharts for various multiple-precision add and multiply instructions provide only the basic outline. Among the functions added to an implemented algorithm include initialization, zero and sign checks, overflow/underflow detection, etc. But these extra du- ties do not contribute significantly to the overall execution-time, nor do they represent more than second and third-order terms in the execu- tion-time Equations 3.2 to 3.9. 3.5 Costs The monetary expense incurred with each CDA could include sev- eral components, many of which depend on the remainder of the uP-based system, so this development focuses on the semiconductor parts cost. For all CDA's two principal terms add to give the total cost, one cor- responds to an elemental (or basic) CDA cost and the other to an appli- cation dependent cost: C I elemental CDA cost ei cai I application cost, and C1 I Ce1 + Cai’ l §_i §_4; (3.10) I total cost (in $'s). 29 First, the elemental cost term Ce relates to the basic expense of pro- 1 curing a CDA and its mathematical subroutine's memory. And three main factors comprise C ei’ IOi I I/O device cost Ri I mathematical subroutine's cost CPU1 I uP cost, and c‘31 - 101 + R1 + crui; 1 _<_ 1 _<_ 4. 
(3.11) Second, the application cost term Ca pertains only to the added ex- 1 pense of the application memory. This involves two terms, one for pro- gram.memory and one for data storage: wo I cost per word ($0.08) P I program memory 1 Di I data memory, and C81 . wo(Pi + pDi); 1 5_ 1 _<_ 4. (3.12) Because term Ce in Equation 3.10 does not depend on the application it i can be determined now, yet term Ca must be deferred until an applica- 1 tion program is defined. Hence, the three terms 10 R and CPU in 1’ i’ 1 Equation 3.11 need evaluation for each CDA. For the DIRECT CDA the I/O devices and uP costs involve simple estimates; since no I/O devices reside on the bus I0 I $0.00 and, to- 1 l I $10.00. But estimating R1 involves much more analysis and approximation. From Figure 3.6, the multiple-precision add day, an estimate for CPU flowchart, if each step involves about 1 word of memory, than 30 words corresponds to a reasonable estimate of the number of words for the in- struction. Similarly, the multiply instruction illustrated in Figure 3.10 uses the add instruction so the number of words roughly equals 20. A summation of these values yields 50 words and doubling this to account 30 for the subtract and divide subroutines results in 100 words. So 100 words at a cost of wb I $0.08 per word gives R1 I lOOwb I $8.00. With the AU CDA the expense of the AU changes I0 ; today, a 2 reasonable estimate of IO2 I $100.00. Still, CPU2 remains the same as CPUl, or CPU2 I $10.00. Following the procedure which determined R , the add instruction contributes 30 words while the AU multiply instruc- tion (see Fig. 3.11) attaches an addition 45 words. Together, they sum to 75 words which when doubled for the subtract and divide subroutines 2 I 150wb I $12.00. Next, the CALC CDA, like the AU CDA, embodies an expensive I/O' give 150 words; this sets R device (the calculator chip) and a contemporary estimate for this term assigns IO3 I $200.00. Yet the CPU estimate continues at CPU I $10.00. 3 With this CDA the exact same flowchart, see Figure 3.7 and 3.8, can acc- complish all arithmetic subroutines because only the function code (a parameter) need change from instruction to instruction; e.g., from add to multiply. These figures suggest that about 50 words can retain the program, and R3 I SOwb I $4.00. Analysis of the last CDA, the muP alternative, for the Ce term 4 parallels the DIRECT CDA except in quantity. Since the muP CDA repli- cates the DIRECT CDA hardware and software, and if it engages S Slaves, then I0 I $0.00, R I SR1, and CPU I (S +'1) CPU The CPU term mult- 4 4 4 l' 4 iplies CPU1 by (S + 1) because the Master uP creates an additional term. Thus, Table 3.2 delineates the estimates for I01, R1, CPUi, and Ce1 for all CDA's, l §_i §_4, based on 1978 device costs. From this table the elemental cost of the muP CDA grows rapidly as the number of Slaves increases. It surpasses the AU CDA by S I 6 and the CALC CDA by S I 12. Also, the contribution of R to Ce for the AU and CALC CDA's 1 i is insignificant because of the expensive I01 term. 31 3.6 Summary The preceding sections defined a basic set of four elemental CDA's: DIRECT, AU, CALC, and muP. In addition, they identified three important attributes-Iprecision, execution-time, and cost--which facil- itate comparison of these elemental CDA's. Precision involves the quan- tity of 8-bit words used to represent a number, execution-time the numv ber of seconds to complete a computation, and cost the dollar value of LSI devices used. 
Equations 3.1 and Table 3.1 determine execution-time, while Equation 3.10 and Table 3.2 specify cost (both use precision as a parameter). For execution-time and cost, these values result from terms indigenous to each CDA and from application derived terms. Using elemen- tary multiple-precision flowcharts for arithmetic operations, the chapter estimates these indigenous terms. Such an approach offers straightforward and tractable analysis. In Chapter V, the text presents techniques to estimate these application terms, and illustrates the effect of alterna- tive algorithms on the DIRECT and muP CDA's performance; but first, Chap- ter IV presents a decision mechanism for selecting the "best" CDA. 32 Table 3.1 Add and Multiply execution-times for various CDA's. i i 4p-1 10pto 80p2 to 10 t 5 2(1% )t P o P P o 2(2p+l)te + ta 2(2p+l)te + tm 4p-l lOpto 80p2 to Table 3.2 Estimates of Cei for various CDA's. IO1 Ri CPUi Cei __ $8.00 $10.00 $18.00 $100.00 $12.00 $10.00 $122.00 $200.00 $4.00 $10.00 $214.00 -- $(S)9.00 $(S+l)10.00 $(S)l9.00 + 10.00 CHAPTER IV MULTIATTRIBUTE UTILITY THEORY Using the attributes--precision, execution-time, and cost--this chapter describes a decision mechanism for selecting the "best" CDA en- titled Mnltiattribute Utility Theory (MUT). This begins with several definitions that explain the notation used throughout the remaining sec- tions. Next, the Chapter presents eight axioms upon which the theoretic foundation of MUT rests, and it shows that the above attributes satisfy these axioms. In addition, it verifies that the attributes also fulfill the assumptions of a specific decision function; i.e., an additive util- ity function. Finally, this chapter describes techniques to assess this specific decision function. Thus, these concepts provide a decision mechanism for selecting the "best" CDA using attributes precision, exe- cution-time, and cost. 4.1 Multiattribute Utility Theory Definitions As a guide to notation several definitions are explained that will be used in the sections that follow.(41) Definition 4.1: Consequence Space Let consequence space, denoted i I X1 represent a rectangular subset Of a finite N-dimensional x X2 x X3 x...x XN, Euclidean space. Definition 4.2: Consequence Let the consequence i, denoted i I (x1, x2, x3,..., xN), depict a specific point in consequence space where 33 34 belongs to E . x 1 1 Definition 4.3: Attribute Let the attribute x correspond to a particular value of i dimension ii; e.g., x1 Definition 4.4: Consequence Set belongs to ii . Let the consequence set C, denoted C = {x1, x2, x3,..., xM}, consist of the set of M consequences. Definition 4.5: .Relation For consequences i and E , the relation § > § means that E i j ----- i j i is preferred to ij' Similarly, £1 < ij means E1 is not prefer- red to ij, and 21 I £3 means £1 is equally preferred to £1. Definition 4.6: Operation For mutually exclusive consequences E and I and probability a, i 3 0 §_a §_l, the operation w I axi + (1 - a)xJ represents the consequence E with probability a and consequence i with i probability (1 - a). 1 Definition 4.7: Utility Let the utility of consequence i, denoted u(x), describe a scalar quantity which indicates the usefulness of consequence x. Definition 4.8: Marginal Utility Let the marginal utility of attribute x denoted ui(x1), des— 1’ ignate a scalar quantity which indicates the usefulness of attribute xi. 35 4.2 Utility Existence Axioms The theoretic structure of MUT rests on a foundation composed (42) of eight axioms. 
Von Neumann and Morgenstern have shown that if a consequence set satisfies these axioms, then there exists a numerical utility for each member of the set. Formally, let E1, ij, and 2k denote any three members of the consequence set, and for 0 §_a,8 §_1: Axiom 4.1 One and only one of the following three relations must hold; x > x , x < x , or x I x . i j i J 1 J Axiom 4.2 If x > x. and x3 > xk, then x1 > xk. Axiom 4.3 If E > E , then i > ax + (1 - a); 1 1 j’ Axiom 4.4 If I < i , then i < ax +>(1 - a); 13 i 1 j' Axiom 4.5 If E _>_i > ER, then there exists an a such that axi + (l - u)1-tk > £1. Axiom 4.6 If x1 < J < xk, then there exists an a such that axi + (l - 01);:k < 3&1. Axiom 4.7 The operation ax i‘+ (1 - a);j I (1 - a); ‘+ ax J 1' Axiom 4.8 The operation (1(8)?i +,(1 - B); ) + (1 - a);j I maxi + (1 - a8); . j 1 36 Theorem 4.1 If a consequence set satisfies the preceding eight axioms, then a numeric utility exists for each member of the set. When MUT is applied to selecting CDA's the consequence set con- tains exactly four members (one for each CDA) with three common attri- butes (precision, executionrtime, and cost). Thus, M I 4 and N I 3. Axiom 4.1 deals with the "completeness" of the consequence set. This assumption rules out the possibility that; i) neither of two CDA's is preferred, while ii) both CDA's are undesirable. In Axiom 4.2 the as- sertion considers the "transitivity" of preferences; e.g., if one CDA is preferred to a second which is preferred to a third, then the first CDA is preferred to the third. This property is quite plausible and commonly accepted. For Axiom 4.3 if CDA one is preferred to two, then it will always be preferred to the combined event because CDA two can occur with probability (1 - a) (Axiom 4.4 states the dual of Axiom 4.3). Now, if CDA one is preferred to two and two is preferred to CDA three as in Axiom 4.5, then for some a (sufficiently large) the combined event is preferred to CDA two. Here, a provides a likely base for the numerical estimate of the preference for CDA one to two, over CDA two to three (Axiom 4.6 states the dual of Axiom 4.5). These Axioms, 4.5 and 4.6, provide credible "continuity" assumptions. In Axiom 4.7 the statement claims that the order of combined events may vary; this follows, accord- ingly, since the constituents result from alternate events. Finally, Axiom 4.8 states that combination events composed of CDA constituents may proceed in two successive steps or one complete operation. Such a "distributive" property is usually accepted as standard practice. 37 4.3 Additive Utility Functions Axioms 4.1 through 4.8 guarantee that utilities exist for all members of the consequence set C, still the exact functional form remains unknown. In general, a utility function involves multidimensional oper— ations on the attribute values, but under certain conditions this func- tion reduces to a much simpler form composed of many unidimensional functions; symbolically, uci) - f {u1(§), u2(§), u3(§),..., uN(§)}. (4.1) Consider the following definition: Definition 4.9: Utility Independence of ii? Then, X1 is utility independent of ii-if ones prefer- ence order over lotteries on i; with ii-held fixed does not x...x ifi, and let £1 be a member depend on the fixed amount ii. (43) has shown that if i; is Utility independent Of 2: for all dimensions, then Equation 4.1 takes a "quasi-additive" form. 
Keeney Theorem 4.2 If ii is utility independent of 3%:for i I 1, 2, 3,..., N, then _ N N N u(x) .151 kiui(xi) + 1&1 j.i+1 kijui(xi)uj(xj) N N N + 2 Z Z kijmui(xi)uj(xj)um(xm) +..., (4.2) 1-1 in+1 ij+1 where k , k k i ij’ ... are scaling constants. ijm’ For the three attributes-precision, execution-time, and cost-~1otter- ies over any attribute will not vary as the specific values of the remaining attributes change; e.g., more precision is always preferred 38 to less, less execution—time to more, and less cost to more. Thus, since ii is utility independent of ii-for all dimensions, then Equation 4.2 holds. Yet in Equation 4.2 seven scaling constants must be evaluated, many of which represent the affect of "cross-product" terms between (44) dimensions. Again, Keeney has shown that under an additional con- dition these "cross-product" terms drop out. Definition 4.10: Preferential Independence Let x23-- X1 x...x 1_1 x X1+1 x...x Xj-l x Xj+1 x...x XN’ and let 12?]- be a member of .1213. Then 3E1 x i1 is preferentially independent of iii-if ones preference order over lotteries on i; is with iitheld fixed does not depend on the fixed amount of "Tj" Theorem 4.3 If ii is utility independent of ii-for one i, and ii x‘is is preferentially independent of iiE-for all j I i, then _ N u(x) I X 1.1 k1u1(x1) (additive), or (4.3) N l + ku(x) I w {1 + kk u (x )} (multiplicative), (4.4) i_1 i i i where k and the k1 are scaling constants, 0 < ki < l and k > -1. N N In Equation 4.3 X k I 1, and in Equation 4.2 if X k > 1 then i i iIl iIl N -l < k < 0 and if 2 k < 1 then R > 0. 1-1 1 Equation 4.3 represents an additive utility function since the X 39 utility equals a weighted sum of marginal utilities, while Equation 4.4 depicts a multiplicative utility function because the utility equals a weighted product of marginal utilities. To distinguish between these (45) two equations Keeney offers the following corollary: Corollary 4.1 - 1 - 2 - l - 2 -— - Let x1 , xi , xj , and xj be distinct values of X1 and X1, and let §- take on some constant value. Define Gamble A as 11 1, 2 1, §-—) or (£12, E 2 §—-) each with probability 0.5, 13 J ’ 13 - l - 2 - - 2 - l - and Gamble B as (xi , x , x-—) or (x1 , xj , x13) each with 11 probability 0.5. If one is indifferent to Gamble A and B, (:21 then Equation 4.3 holds (additive). For use of either Equation 4.3 or 4.4 the consequence set must satisfy the preferential independence assumption in addition to the util- ity independence assumption of Theorem 4.2. The preferential indepen- dence assumption states that the indifference curves between two dimen- sions do not vary as the other dimensions change. With indifference curves<46> a contour runs through a reference point which indicates the boundary between desired and undesired attribute pairs as in Figure 4.1. As the fixed amount E23 changes the other attributes change, but the indifference curve remains constant. Hence, both conditions of Theorem 4.3 hold and the utility function adheres to the form of Equation 4.3 or 4.4. o - 1 - 2 - l - 2 In Corollary 4.1 assume x1 > x1 and xj > x.j , then Gamble A is between a preferred consequence (E11, £11, £339 and a non-preferred consequence (E12, E12, Ei3) which both occur with probability 0.5. Sim- ilarly, Gamble B is between two consequences that both contain a prefer- red and non-preferred attribute and which occur with probability 0.5. 
40 .mo>wuoauouam swumov Hmaoaumuanaoo you mo>uao mucoummmavoH H.¢ ouowfim .coamaooun new mawulaoauouoxm Au .umoo was :Ofimaooum An .umoo new maaulsoauoooxm Am Assaulooauaooxmv Aunoov Aumoov H- o a- o ano a o , o \a; x .l H T.a w .o .d m 1 1 r. .l N n T.~ n m I. I. . m... m. ... 1% ...... m , . .I q |.¢ .L 4‘? «I, .1tw Jily.’ l 'llll’llllll 41 For attributes precision, execution-time, and cost neither Gamble A nor Gamble B represent a preferred wager and, consequently, Corollary 4.1 specifies the additive utility function of Equation 4.3. Thus, the util- ity function for the CDA's of Chapter III consist of a weighted sum of individual marginal utilities. In this section the use of an additive utility function assumed both utility and preferential independence: two strong assumptions. Fortunately, Dawes, et a1.(47) (48) and Einhorn, et a1. point out that even modest deviations from utility and preferential independence rarely affect the ultimate number u(x) and, even less, the rank ordering of the u(x) values (utilities). They report that with monotonic attributes- where more is preferred to less, or less is preferred to more-the two independence assumptions cause little trouble. 'So for monotonic attri- butes precision, execution-time, and cost, the utility and preferential independence assumptions are plausible and, if not to an approximation, acceptable. ‘4.4 Marginal Utility Functions As with the utility of a consequence, the marginal utility of an attribute indicates its usefulness relative to the other attribute values. Since the precision attribute describes a discrete variable its marginal utility must consist of a discrete function. And for the exe- cution-time and cost attributes, both continuous variables, the marginal utilities require continuous functions. Other qualities desirable of marginal utility functions include a common numerical range; any inter- val will suffice but a convenient choice is 0 §_ui(xi) §_100. Finally, marginal utility functions should preserve the proportional distance I’IIIIIIIII’IIIIII’ 42 between attribute points; e.g., values twice as far apart should denote marginal utilities twice as far apart. Using the previous ideas Edwards<49> has proposed the follow- ing procedure for determining marginal utility functions. First, the most and least desirable values and attribute are found and, then, a simple straight line is defined which yields 100 for the most desirable attribute and 0 for the least desirable attribute. This procedure yields the marginal utility functions for the attributes precision, execution- time, and cost. Precision Let p denote the precision where p I l for single-precision, p I 2 for double-precision, etc; u1 I (p - 1) 33.3; 1.5.p.§.4. (4.5) Execution-time Let T signify theiexecution-time, and T and T the maxi- max ‘min mmm.and minimum values, respectively; (T - T) max u2 I 100. (4.6) (Tmax - 1min) Cost Let C indicate the cost, and C and C the maximum and --- max min ‘minimum values, respectively; (me - C) u - 100. (4.7) Hence, for any set of CDA consequences Equations 4.5 through 4.7 convert the attribute values to marginal utilities, and Equation 4.3 forms the utility of each consequence as a weighted sum of its marginal utilities. 43 So the best member, or members, of the consequence set consist of those alternatives with the largest utility. 
Not all attributes possess linear marginal utility functions as suggested in this section, and Raiffa(50) presents several techniques to assess nonlinear marginal utility functions. But Edwards(51) shows that for monotonic attributes the straight-line procedure produces a close, first-order approximation to the nonlinear approach, with sample correlation coefficient 0.99. Since precision, execution-time, and cost all represent monotonic attributes, the linear approximation in this section is credible and acceptable.

4.5 Summary

In this chapter, Multiattribute Utility Theory (MUT), adapted to selecting the "best" CDA, provides a decision mechanism which uses the attributes precision, execution-time, and cost. Here, MUT assigns a numeric quantity to each CDA which indicates its usefulness with respect to the other alternatives in the consequence set. Since the above attributes satisfy Axioms 4.1 through 4.8, Theorem 4.1 guarantees that utilities exist for all consequences. Furthermore, because the attributes also fulfill the utility and preferential independence assumptions of Theorem 4.3, the additive utility function holds, which forms the utility of a CDA as a weighted sum of its marginal utilities (Eqs. 4.5 through 4.7 convert attributes to marginal utilities). So the "best" alternative consists of the CDA with the largest utility. In the next chapter, three examples illustrate the procedure for obtaining marginal utilities and, moreover, the use of this additive utility function.

CHAPTER V

APPLICATIONS

In this chapter, three examples illustrate the results of the previous two chapters; they clarify the techniques used to determine attribute values, and for typical consequence sets they elucidate use of an additive utility function. These examples--linear regression, matrix inversion, and fast Fourier transform computations--exemplify the overall procedure of this investigation and, moreover, they lead to conclusions which characterize each CDA. Not only do these applications typify representative and contemporary engineering problems, but they vary widely in their mathematical sophistication. First, a discussion of the general approach follows below.

5.1 General Approach

The general approach to using MUT applied to selecting CDA's involves five principal steps (see Fig. 5.1). First, the application must be defined; this includes stating all assumptions and conditions under which the computation remains valid. Then, the computation is expressed using detailed and complete notation. Since some parameters may vary, their potential effect upon the computation must be considered. Next, the most difficult step in Figure 5.1 concerns determining the attribute values; i.e., precision, execution-time, and cost. To accomplish this, begin by preparing a program flowchart which implements the computation using a series approach. Then, by examining it for sections which may be executed simultaneously, or in parallel, construct a second program flowchart which implements a parallel approach (Master and Slaves). All but the muP CDA use the series approach to assist in finding attribute values; the muP CDA uses the parallel approach. From these flowcharts the total number of elementary operations--e.g., adds/subtracts and multiplies/divides, or the constants A_i and M_i--can be estimated for each CDA.

[Figure 5.1 General approach to selecting a computational design alternative: ENTER -> Define the Application -> Determine Attribute Values -> Find the Marginal Utilities -> Form the Utilities -> Select the Alternative -> EXIT.]
By multiplying these terms by the multiple-precision add and multiply times (see Eq. 3.1 and Table 3.1), a_i and m_i, the execution-times are found as follows:

    T_i = A_i a_i + M_i m_i.

Again, the two flowcharts can be used to estimate both the program and data-storage memory requirements, P_i and D_i respectively, for each CDA. They yield the application cost term (see Eq. 3.12), C_ai, as follows:

    C_ai = w_0 (P_i + p D_i),

and together with the elemental CDA cost term (see Eq. 3.1 and Table 3.2), C_ei, give the total cost of each alternative:

    C_i = C_ei + C_ai.

To determine program-storage memory requirements, consider the contributions of an arbitrary loop and subroutine call:

    loop:
        LDR COUNT        /REG = COUNT
        NEG              /REG = -COUNT
    LOOP    ---
        INC              /REG = REG + 1
        JNZ LOOP         /REG = 0?  NO: JMP LOOP

    subroutine call:
        LD1 AD1          /OPERAND 1 ADR
        LD2 AD2          /OPERAND 2 ADR
        LD3 AD3          /RESULT ADR
        CALL SUB

Thus, for estimating P_i, each loop and each subroutine call contribute four words.

In the third step of Figure 5.1 the attribute values are converted to marginal utilities as explained in Chapter IV. Precision, a discrete variable, obeys a discrete function, while execution-time and cost, both continuous variables, follow a linear transformation. And for all attributes the marginal utilities lie within the same interval; i.e., [0, 100]. The next step invokes an additive utility function (shown to be valid in Chapter IV) upon the marginal utilities; it forms the utility of each CDA as a weighted sum of its marginal utilities. Hence, prior to the use of this function a "Decision Maker" must assess the value of the weights, or k_i's, subject to the condition that Σ_{i=1}^{N} k_i = 1. Finally, the last step of Figure 5.1 concerns selecting the CDA with the largest total utility. Or, if more than one alternative is to be selected, then choose those with the greatest utility. By varying the consequence set, some alternatives may result in equal utilities, which indicates the "break-even" point between those CDA's. Also, differing values of the k_i's produce new utility values which illustrate the "sensitivity" of a CDA to each attribute dimension; the utility of a CDA strong in the dimension of the emphasized k_i increases, while the others decrease.
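Taken together, the attribute-value step reduces to two small formulas. The Python sketch below restates them, assuming the Chapter III cost model (a and m are the p-precision add and multiply times, w0 the cost per memory word, C_e the elemental CDA cost); the numeric inputs are placeholders, not figures from the text:

    def execution_time(A, M, a, m):
        # T_i = A_i*a_i + M_i*m_i
        return A * a + M * m

    def total_cost(C_e, P, D, p, w0):
        # C_i = C_ei + C_ai, with C_ai = w0*(P_i + p*D_i)
        return C_e + w0 * (P + p * D)

    a = 100e-6                      # hypothetical add time, seconds
    m = 8 * a                       # multiply built from repetitive adds
    T = execution_time(A=403, M=207, a=a, m=m)
    C = total_cost(C_e=150.0, P=80, D=26, p=2, w0=0.05)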
5.2 Linear Regression Example

In this example a set of N ordered pairs {(x_i, y_i) | i = 1, 2, 3, ..., N} is fit by the method of least-squares(52) to a straight line y = b_0 + b_1 x. The solutions to the normal equations

    N b_0 + b_1 Σx_i = Σy_i,
    b_0 Σx_i + b_1 Σx_i^2 = Σx_i y_i,

yield the unknown constants b_0 and b_1; i.e.,

    b_1 = [N Σx_i y_i - (Σx_i)(Σy_i)] / [N Σx_i^2 - (Σx_i)^2],

    b_0 = Σy_i / N - b_1 Σx_i / N.

For this example it is assumed the x's are fixed variables and the y's independent random variables having normal distributions with common variance σ^2.
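Because the computation is short, it can be stated directly. The Python sketch below (added for illustration; the function name is an assumption) accumulates the four sums that the series flowchart computes and then applies the closed-form solutions above:

    def fit_line(xs, ys):
        # Accumulate A = sum(x), B = sum(x^2), C = sum(y), D = sum(x*y),
        # the quantities of Steps 1-4 of the series flowchart (Fig. 5.2).
        N = len(xs)
        A = sum(xs)
        B = sum(x * x for x in xs)
        C = sum(ys)
        D = sum(x * y for x, y in zip(xs, ys))
        b1 = (N * D - A * C) / (N * B - A * A)
        b0 = (C - b1 * A) / N
        return b0, b1

    print(fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))  # (1.0, 2.0)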
The series program flowchart contains six principal steps and no major loops (see Fig. 5.2). To estimate the total number of adds and multiplies, A_i and M_i, the contributions of each step can be summed as shown in Table 5.1.

[Figure 5.2 Linear regression series flowchart: Step-1 A = Σx_i; Step-2 B = Σx_i^2; Step-3 C = Σy_i; Step-4 D = Σx_i y_i; Step-5 b_1 = (N·D - A·C)/(N·B - A·A); Step-6 b_0 = (C - b_1·A)/N.]

Table 5.1 Estimates of A_i, M_i (i ≠ 4) for Example-1.

    Step        Adds            Multiplies
    1           N               --
    2           N               N
    3           N               --
    4           N               N
    5           2               5
    6           1               2
    Totals      A_i = 4N + 3    M_i = 2N + 7

These terms, when multiplied by the multiple-precision add and multiply times, combine to give the series execution-time

    T_s = T_i = (4N + 3) a_i + (2N + 7) m_i;        i ≠ 4.        (5.1)

By examining Figure 5.2 it is possible to identify those sections of the series flowchart which can be executed simultaneously, or in parallel. Specifically, four Slaves can perform Steps 1 to 4 once they receive the correct x's and/or y's. The responsibility of the Master then becomes data routing, not data computing, as shown in Figures 5.3 and 5.4. Again, the total number of adds and multiplies can be found by summing the contribution of each step in the parallel flowchart as detailed in Table 5.2.

Table 5.2 Estimates of A_4, M_4 for Example-1.

    Step        Adds            Multiplies
    1           3N              --
    2           2N              --
    3           N               N
    4           4               --
    5           2               5
    6           1               2
    Totals      A_4 = 6N + 7    M_4 = N + 7

By multiplication of these terms by the multiple-precision add and multiply times, the parallel execution-time is found as

    T_p = T_4 = (6N + 7) a_4 + (N + 7) m_4.        (5.2)

(Note: the "SLAVES BUSY" term can be deduced from Figure 5.5.)

[Figure 5.3 Relation between Master and Slaves for Example-1: the Master routes data to Slave-1 through Slave-4, waits while the Slaves compute, then reads back and combines the partial results.]

[Figure 5.4 Linear regression parallel flowchart: Master.]

[Figure 5.5 Linear regression parallel flowchart: Slave-2.]

An important result occurs by defining R, the ratio of series to parallel execution-time, as

    R = T_s / T_p,

and assuming(53) m_i = 8 a_i. From Equations 5.1 and 5.2,

    R = [(4N + 3) a_i + (2N + 7)(8 a_i)] / [(6N + 7) a_i + (N + 7)(8 a_i)]
      = (20N + 59) / (14N + 63),

and as N grows large,

    lim_{N→∞} R = 20/14 ≈ 1.43;

this represents a maximum decrease in execution-time by a factor of 1.43 due to simultaneous, or parallel, execution. Although the Slaves execute in parallel, the Master's overhead involves vast data routing and, thus, a limit to the reduction in execution-time. If the sampled data originated with the Slaves, then R would increase.

When determining the application memory cost, C_ai, all subroutine calls (adds/subtracts and multiplies/divides) require four words of program storage and each loop demands an additional four words of program storage. As with execution-time, the series and parallel program flowcharts (Figs. 5.2, 5.4) assist in estimating the application memory cost. Similarly, the contributions of each flowchart step add together to yield the final result. Table 5.3 outlines the steps and their contributions for the series flowchart; these terms combine as follows:

    C_ai = w_0 [80 + p(2N + 6)];        i ≠ 4.        (5.3)

Table 5.3 Estimates of P_i, D_i (i ≠ 4) for Example-1.

    Step        Program     Data
    1           8           N + 1
    2           12          1
    3           8           N + 1
    4           12          1
    5           28          1
    6           12          1
    Totals      P_i = 80    D_i = 2N + 6

But for the muP CDA (i = 4) the Slaves create additional terms of program and data memory storage. Figure 5.5 depicts a worst-case Slave flowchart, Slave-2, which illustrates the additional contributions due to the four Slaves. Adding these terms to the Master's terms gives (see Table 5.4) the application memory cost C_a4; i.e.,

    C_a4 = w_0 [112 + p(7N + 6)].        (5.4)

Table 5.4 Estimates of P_4, D_4 for Example-1.

    Step        Program     Data
    1           16          N
    2           12          N
    3           --          --
    4           4           4
    5           28          1
    6           12          1
    Slaves (4)  40          5N
    Totals      P_4 = 112   D_4 = 7N + 6

Combined with the results of Chapter III, Equations 5.1 through 5.4 produce the set of attribute values for this example (see Figs. 5.6 through 5.13). Similarly, the techniques of Chapter IV convert these values to marginal utilities and, finally, to utilities as shown in Tables 5.5 and 5.6. From inspection of the figures and tables, several conclusions become clear.

[Figures 5.6 through 5.9 Single-, double-, triple-, and quadruple-precision execution-time versus sample size for Example-1 (log-log plots; X = DIRECT, Δ = CALC, O = AU, □ = muP).]

[Figures 5.10 through 5.13 Single-, double-, triple-, and quadruple-precision costs versus sample size for Example-1 (same legend).]

[Table 5.5 Typical consequences and utilities for Example-1 (10^1 samples). Note: k̄_a = (0.3, 0.3, 0.3), k̄_1 = (0.8, 0.1, 0.1), k̄_2 = (0.1, 0.8, 0.1), k̄_3 = (0.1, 0.1, 0.8).]

[Table 5.6 Typical consequences and utilities for Example-1 (10^5 samples). Same weight vectors as Table 5.5.]

First, the AU CDA always possesses the fastest execution-time due to its hardware multiply and divide. For all cases, the muP CDA shows reduced execution-time when compared to the DIRECT CDA, even though both perform multiplies and divides in software. Here, R ≈ 1.5, which agrees well with the calculated value and, thus, confirms the validity of the m_i = 8 a_i assumption. Also, the DIRECT/muP CDA's for p = 1, 2 execute faster than the CALC, yet for p = 4 this reverses because of the repetitive adds associated with the software multiplies and divides. All costs rise with larger N due to additional data-storage memory, especially after 10^3 samples, but the muP costs go up the most (more memory needed).
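As a quick numeric check (an added sketch, not from the original), the ratio follows directly from Equations 5.1 and 5.2 once a_i cancels:

    def R_regression(N):
        # R = (20N + 59)/(14N + 63) under the m_i = 8*a_i assumption.
        return (20 * N + 59) / (14 * N + 63)

    print(R_regression(10))     # ~1.28 at N = 10 samples
    print(20 / 14)              # asymptotic limit, ~1.43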
From Table 5.5, for weight vector k̄_a, the muP CDA represents the "best" CDA for the typical consequence set and small problem dimensions (i.e., 10^1 samples). Similarly, Table 5.6 reveals that the AU CDA depicts the "best" CDA for large problem dimensions (i.e., 10^5 samples). As the k_i's vary, the CDA's with strong marginal utilities in the dimension of the emphasized k_i show increased utility, and the rank ordering of the CDA's changes accordingly. For example, in Table 5.5 the CALC CDA exhibits the greatest "sensitivity" to the precision dimension: as the weight vector changes from k̄_a to k̄_1, its utility varies from 40 to 82. Also, the DIRECT and muP CDA's in Table 5.6 correspond to "break-even" alternatives for one of the weight vectors.

5.3 Matrix Inversion Example

For the second application a square, nonsingular matrix A of order N is inverted. A standard approach, that of Gaussian elimination(54), is used; this technique performs a sequence of elementary row operations on the partitioned matrix [A, I] to yield the matrix [I, A^-1], where I is the identity matrix.

The series flowchart for this technique includes two principal steps imbedded within two major loops, as shown in Figure 5.14.

[Figure 5.14 Matrix inversion series flowchart (Gaussian elimination technique).(55)]

Here, Step-1 replaces row I with itself divided by the pivot entry a(I, I); hence, the pivot entry becomes unity. Once completed, Step-2 uses row I to reduce the remainder of column I to zeros. Together with the two loops, Steps 1 and 2 transform the left submatrix to the identity matrix and the right submatrix to the inverse of A; i.e., [I, A^-1].
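The row-operation sequence is compact enough to show directly. Below is a minimal Python sketch of the technique as described (added for illustration; it assumes a well-conditioned, nonsingular A and performs no pivot exchanges):

    def invert(A):
        N = len(A)
        # Build the partitioned matrix [A, I] of size N x 2N.
        M = [row[:] + [1.0 if i == j else 0.0 for j in range(N)]
             for i, row in enumerate(A)]
        for i in range(N):
            # Step-1: divide row i by the pivot entry a(I, I) so the pivot is 1.
            pivot = M[i][i]
            M[i] = [v / pivot for v in M[i]]
            # Step-2: use row i to reduce the remainder of column i to zeros
            # (these are the rows the muP CDA hands out, one per Slave).
            for k in range(N):
                if k != i:
                    factor = M[k][i]
                    M[k] = [vk - factor * vi for vk, vi in zip(M[k], M[i])]
        return [row[N:] for row in M]       # right submatrix is A^-1

    print(invert([[4.0, 7.0], [2.0, 6.0]]))  # ~[[0.6, -0.7], [-0.2, 0.4]]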
To estimate the total number of adds and multiplies, A_i and M_i, sum up the contribution of each step and multiply this by the number of times through each loop (see Table 5.7).

Table 5.7 Estimates of A_i, M_i (i ≠ 4) for Example-2.

    Step        Adds                Multiplies
    1           --                  2N^2
    2           2N^2(N - 1)         2N^2(N - 1)
    Totals      A_i = 2N^2(N - 1)   M_i = 2N^3

These terms, when multiplied by the multiple-precision add and multiply times, combine to give the series execution-time

    T_s = T_i = 2N^2(N - 1) a_i + 2N^3 m_i;        i ≠ 4.        (5.5)

Equation 5.5 agrees with published results.(56)

By examining Figure 5.14 it is possible to identify those sections of the series flowchart that can be executed simultaneously, or in parallel. Since Step-2 only requires the results of Step-1, N Slaves can perform Step-2 simultaneously and the Master flowchart now involves just one loop (see Figs. 5.15 and 5.16). Again, to find the total number of adds and multiplies, sum the contribution of each step in the parallel flowchart as detailed in Table 5.8. (Note: for an estimate of the "SLAVES BUSY" term use Figure 5.17.)

Table 5.8 Estimates of A_4, M_4 for Example-2.

    Step        Adds                Multiplies
    1           2N^2                --
    2           2N^2(N - 1)         --
    3           2N^2                2N^2
    Totals      A_4 = 2N^2(N + 1)   M_4 = 2N^2

Multiplying these results by the multiple-precision add and multiply times yields the parallel execution-time

    T_p = T_4 = 2N^2(N + 1) a_4 + 2N^2 m_4.        (5.6)

[Figure 5.15 Relation between Master and Slaves for Example-2: for each row the Master normalizes the pivot row, routes it to Slave-1 through Slave-N, then waits while the Slaves reduce their rows.]

[Figure 5.16 Matrix inversion parallel flowchart: Master.]

[Figure 5.17 Matrix inversion parallel flowchart: Slave-J.]

Again, the ratio of series to parallel execution-times produces an indication of the decrease in execution-time due to simultaneous, or parallel, execution. Using Equations 5.5 and 5.6 with the assumption m_i = 8 a_i,

    R = T_s / T_p = [2N^2(N - 1) a_i + 2N^3 (8 a_i)] / [2N^2(N + 1) a_i + 2N^2 (8 a_i)]
      = (18N - 2) / (2N + 18).

And as N grows large,

    lim_{N→∞} R = 18/2 = 9;

the execution-time decreases, at best, by a factor of 9. Since this problem yields a greater R than Example-1, it exhibits greater parallel traits and, thus, better fits simultaneous execution concepts. Again, the Master's overhead must pass a new row to (N - 1) Slaves for each column and, eventually, this action becomes more time consuming than the parallel execution of the Slaves.

In determining the application memory cost, C_ai, all subroutine calls (add/subtract and multiply/divide) require four words of program storage and each loop demands an additional four words of program storage. By using the flowcharts of Figures 5.14 through 5.17, the estimates for the program and data storage can be determined by summing the contributions of each step. Table 5.9 delineates the steps and their contributions for the series flowchart; combining terms,

    C_ai = w_0 (28 + p 2N^2);        i ≠ 4.        (5.7)

Table 5.9 Estimates of P_i, D_i (i ≠ 4) for Example-2.

    Step        Program     Data
    1           12          2N^2
    2           16          --
    Totals      P_i = 28    D_i = 2N^2

But for the muP CDA (i = 4) the N Slaves create additional program and data storage requirements. When added to the Master's terms (see Table 5.10), they sum to give the muP application memory cost

    C_a4 = w_0 [4(3N + 5) + p 2N(N + 1)].        (5.8)

Table 5.10 Estimates of P_4, D_4 for Example-2.

    Step        Program     Data
    1           8           2N
    2           12          --
    3           --          --
    Slaves (N)  12N         2N^2
    Totals      P_4 = 4(3N + 5)    D_4 = 2N(N + 1)

Finally, the results of Chapter III and Equations 5.5 through 5.8 combine to produce the set of attribute values for this example (see Figs. 5.18 through 5.25). Additionally, the techniques of Chapter IV convert these values to marginal utilities and, lastly, to utilities as shown in Tables 5.11 and 5.12. Again, these figures and tables expose several important CDA features.
[Figures 5.18 through 5.21 Single-, double-, triple-, and quadruple-precision execution-time versus matrix order for Example-2 (X = DIRECT, Δ = CALC, O = AU, □ = muP).]

[Figures 5.22 through 5.25 Single-, double-, triple-, and quadruple-precision costs versus matrix order for Example-2 (same legend).]

[Table 5.11 Typical consequences and utilities for Example-2 (small matrix order). Note: k̄_a = (0.3, 0.3, 0.3), k̄_1 = (0.8, 0.1, 0.1), k̄_2 = (0.1, 0.8, 0.1), k̄_3 = (0.1, 0.1, 0.8).]

[Table 5.12 Typical consequences and utilities for Example-2 (large matrix order). Same weight vectors as Table 5.11.]

The AU CDA continues to perform the task fastest, while the muP CDA always beats the DIRECT CDA. But as N increases the last two execution-times separate, and for even small N, R ≈ 5 already. Similarly, for p = 1, 2 the CALC CDA shows the slowest speed, and for p = 4 the order changes; i.e., the CALC CDA finishes before either the DIRECT or muP CDA's. With respect to costs, they do not change as dramatically as in Example-1 because the program and data-storage memory requirements do not rise as sharply. Still, the muP costs grow exceedingly fast as N changes due to the increased number of Slaves, as reflected in the elemental cost term C_e4.

For both typical consequence sets and weight vector k̄_a, the tables reveal that the muP CDA represents the overall "best" CDA. In Table 5.11 one weight vector produces "break-even" utilities for the DIRECT and muP CDA's; i.e., 87 and 86, respectively. Thus, these CDA's correspond to equally useful alternatives. And finally, Table 5.12 for weight vector k̄_2 shows the "sensitivity" of the AU CDA to emphasis in execution-time; its utility increases from 52 (with k̄_a) to 86, or by 34.

5.4 Fast Fourier Transform Example

In this final example an input sequence x(n) is transformed from the discrete time domain to the discrete frequency domain X(m) using the fast Fourier transform (FFT) algorithm(57) of the discrete Fourier transform (DFT). If it is assumed that the input sequence is real and even, then the DFT yields

    X(m) = Σ_{n=0}^{N-1} x(n) cos(2πmn/N)        (5.9)

         = Σ_{n=0}^{N-1} x(n) W^{mn},  where        (5.10)

    N = number of samples,
    x(n) = input sequence, and
    X(m) = transform sequence.

Using matrix notation, Equation 5.10 becomes X̄ = W̄ x̄; but if N = 2^L, then W̄ can be factored into L terms,

    X̄ = ( Π_{i=1}^{L} W̄_i ) x̄.        (5.11)

This feature greatly reduces the execution-time of the DFT by minimizing the number of multiplies. Figure 5.26 illustrates the series program flowchart.

The series flowchart contains three principal steps within three basic loops. Much of the algorithm deals with altering loop control variables, which do not demand a large part of the total execution-time. Additionally, the process defined in Step-1 must find the weight coefficient W^IE, which involves a cosine function as in Equation 5.9. A Maclaurin series expansion of the cosine yields

    cos x = 1 - x^2/2! + x^4/4! - x^6/6! + R_n,        (5.12)

    R_n = remainder.

Thus, Step-1 can use this expansion to determine the weight coefficient W^IE.
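The expansion is easy to mechanize because consecutive terms differ by a simple ratio. The Python sketch below is an added illustration of Step-1's weight computation; the term count is a hypothetical choice, adequate for the small arguments used here:

    import math

    def cos_series(x, n_terms=8):
        # Eq. 5.12: 1 - x^2/2! + x^4/4! - ...; successive terms differ by
        # the factor -x^2/((2k + 1)(2k + 2)), so no factorials are stored.
        total, term = 0.0, 1.0
        for k in range(n_terms):
            total += term
            term *= -x * x / ((2 * k + 1) * (2 * k + 2))
        return total

    m, n, N = 1, 2, 16
    w = cos_series(2 * math.pi * m * n / N)        # weight W^{mn} of Eq. 5.10
    print(w, math.cos(2 * math.pi * m * n / N))    # ~0.7071 for both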
[Figure 5.26 FFT series flowchart (Cooley-Tukey technique).(58)]

To complete the matrix multiplication of Equation 5.11, Steps 2 and 3 perform the required multiplies and adds. As with the other examples, the contributions of each step sum to give the total number of adds and multiplies (terms A_i and M_i). But since the CALC CDA contains the cosine function in hardware, the contribution of Step-1 (or Eq. 5.12) reduces to a simpler form. Table 5.13 outlines the steps and their contributions for A_i, M_i, i = 1, 2.

Table 5.13 Estimates of A_i, M_i (i = 1, 2) for Example-3.

    Step        Adds                            Multiplies
    1           6(N - 1)                        18(N - 1)
    2           --                              N log_2 N
    3           2N log_2 N                      --
    Totals      A_i = 2N log_2 N + 6(N - 1)     M_i = N log_2 N + 18(N - 1)

When multiplied by the multiple-precision add and multiply times, these terms produce

    T_s = T_i = [2N log_2 N + 6(N - 1)] a_i + [N log_2 N + 18(N - 1)] m_i;   i = 1, 2.   (5.14)

Equation 5.14 agrees with published results.(59) Similarly, Table 5.14 depicts the estimates of A_3 and M_3, which give

    T_s = T_3 = [2N log_2 N + 27.2(N - 1)] a_3 + [N log_2 N] m_3.        (5.15)

Table 5.14 Estimates of A_3, M_3 for Example-3.

    Step        Adds                            Multiplies
    1           27.2(N - 1)                     --
    2           --                              N log_2 N
    3           2N log_2 N                      --
    Totals      A_3 = 2N log_2 N + 27.2(N - 1)  M_3 = N log_2 N

By examining Figure 5.26 it is possible to determine those sections of the series flowchart that can be executed simultaneously, or in parallel. Since Steps 2 and 3 only require the weight W^IE, N Slaves can perform these tasks and the Master flowchart now contains far fewer multiplies (see Figs. 5.27 and 5.28). Again, to find the total number of adds and multiplies, sum the contribution of each step in the parallel flowchart as detailed in Table 5.15. (Note: for an estimate of the "SLAVES BUSY" term use Fig. 5.29.)

Table 5.15 Estimates of A_4, M_4 for Example-3.

    Step        Adds                                Multiplies
    1           6(N - 1)                            18(N - 1)
    2           2N log_2 N                          --
    3           4N log_2 N                          --
    4           log_2 N                             log_2 N
    Totals      A_4 = (6N + 1) log_2 N + 6(N - 1)   M_4 = log_2 N + 18(N - 1)

Multiplying these results by the multiple-precision add and multiply times yields the parallel execution-time

    T_p = T_4 = [(6N + 1) log_2 N + 6(N - 1)] a_4 + [log_2 N + 18(N - 1)] m_4.   (5.16)

[Figure 5.27 Relation between Master and Slaves for Example-3: the Master computes each weight, routes it to the Slaves, then waits while the Slaves perform the multiplies and adds.]

[Figure 5.28 FFT parallel flowchart: Master.]

[Figure 5.29 FFT parallel flowchart: Slave-M.]

When the ratio of series to parallel execution-times is determined, it gives an indication of the decrease in execution-time due to simultaneous execution. Using Equations 5.14 and 5.16 with the assumption m_i = 8 a_i,

    R = {[2N log_2 N + 6(N - 1)] a_i + [N log_2 N + 18(N - 1)](8 a_i)}
        / {[(6N + 1) log_2 N + 6(N - 1)] a_i + [log_2 N + 18(N - 1)](8 a_i)}

      = [10N log_2 N + 150(N - 1)] / [(6N + 9) log_2 N + 150(N - 1)].

For large N,

    lim_{N→∞} R = 10/6 ≈ 1.66;

the execution-time decreases, at best, by a factor of 1.66. Like Example-1, parallel execution does not reduce the execution-time considerably because the Master's overhead demands much time. But if it did not calculate weight coefficients, i.e., if the number of samples were known a priori, then the Master's overhead would lessen for a larger R.
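As an added numeric sketch (not from the original), the ratio can be evaluated directly from Equations 5.14 and 5.16; note that the 150(N - 1) overhead terms shrink relative to N log_2 N only as the logarithm grows, so R approaches its 10/6 bound quite slowly:

    from math import log2

    def R_fft(N):
        # Series and parallel operation counts under m = 8a; 'a' cancels.
        series = 2 * N * log2(N) + 6 * (N - 1) + 8 * (N * log2(N) + 18 * (N - 1))
        parallel = ((6 * N + 1) * log2(N) + 6 * (N - 1)
                    + 8 * (log2(N) + 18 * (N - 1)))
        return series / parallel

    for N in (2**3, 2**7, 2**20):
        print(N, round(R_fft(N), 2))   # climbs slowly toward 10/6 ~ 1.66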
When determining the application memory cost, C_ai, all subroutine calls (add/subtract and multiply/divide) require four words of program storage and each loop demands an additional four words of program storage. As with execution-time, the program flowcharts (Figs. 5.26 through 5.29) assist in estimating the application memory cost. Similarly, the contributions of each flowchart step add together to yield the final result. Table 5.16 outlines the steps and their contributions for the series flowchart; these terms combine as follows:

    C_ai = w_0 [44 + p(N + 8)];        i ≠ 4.        (5.17)

Table 5.16 Estimates of P_i, D_i (i ≠ 4) for Example-3.

    Step        Program     Data
    1           24          N + 8
    2           12          --
    3           8           --
    Totals      P_i = 44    D_i = N + 8

But for the muP CDA (i = 4) the N Slaves create additional program and data storage requirements. When added to the Master's terms (see Table 5.17), they sum to give the muP application memory cost

    C_a4 = w_0 [(32 + 8N) + p(N + 8)].        (5.18)

Table 5.17 Estimates of P_4, D_4 for Example-3.

    Step        Program     Data
    1           24          8
    2           4           --
    3           4           --
    4           --          --
    Slaves (N)  8N          N
    Totals      P_4 = 8(N + 4)    D_4 = N + 8

Lastly, the results of Chapter III and Equations 5.14 through 5.18 combine to produce the set of attribute values for this example (see Figs. 5.30 through 5.37). Using the techniques of Chapter IV, these values are converted to marginal utilities and, moreover, to utilities as shown in Tables 5.18 and 5.19. As with the other examples, these figures and tables illustrate several points.

Again, the AU CDA displays the fastest speed, and the muP always exhibits quicker execution-time than the DIRECT CDA. And for N = 2^7, R ≈ 1.6, which agrees well with the prediction. Similarly, the CALC CDA lags behind the DIRECT/muP CDA's in speed for p = 1, 2; still, it surpasses them for p = 3, 4. With respect to costs, the muP CDA begins high and for modest N ends up extremely high due to all the additional Slaves (term C_e4). Costs for the DIRECT CDA climb faster than the AU and CALC CDA's because the application memory term (C_a1) begins to dominate the elemental cost term (C_e1). In Table 5.18, for weight vector k̄_a, the DIRECT and CALC CDA's correspond to "break-even" utilities; i.e., 55 and 56, respectively. Also, for the typical consequence set in Table 5.19, the CALC CDA depicts the overall "best" CDA (for k̄_a) by quite a large margin (20 utility values). This results from the large expense of the muP CDA and the large problem dimensions (2^7 samples). Finally, Table 5.18 illustrates the "sensitivity" of the DIRECT CDA to the cost attribute; as the weight vector changes from k̄_a to k̄_3, the DIRECT CDA's utility varies from 55 to 87, or by 32.

[Figures 5.30 through 5.33 Single-, double-, triple-, and quadruple-precision execution-time versus sample size for Example-3 (X = DIRECT, Δ = CALC, O = AU, □ = muP).]
[Figures 5.34 through 5.37 Single-, double-, triple-, and quadruple-precision costs versus sample size for Example-3 (same legend).]

[Table 5.18 Typical consequences and utilities for Example-3 (2^3 samples). Note: k̄_a = (0.3, 0.3, 0.3), k̄_1 = (0.8, 0.1, 0.1), k̄_2 = (0.1, 0.8, 0.1), k̄_3 = (0.1, 0.1, 0.8).]

[Table 5.19 Typical consequences and utilities for Example-3 (2^7 samples). Same weight vectors as Table 5.18.]

5.5 Additional Design Considerations

For the previous three examples, the overall results depend on both the specific software algorithms used and the hardware devices and architectures selected. Because the research project reported here considers only nominal values for the above, the overall results of these examples may shift dramatically with alternate approaches. But design engineers need only combine their specific approaches with the methodology to discover the changes. In this section, an alternative software algorithm illustrates this procedure.

Often, several algorithms may implement a specific task, and each displays widely differing characteristics; e.g., execution-time, memory demand, hardware overhead, etc.(60) Consider the effect of an improved multiply algorithm to reduce execution-time for the DIRECT and muP CDA's (terms m_1 and m_4). Only a reduction in add times or in the number of adds will decrease multiply times, and since a reduced add time simply requires a faster uP, the standard "shift-and-add" algorithm(61) improves the number of adds. For a p-precision number this requires approximately 8p shifts and an average of 4p 2p-precision adds. Using Equation 3.2 as the 2p-precision add time, and assuming that the p-precision shifts require p cycle-times,

    m_1 = 4p(20p t_0) + 8p(p t_0) = 88 p^2 t_0, and        (5.19)

    m_4 = 88 p^2 t_0.        (5.20)

These equations reveal a dramatic improvement in the multiple-precision multiply times for the DIRECT and muP CDA's, as shown in Figure 5.38. Here, the decrease varies with precision; i.e., the greater the precision, the greater the improvement. And by p = 5 the DIRECT CDA actually executes faster than the AU CDA.

To illustrate the overall improvement in execution-time with an application, the above equations used with Equations 5.5 and 5.6 form the series and parallel execution-times for matrix inversion (see Fig. 5.39). This figure demonstrates a decrease in execution-time by roughly three orders of magnitude for triple-precision.
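The shift-and-add scheme itself is only a few lines. Below is an added toy Python sketch of both the algorithm and the timing model of Equations 5.19 and 5.20; the cycle-time t0 is a hypothetical placeholder:

    def shift_add_multiply(a, b, p):
        # Multiply two non-negative p-word (8p-bit) operands: one pass per
        # multiplier bit (~8p shifts); on average half the bits are set,
        # so ~4p adds of the shifted, up-to-2p-precision multiplicand.
        product = 0
        for _ in range(8 * p):
            if b & 1:
                product += a
            a <<= 1
            b >>= 1
        return product

    def improved_multiply_time(p, t0=1e-6):
        # Eqs. 5.19/5.20: 4p adds of 20p*t0 each, plus 8p shifts of p*t0 each,
        # giving 88*p^2*t0.
        return 4 * p * (20 * p * t0) + 8 * p * (p * t0)

    assert shift_add_multiply(123, 45, 1) == 123 * 45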
Finally, the techniques of Chapter IV convert these attribute values to marginal utilities and utilities, as depicted in Table 5.20. Comparison of this with Table 5.11 (the original table) shows a great increase in the DIRECT and muP CDA's marginal utilities due to improved execution-time. Consequently, the marginal utilities of the CALC CDA drop because it now possesses the slowest execution-time. As before, the "best" CDA varies as the weight vector k̄ changes.

[Figure 5.38 Multiple-precision multiply time for various CDA's: X = DIRECT, muP; Δ = CALC; O = AU; * = DIRECT, muP (improved).]

[Figure 5.39 Triple-precision execution-time versus matrix order for Example-2: X = DIRECT; Δ = CALC; O = AU; □ = muP; * = DIRECT (improved); θ = muP (improved).]

[Table 5.20 Typical consequences and utilities for Example-2, improved multiply algorithm. Same weight vectors as Table 5.11.]

5.6 Summary

From careful analysis of the attribute values for the preceding three examples, several trends and conclusions can be deduced. With respect to execution-time, the AU CDA always performs the task the fastest because it contains a hardware multiply and divide. But the results of Chapter III predicted this would occur.

Next, the muP CDA invariably surpasses the DIRECT CDA in execution-time, yet they both accomplish adds and multiplies in software. Although this decrease varies significantly from example to example, it primarily depends on the degree of "parallelism" indigenous to the specific computation and the severity of the Master's overhead.
When the weight vector R varies, the utility of CDA's with strong marginal utilities that correspond to emphasized k dimensions increase, while CDA's with weak marginal utilities decrease. This fol- lows mathematically since the "sensitivity" of the additive utility func- tion to a change in k yields the marginal utility; i.e., i 113 a — a N Wu“) 3 -3—R— .2 k1‘11“? ' 010:1). i 1 i=1 Finally, this chapter explores the overall effect of an alterna- tive multiplication algorithm for the DIRECT and muP CDA‘s. As precision increases, the degree of improvement increases; e.g., for p I l the instruction execution-time magnitudes differ by roughly two orders of magnitude, but for p I 5 it increases to five orders of magnitude. When applied to the matrix inversion example, this algorithm reduces the computation execution-time by three orders of magnitude for triple- precision. With respect to utilities, the improved multiplication instruction increases the utilities of the DIRECT and muP CDA‘s, while it decreases the CALC CDA's utility because of slow speed. So the overall effect involves shifting the DIRECT and muP CDA‘s utilities higher. CHAPTER VI CONCLUSIONS Continued advances in semiconductor device processing technol- ogy led to LSI devices with enhanced performance at reduced cost. Con- sequently, microprOcessor-based (uP-based) systems now function in a myriad of application areas with widely differing computational complex- .ities. Since numerous LSI devices may accomplish these advanced calcula- tions, many designers follow ad hoc approaches to synthesizing computa- tional sections. So the purpose of the research reported here was to in- vestigate and characterize these computational design alternatives (CDA's) and to develop a more rigorous and structured design approach which in- corporates a firm, theoretic foundation. Using the results of this study, designers of uP-based systems will be able to create advanced systems with increased performance. For example, careful use of these CDA characteristics may lead to in- creased executionItime at lower costs. Second, these results will benefit LSI chip designers by influencing what properties the next generation of devices will possess. Because some properties aid computations and others hinder, the new devices will contain the desirable traits and min- imize the action of undesirable ones. Third, the methodology developed in this study assists developers of uP-based systems by providing them the rig- orous theoretic foundation to investigate and characterize future systems as technology advances. Since this development begins at the axiom level, they can easily modify the methodology to suit their specific orientation. 114 115 Within the research reported here, the investigation attained several objectives. First, it surveyed the present semiconductor market for LSI devices which facilitate computations and it generalized their key properties. With these devices, it defined a basic set of elemental CDA's which may be combined to realize the computation section of a gen- eralized uP-based system. To provide for comparison of these members, it determined common attributes and techniques to evaluate them. Next, the investigation developed a decision mechanism for selecting the "best" CDA using these attributes. Finally, through three representative and contemporary engineering examples, it illustrated the use of this deci- sion mechanism; in the process, it characterized the properties of each CDA. 
Finally, this chapter explores the overall effect of an alternative multiplication algorithm for the DIRECT and muP CDA's. As precision increases, the degree of improvement increases; e.g., for p = 1 the instruction execution-time magnitudes differ by roughly two orders of magnitude, but for p = 5 the difference increases to five orders of magnitude. When applied to the matrix inversion example, this algorithm reduces the computation execution-time by three orders of magnitude for triple-precision. With respect to utilities, the improved multiplication instruction increases the utilities of the DIRECT and muP CDA's, while it decreases the CALC CDA's utility because of its now-slowest speed. So the overall effect involves shifting the DIRECT and muP CDA's utilities higher.

CHAPTER VI

CONCLUSIONS

Continued advances in semiconductor device processing technology have led to LSI devices with enhanced performance at reduced cost. Consequently, microprocessor-based (uP-based) systems now function in a myriad of application areas with widely differing computational complexities. Since numerous LSI devices may accomplish these advanced calculations, many designers follow ad hoc approaches to synthesizing computational sections. So the purpose of the research reported here was to investigate and characterize these computational design alternatives (CDA's) and to develop a more rigorous and structured design approach which incorporates a firm, theoretic foundation.

Using the results of this study, designers of uP-based systems will be able to create advanced systems with increased performance. For example, careful use of these CDA characteristics may lead to decreased execution-time at lower cost. Second, these results will benefit LSI chip designers by influencing what properties the next generation of devices will possess. Because some properties aid computations and others hinder them, the new devices can contain the desirable traits and minimize the action of undesirable ones. Third, the methodology developed in this study assists developers of uP-based systems by providing them the rigorous theoretic foundation to investigate and characterize future systems as technology advances. Since this development begins at the axiom level, they can easily modify the methodology to suit their specific orientation.

Within the research reported here, the investigation attained several objectives. First, it surveyed the present semiconductor market for LSI devices which facilitate computations, and it generalized their key properties. With these devices, it defined a basic set of elemental CDA's which may be combined to realize the computational section of a generalized uP-based system. To provide for comparison of these members, it determined common attributes and techniques to evaluate them. Next, the investigation developed a decision mechanism for selecting the "best" CDA using these attributes. Finally, through three representative and contemporary engineering examples, it illustrated the use of this decision mechanism; in the process, it characterized the properties of each CDA.

By satisfying these objectives, the investigation achieved the overall purpose of the research project. During this investigation the research project reached several principal results. First, it defined a basic set of four elemental CDA's: DIRECT, AU, CALC, and muP. The DIRECT CDA consisted of a uP and memory connected via the system bus. This CDA described the simplest technique to execute a calculation because the memory contained all arithmetic subroutines. Next, the second CDA employed a uP, memory, and an arithmetic unit (AU), all joined by the system bus. It differed from the DIRECT CDA in that it performed multiplies and divides with the AU. The third CDA incorporated a uP, memory, and calculator chip to accomplish arithmetic operations. With this CDA, the program must simulate depressing the keys as with a hand-held calculator, and it must read back the answers and decode them when the requested function finishes. Finally, the muP CDA applied the concept of simultaneous, or parallel, execution: here, a Master uP directs several Slave uP's to execute sections of the problem together, then it adds the partial results for the completed answer. Thus, each Slave memory contained a replica of the DIRECT CDA memory, yet the Master's memory held an "overhead" program.

To facilitate comparison of these CDA's, the investigation identified and used three common attributes: precision, execution-time, and cost. Precision involves the quantity of 8-bit words used to represent a number; hence, all numbers corresponded to a multiple-precision, fixed-point, two's complement format. For execution-time, this value gave the number of seconds needed to complete the requested computation. If the number of add and multiply instructions indicates the execution-time, then these values, multiplied by the time to perform each instruction, add to give the execution-time. With cost, this quantity represented the dollar value of the LSI devices used in a CDA. Two terms combined to produce the total cost; one resulted from the basic expense of procuring a CDA, and the other pertained to the additional expense of the application memory. For both execution-time and cost, these values derived from terms indigenous to each CDA and from application-derived terms.

Multiple-precision flowcharts for arithmetic operations assisted in estimating these indigenous terms. The CALC CDA add time exceeded the other CDA's (all the same) by three orders of magnitude; i.e., msecs. compared to usecs. For multiplies the CALC CDA exhibited roughly 1-sec. operation, while the AU CDA functioned in about 1 msec. But the DIRECT and muP CDA's varied considerably with precision, since repetitive adds implemented the multiply; i.e., they ranged from msecs. to 3.5 mins. With respect to indigenous costs, the DIRECT CDA cost roughly one order of magnitude less than the AU and CALC CDA's ($10's compared to $100's), and the muP CDA varied dramatically with the number of Slaves. For more than 12 Slaves, the muP CDA represented the largest indigenous cost.

Next, the investigation considered a viable decision mechanism to select the "best" CDA which used the above attributes: Multiattribute Utility Theory (MUT). Here, MUT assigned a numeric quantity, the utility, to each CDA which indicated its usefulness with respect to the other alternatives.
Because the attributes satisfied eight axioms, plus the utility and preferential independence definitions, the decision mechanism reduced to an additive utility function; i.e.,

    u(x̄) = Σ_{i=1}^{3} k_i u_i(x_i) = utility,        (6.1)

    x̄ = (x_1, x_2, x_3) = consequence,
    x_1 = precision, x_2 = execution-time, x_3 = cost,
    u_i(x_i) = marginal utility function, and
    k_i = scaling constants, where Σ_{i=1}^{3} k_i = 1.

Here, marginal utility functions converted the specific attribute value of a dimension to a numeric quantity which represented its usefulness. So the utility of a CDA involved a weighted sum of its marginal utilities, and the "best" CDA consisted of the alternative with the greatest numeric utility.

By analyzing three examples, the investigation clarified the techniques used to evaluate attributes, and it elucidated use of the additive utility function. These applications--linear regression, matrix inversion, and fast Fourier transform computations--exemplified the overall procedure of this research project and, moreover, led to conclusions that characterized each CDA.

For the linear regression example, a typical consequence set included the following marginal utilities: DIRECT = (60, 00, 100), AU = (00, 100, 53), CALC = (100, 19, 00), and muP = (33, 97, 62). With the weight vector k̄ = (0.3, 0.3, 0.3), Equation 6.1 formed the utility of each CDA as the average of its marginal utilities; i.e., DIRECT = 55, AU = 51, CALC = 40, and muP = 64. Since the muP CDA possessed the largest utility, it represented the "best" CDA in the above consequence set. As the values in the weight vector changed, the utility of CDA's with strong marginal utilities corresponding to the emphasized k̄ dimensions increased, while the others decreased, and the "best" CDA varied accordingly. So determining the weight vector k̄ involves important evaluation. Similarly, the other examples strengthened these results.
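Restated as an added Python sketch (with the three weights normalized to sum to 1, as Equation 6.1 requires; the marginal utilities are the Example-1 values quoted above):

    consequences = {
        "DIRECT": (60, 0, 100),
        "AU":     (0, 100, 53),
        "CALC":   (100, 19, 0),
        "muP":    (33, 97, 62),
    }
    k = (1/3, 1/3, 1/3)                   # equal weights, summing to 1
    utility = {name: sum(ki * ui for ki, ui in zip(k, mu))
               for name, mu in consequences.items()}
    print(max(utility, key=utility.get))  # 'muP', matching the selection above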
In characterizing the CDA's, the three examples revealed that the AU CDA always exhibited the fastest execution-time because of its hardware multiply and divide instructions. For the DIRECT and muP CDA's, which both perform adds and multiplies in software, the muP's speed invariably surpassed the DIRECT CDA's due to its simultaneous, or parallel, execution. But this increase varied significantly from example to example, and it primarily depended on the degree of "parallelism" found in the specific example and the severity of the Master's overhead. With respect to precision, the CALC CDA represented the slowest CDA for single-precision, yet as precision increased it became faster than the DIRECT and muP CDA's while still slower than the AU CDA. Since the CALC CDA always finds the answer to a fixed number of digits, greater precision only requires extra digit entries, and this contributes little to the execution-time. As the problem dimensions grew, so did the total cost of all CDA's because of additional data (not program) memory demands. For the muP CDA, costs enlarged dramatically because of its repeated hardware and software. Finally, precision changes did not alter the costs significantly, since the memory requirements of the problem dimensions dominate those of added precision.

As with any research project, the investigation reported here points toward several areas of additional study. For example, rather than characterize CDA's through three specific applications, perhaps a generalized algorithm would induce broader conclusions. If integer N measures the size, or dimension, of a generalized algorithm, then the time complexity, T(N), expresses the number of seconds required to complete the task. Similarly, the space complexity of the algorithm, S(N), denotes the number of memory words needed to execute the task (this quantity relates closely to the cost attribute). So, for fixed execution-interval and memory size, such expressions could give limits to the problem dimensions for an entire class of problems.

Besides extending this research project to match technological growth, the fundamental concepts of MUT could be used to investigate and characterize various muP systems in detail. With enhanced throughput, improved reliability, increased system response, and modular expansion capabilities, the muP arrangement deserves much attention. Still, a flock of problems exist which need careful research before implementation progress can occur. But MUT paves the path, since it can facilitate analysis of both hardware and software.

Finally, since the methodology developed in this research project does not assume any specific hardware devices or software algorithms, it could aid a parametric study of either of these two topics. Such investigations may lead to optimal design strategies for uP-based systems which consider both hardware and software tradeoffs.

REFERENCES

1) R. K. Jurgen, "Electronics in medicine," IEEE Spectrum, vol. 15, pp. 68-72, Jan. 1978.

2) H. L. Van Trees and E. V. Hoversten, "Communications satellites: Looking to the 1980s," IEEE Spectrum, vol. 14, pp. 42-51, Dec. 1977.

3) S. L. Lillevik, P. D. Fisher, and A. L. Jones, "A predictive CMOS-based instrument for agriculture," Microcomputer Design and Applications (ed. S. C. Lee). New York: Academic Press, 1977, pp. 275-286.

4) P. D. Fisher and S. L. Lillevik, "Monitoring system optimizes apple-tree spray cycle," Electronics, vol. 50, pp. 125-127, Nov. 24, 1977.

5) S. L. Lillevik, P. D. Fisher, and A. L. Jones, "A predictive field instrument for agricultural production," Proc. IEEE Microcomputer '77 Conf., pp. 137-142, Apr. 1977.

6) R. K. Jurgen, "Electronic funds transfer: Too much, too soon?," IEEE Spectrum, vol. 14, pp. 51-54, May 1977.

7) C. C. Foster, "Something new--The Intel MCS-4 microcomputer set," Comp. Arch. News, vol. 1, pp. 16-17, Apr. 1972.

8) J. L. Ogdin, "Survey of 8-bit microprocessors reveals wide choice for users," EDN, vol. 19, pp. 44-50, June 20, 1974.

9) R. L. Petritz, "The pervasive microprocessor: Trends and prospects," IEEE Spectrum, vol. 14, pp. 18-24, July 1977.

10) P. W. J. Verhofstadt, "Technology for microprocessor hardware," Proc. IEEE Spring COMPCON, pp. 277-282, Apr. 1976.

11) S. Y. Lau, "Design high-performance processors," Electronic Design, vol. 25, pp. 86-95, Mar. 29, 1977.

12) S. Y. Lau, "Bit-slice microprogramming saves software compatibility," EDN, vol. 23, pp. 42-46, Mar. 5, 1978.

13) J. Nemec, G. Sim, and B. Willis, "A primer on bit-slice processors," Electronic Design, vol. 25, pp. 52-60, Feb. 1977.

14) G. F. Muething Jr., "Designing the maximum performance into bit-slice processors," Electronic Design, vol. 25, pp. 52-60, Feb. 1977.

15) G. C. Feth, "Memories: Smaller, faster, and cheaper," IEEE Spectrum, vol. 13, pp. 36-43, June 1976.

16) N. Lindgren, "Semiconductors face the '80s," IEEE Spectrum, vol. 14, pp. 42-48, Oct. 1977.
17) T. S. Bush, "Local memory for a high-speed digital test system," IEEE Trans. Instrum. and Meas., vol. IM-26, pp. 217-220, Sept. 1977.

18) P. Franson, "Special report: Semiconductor memories," EDN, vol. 22, pp. 46-58, June 20, 1977.

19) E. A. Torrero, "Bubbles rise from the lab," IEEE Spectrum, vol. 13, pp. 28-31, Sept. 1976.

20) P. Franson, "IC's and semiconductors--suppliers rush to perfect new processes," EDN, vol. 22, pp. 88-95, Dec. 15, 1977.

21) R. Greene, G. Perlegos, P. J. Salsbury, and W. L. Morgan, "The biggest erasable PROM yet puts 16,384 bits on a chip," Electronics, vol. 50, pp. 108-111, Mar. 3, 1977.

22) R. Proebsting, "Dynamic MOS RAM's: An economic solution for many system designs," EDN, vol. 22, pp. 61-66, June 20, 1977.

23) E. A. Torrero, "Solid-state devices," IEEE Spectrum, vol. 15, pp. 36-40, Jan. 1978.

24) E. R. Hnatek, "Current semiconductor memories," Computer Design, vol. 17, pp. 115-126, Apr. 1978.

25) E. A. Torrero, "The multifacets of I2L," IEEE Spectrum, vol. 14, pp. 28-36, June 1977.

26) H. L. Garner, "A survey of some recent contributions to computer arithmetic," IEEE Trans. Comp., vol. 25, pp. 1277-1282, Dec. 1976.

27) S. Sanyal, "An algorithm for nonrestoring division," Computer Design, vol. 16, pp. 124-127, May 1977.

28) S. Waser and A. Peterson, "Real-time processing gains ground with fast digital multiplier," Electronics, vol. 50, pp. 93-99, Sept. 29, 1977.

29) "Multiplier integrates accumulator, provides 175-nsec multiply-accumulate," EDN, vol. 22, p. 92, Aug. 5, 1977.

30) "24-bit single-chip multiplier eases floating-point computations," EDN, vol. 23, p. 156, June 5, 1978.

31) "Am9511: Arithmetic processing unit," prepared by Advanced Micro Devices, Inc. (Sunnyvale, Calif.), 1977.

32) "Number-processing controller cuts cost of software development," Computer Design, vol. 16, pp. 128-129, Aug. 1977.

33) P. D. Fisher and S. M. Welch, "Wedding calculators and instruments," Electronics, vol. 47, pp. 100-109, May 16, 1974.

34) W. W. Moyer, "Interfacing calculator chips as microcomputer preprocessors," Computer Design, vol. 17, pp. 187-191, May 1978.

35) T. P. Hughs, D. H. Sawin III, and D. R. Hadden Jr., "LSI software," Proc. IEEE Microcomputer '77 Conf., pp. 46-53, Apr. 1977.

36) P. Franson, "Advent of standard chip software sure to ease system implementation," EDN, vol. 22, pp. 21-22, Oct. 5, 1977.

37) J. L. Baer, "Multiprocessor systems," IEEE Trans. Computers, vol. 25, pp. 1271-1277, Dec. 1976.

38) A. J. Weissberger, "Analysis of multiple-microprocessor system architectures," Computer Design, vol. 16, pp. 151-163, June 1977.

39) J. W. Bowra and H. C. Torng, "The modeling and design of multiple function-unit processors," IEEE Trans. Computers, vol. 25, pp. 210-221, Mar. 1976.

40) P. M. Russo, "An interface for multi-microcomputer systems," Proc. IEEE Fall COMPCON, pp. 277-282, Oct. 1976.

41) W. C. Giauque and T. C. Peebles, "Application of multidimensional utility theory in determining optimal test-treatment strategies for streptococcal sore throat and rheumatic fever," Opns. Res., vol. 24, pp. 933-950, Sept. 1976.

42) J. Von Neumann and O. Morgenstern, Theory of Games and Economic Behavior. Princeton, N.J.: J. Wiley and Sons, 1967, pp. 15-31.

43) R. L. Keeney, "Quasi-separable utility functions," Naval Res. Log. Quart., vol. 15, pp. 367-390, 1968.

44) R. L. Keeney, "Multiplicative utility functions," Opns. Res., vol. 22, pp. 22-34, Jan. 1974.

45) ibid., p. 24.

46) K. R. MacCrimmon and M. Toda, "The experimental determination of indifference curves," The Review of Economic Studies, vol. 36, pp. 433-451, 1969.
47) R. M. Dawes and B. Corrigan, "Linear models in decision making," Psychological Bull., vol. 81, pp. 97-106, Feb. 1974.

48) H. J. Einhorn and R. M. Hogarth, "Unit weighting schemes for decision making," Organizational Behavior and Human Performance, vol. 13, pp. 171-192, Apr. 1975.

49) W. Edwards, "How to use multiattribute utility measurement for social decision making," IEEE Trans. Sys., Man, and Cybern., vol. SMC-7, pp. 326-340, May 1977.

50) H. Raiffa, Decision Analysis: Introductory Lectures on Choices Under Uncertainty. Reading, Mass.: Addison-Wesley, 1968.

51) W. Edwards, "Social utilities," The Engineering Economist, Summer Symposium Series, vol. 6, pp. 119-129, 1971.

52) J. M. H. Olmsted, Advanced Calculus. New York: Appleton-Century-Crofts, 1961, pp. 524-525.

53) R. Grossman and J. Conway, "Minicomputer systems directory," EDN, vol. 21, wall chart, June 5, 1976.

54) C. G. Cullen, Matrices and Linear Transformations. Reading, Mass.: Addison-Wesley, 1967, pp. 188-192.

55) T. M. Walker, An Introduction to Computer Science and Algorithmic Processes. Boston: Allyn and Bacon, 1971, pp. 324-325.

56) D. G. Moursund and C. S. Duris, Elementary Theory and Application of Numerical Analysis. New York: McGraw-Hill, 1967, p. 64.

57) J. W. Cooley and J. W. Tukey, "An algorithm for the machine computation of complex Fourier series," Math. Computation, vol. 19, pp. 297-301, Apr. 1965.

58) W. D. Stanley, Digital Signal Processing. Reston, Va.: Prentice-Hall, 1975.

59) ibid., p. 259.

60) A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms. Reading, Mass.: Addison-Wesley, 1974, pp. 2-33.

61) F. J. Hill and G. R. Peterson, Digital Systems: Hardware Organization and Design. New York: J. Wiley and Sons, 1973, p. 159.