This is to certify that the thesis entitled

PERFORMANCE DATA MODELING, TRANSFORMATIONS, AND MULTIPLE-DOMAIN ANALYSIS METHODS

presented by Abdul Waheed has been accepted towards fulfillment of the requirements for the M.S. degree in Electrical Engineering.

Major professor
Date: 11/11/93

PERFORMANCE DATA MODELING, TRANSFORMATIONS, AND MULTIPLE-DOMAIN ANALYSIS METHODS

By

Abdul Waheed

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Electrical Engineering

1993

ABSTRACT

PERFORMANCE DATA MODELING, TRANSFORMATIONS, AND MULTIPLE-DOMAIN ANALYSIS METHODS

By

Abdul Waheed

Observed data is a useful source of analysis for systems whose physical structure and functioning are highly complex. A parallel computer system, in which a parallel program has processors working concurrently and interacting with one another to solve a problem, is such a system. Parallel program monitoring and visualization, based on data observed and collected during the execution of a program, are therefore considered usable and effective methods for analyzing the performance of the system and program. Conventional performance visualization and analysis methods and tools can provide the programmer with information about higher-level program abstractions, but they are often less effective when performance data volume and complexity grow with problem and/or system size. We have developed a matrix-based model to represent, process, and analyze performance data in order to navigate this complexity more rigorously. Matrix transformations are applied to a performance data matrix to transform the data to spatial, temporal, event, and frequency domains, for the purpose of better visualization and analysis focused on specific aspects of the program. Specific aspects of program behavior are analyzed using appropriate techniques that are prevalent in statistical signal processing and digital image processing, and this in turn contributes to an understanding of the macroscopic behavior of the whole program. This approach has been prototyped using commercially available data analysis tools and conventional performance visualization tools, and it meets the objectives of scalability, extensibility, and efficacy for a variety of program performance analysis applications.

To the Creativity

ACKNOWLEDGEMENTS

The author acknowledges the invaluable assistance and guidance of everyone who has been involved with this work. The accomplishment of this work would not have been possible without Dr. Diane Rover's guidance of the research, her patience in listening to the author's ideas, and her encouragement to apply them to an entirely new area. Moreover, this thesis could not have such an excellent flow without her meticulous editing. The author is grateful to Dr. Hassan Khalil and Dr.
Majid Nayeri for their contributions to the author's vision and understanding of the related areas of this multi-disciplinary research work. Dr. Richard Enbody of the Department of Computer Science deserves special thanks for providing a general evaluation of this work despite his very busy schedule. Some of the nCUBE-2 programs used for the case studies presented in this thesis were originally written by Edgar Kalns of the Department of Computer Science, who deserves our appreciation for allowing us to work with them. An internet survey was conducted to assess the use of matrices in related areas of computer science. The author acknowledges the contributions of all the respondents. Relevant publications of the respondents have been appropriately referenced.

Table of Contents

List of Tables .... viii
List of Figures .... ix

Chapter 1  Introduction .... 1
1.1 Motivation .... 2
1.2 A Model for Performance Trace Data .... 6
1.2.1 Matrix Representation .... 8
1.2.2 Examples from Other Areas .... 8
1.2.2.1 State-Space Analysis of Linear Systems .... 9
1.2.2.2 Application to Digital Filter Design .... 11
1.2.2.3 Simulation .... 12
1.2.2.4 Representation of Visualization Data .... 12
1.3 Matrix Transformations .... 14
1.4 Multiple-Domain Analysis Approach .... 15
1.5 Scope of this Work .... 19

Chapter 2  Related Work .... 22
2.1 Analytical Methods .... 24
2.1.1 Markov Models .... 25
2.1.2 Queuing Models .... 27
2.1.3 Petri Nets .... 30
2.1.4 Simulation Modeling .... 32
2.1.5 Analytical Modeling Tools .... 32
2.2 Monitoring-Based Methods .... 34
2.2.1 Conventional Methods .... 37
2.2.2 Advanced Methods .... 41
2.3 Application-Specific Methods .... 43
2.3.1 Animation .... 45
2.3.2 Application-Domain Visualizations .... 45

Chapter 3  Performance Metrics and Trace Data Modeling .... 47
3.1 Performance Metrics and Preliminary Definitions .... 47
3.2 Execution Trace Data and Matrix Representation .... 51
3.3 Metric-Based Representation .... 52

Chapter 4  Transformations to Analysis Domains .... 56
4.1 Performance Data Domain .... 57
4.2 Spatial Analysis Domain .... 58
4.3 Temporal Analysis Domain .... 62
4.4 Event Analysis Domain .... 66
4.5 Frequency Analysis Domain .... 69
4.6 Visualization in the Application Domain .... 71

Chapter 5  Multiple-Domain Analysis Methods .... 74
5.1 Analysis in the Spatial Domain .... 75
5.1.1 Representation of Macroscopic States .... 75
5.1.2 Change Detection via Image Subtraction .... 77
5.1.2.1 Change Detection .... 78
5.1.2.2 Spatial Pattern Matching .... 78
5.1.3 Thresholding .... 79
5.1.3.1 Identification of Hot-Spots of Activity .... 80
5.1.3.2 Identification of Types of Change .... 80
5.1.4 Image Enhancement .... 81
5.2 Analysis in the Temporal Domain .... 82
5.2.1 Analysis of Trends and Phases .... 83
5.2.2 Analysis of Repetitive Patterns .... 86
5.2.3 Comparison of Temporal Patterns .... 88
5.2.4 Thresholding .... 89
5.3 Analysis in the Event Domain .... 90
5.3.1 Representation of State Transitions .... 91
5.3.2 Analysis of Trends and Patterns .... 91
5.3.3 Analysis of Event-Frequency Distribution .... 92
5.3.4 Markov Modeling .... 94
5.4 Significance of the Frequency Domain .... 95
5.5 An Illustrative Example .... 97
5.5.1 Transformation to the Spatial Domain .... 103
5.5.2 Transformation to the Event Domain .... 104
5.5.3 Transformation to the Temporal Domain .... 105
5.5.4 Transformation to the Frequency Domain .... 107
5.6 A Discussion on Applying Multiple-Domain Analysis .... 109

Chapter 6  Case Studies .... 111
6.1 Context of Case Studies .... 111
6.1.1 The Program and System .... 113
6.1.2 Performance Metric and Result Summary .... 114
6.2 Representing the Machine Mapping .... 116
6.2.1 Method: Performance Images .... 116
6.2.2 Example: Linear Solver .... 117
6.3 Viewing the Macroscopic State of Systems .... 117
6.3.1 Method: Performance Images .... 117
6.3.2 Example: SLALOM .... 118
6.4 Analysis by Small Multiples .... 119
6.4.1 Method: Image Subtraction and Binary Thresholding .... 120
6.4.1.1 Example: SLALOM .... 120
6.4.1.2 Example: SLALOM .... 121
6.4.2 Method: Spatial Pattern Matching .... 122
6.4.2.1 Example: Comparison among Data Distributions .... 123
6.4.2.2 Example: Comparison among Phases .... 124
6.5 Analysis of Temporal Patterns .... 125
6.5.1 Examples of Patterns Resulting from Message-Passing .... 127
6.5.1.1 Broadcast .... 127
6.5.1.2 Global Summation .... 129
6.5.1.3 Synchronization .... 130
6.5.2 Analysis of Macroscopic Patterns .... 131
6.5.2.1 Method: Moving Averages .... 132
6.5.2.2 Example: Linear Solver Program .... 132
6.5.3 Estimation of Repetitive Patterns .... 135
6.5.3.1 Method: Autocorrelation .... 136
6.5.3.2 Example: Pi Program .... 138
6.5.3.3 Example: Linear Solver Program .... 141
6.5.4 Estimation of Template Patterns .... 147
6.5.4.1 Method: Cross-correlation .... 148
6.5.4.2 Example: Linear Solver .... 148
6.5.4.3 Example: Comparison between Data Distributions .... 152
6.5.5 Identification of Arbitrary Patterns .... 154
6.5.5.1 Method: Thresholding .... 155
6.5.5.2 Example: Synchronization Points in GE Phase .... 156
6.5.5.3 Example: Identification of Arbitrary Patterns .... 156
6.6 Analysis of Patterns using Multiple Domains and Views .... 157
6.6.1 Analysis of Spatial Domain Patterns .... 158
6.6.2 Frequency Distribution Patterns .... 160
6.7 Enhancing Aural Analysis .... 163

Chapter 7  Future Directions .... 166
7.1 Implementation as a Toolkit .... 166
7.2 Markov Modeling .... 170
7.3 Modeling of Message-Passing .... 170
7.4 Performance Prediction .... 171
7.5 Dynamic Load Balancing .... 171
7.6 Database Management Systems .... 172

List of References .... 174
Color Plates .... 182

List of Tables

TABLE 6-1. Summary of program execution times. .... 115
TABLE 6-2. Summary of results of estimating loop timings for the pi program. .... 141
TABLE 6-3. Summary of results from estimating loop timings of GE phase. .... 145
TABLE 6-4. Summary of results from estimating loop timings for GE phase with synchronization. .... 147
TABLE 6-5. Summary of results from estimating broadcast timings for GE phase. .... 152

List of Figures

Figure 1-1. State-space representation of a system. .... 9
Figure 1-2. Digital filter with N = 3. .... 11
Figure 1-3. Conversion of data for different tools. .... 13
Figure 1-4. Multiple-domain analysis approach from a user's perspective. .... 19
Figure 2-1. Classification of performance analysis methods and approaches. .... 23
Figure 2-2. Phases of an analytical study of a parallel system. .... 25
Figure 2-3. Markov chain representation of the system-program behavior. .... 27
Figure 2-4. Ingredients of a basic queuing model. .... 28
Figure 2-5. Example of a Petri net. .... 31
Figure 2-6. Monitoring-based methods for performance analysis. .... 36
Figure 2-7. Application-specific methods for performance analysis. .... 44
Figure 4-1. Differences between event and temporal domain performance signals. .... 68
Figure 5-1. Application of image processing to performance analysis in the spatial domain. .... 82
Figure 5-2. Smoothing of the performance signals by a digital moving averages filter. .... 84
Figure 5-3. Impulse response of the moving averages filter. .... 85
Figure 5-4. Application of signal processing to performance analysis in the temporal domain. .... 90
Figure 5-5. Instrumented code for the example. .... 98
Figure 5-6. Format of a PICL tracefile and various event types for the example instrumented code. .... 99
Figure 5-7. Space-time dynamics of broadcast. .... 100
Figure 5-8. Trace data matrix for the example program. .... 101
Figure 5-9. Metric-based data matrix for the example program. .... 102
Figure 5-10. Performance image example. .... 104
Figure 5-11. Calculation of event domain performance signal for the example program. .... 105
Figure 5-12. Performance signal in event domain x(n). .... 106
Figure 5-13. Temporal domain performance signal w(n). .... 107
Figure 5-14. Frequency domain spectrum of the temporal domain performance signal. .... 108
Figure 5-15. Spectrum of the event domain performance signal. .... 108
Figure 6-1. Data distributions (here, P = 4 and N = 8). .... 114
Figure 6-2. Performance image used to represent the machine mapping. .... 118
Figure 6-3. Machine mapping for uneven distribution. .... 119
Figure 6-4. Performance images of PE utilization: even and uneven distributions. .... 119
Figure 6-5. Performance images of GE in SLALOM (series of 12 snapshots over time). .... 121
Figure 6-6. Change detection via image subtraction. .... 122
Figure 6-7. Analysis by small multiples using image subtraction and binary thresholding. .... 123
Figure 6-8. Comparison among GE phases of different data distributions via template matching. .... 125
Figure 6-9. Comparison among GE and BS phases of linear solver. .... 126
Figure 6-10. Spatial and temporal dynamics of broadcast among 16 processors. .... 128
Figure 6-11. Spatial and temporal dynamics of global summation. .... 129
Figure 6-12. Spatial and temporal dynamics of synchronization. .... 131
Figure 6-13. Space-time dynamics of backsubstitution algorithm (ParaGraph). .... 133
Figure 6-14. Analysis of macroscopic patterns in BS phase signal by using a moving averages filter. .... 134
Figure 6-15. Instrumented code for the exchange-add part of the pi program. .... 138
Figure 6-16. Space-time dynamics of the pi program with trace-marks showing start of the fourth iteration. .... 139
Figure 6-17. Comparison between estimated and actual loop iteration times for pi program. .... 140
Figure 6-18. Space-time dynamics of GE phase of linear solver. .... 142
Figure 6-19. Estimate of autocorrelation of the whole GE performance signal. .... 143
Figure 6-20. Estimate of autocorrelation of a window of GE performance signal. .... 145
Figure 6-21. Estimate of autocorrelation of a window of GE performance signal with synchronization. .... 146
Figure 6-22. Estimation of occurrence of broadcast calls during GE via estimate of cross-correlation. .... 149
Figure 6-23. Precise identification of broadcast from GE with "proper" alignment of time scales. .... 151
Figure 6-24. Estimation of the possibility of broadcast in BS phase of linear solver. .... 153
Figure 6-25. Comparison of GE phase implemented by using two different data distributions. .... 154
Figure 6-26. Identification of instances of synchronization in GE phase of the linear solver. .... 157
Figure 6-27. Identification of instances of minimum system utilization during GE phase. .... 158
Figure 6-28. Performance images taken after equal intervals during BS phase and analysis of last image. .... 159
Figure 6-29. Frequency density histograms of two implementations of GE phase of linear solver. .... 161
Figure 6-30. Frequency density histogram of two implementations of BS phase of linear solver. .... 162
Figure 6-31. Performance image from GE phase with its histogram. .... 163
Figure 6-32. Sound signals representing an auralization in the time and frequency domains. .... 164
Figure 7-1. Performance Analysis Environment (based on Pablo). .... 167
Figure 7-2. The toolkit approach. .... 168
Figure 7-3. Practical implementation of the toolkit approach. .... 169
Figure 7-4. Operation of a database management system. .... 172

Chapter 1

Introduction

The analysis of program performance on parallel computer systems is a non-trivial task due to the complexity of their architectures and of the programming paradigms implemented on them. Conventional methods used for the performance analysis of sequential computer systems and programs often do little to enhance our understanding of the behavior of parallel systems and programs. Several approaches have been proposed to tackle the performance evaluation problem of a specific parallel system using a specific performance analysis method or tool. However, we are still far from a comprehensive methodology for the performance evaluation of any program executed on any parallel system. This thesis is an effort to define certain aspects of this problem by appropriately characterizing the performance of a parallel computer system, and then to use this characterization to develop an analysis approach that can address certain aspects of this task in a generalized manner.

Program monitoring and performance visualization are considered useful techniques for the analysis of parallel program performance. These techniques depend on performance data obtained by executing instrumented parallel programs on parallel systems. However, these methods are limited by several factors, including their scalability, their extensibility, and the lack of rigor in their analysis methodologies. The objective of our approach is to develop more rigorous methods for analyzing performance data visually as well as statistically, methods that are scalable and extensible and that represent some physical and/or logical structure of the parallel system and program. This approach supplements and complements program performance analysis on parallel systems using visualization. In the rest of this chapter, we discuss the motivation for this work and the analysis approach used for this purpose. A brief introduction to the performance analysis domains, transformations, and analysis methods then follows, before they are considered in detail in subsequent chapters.

1.1 Motivation

We, as engineers and scientists, have learned to adopt certain approaches to analyze systems or phenomena of varying degrees of complexity. Characterizing complex systems and behaviors by appropriate parameters (metrics), and modeling and analyzing them with available mathematical tools, is a commonly followed practice. However, characterizing system behavior by certain metrics and building an accurate mathematical model of the system does not guarantee a realistic as well as practicable analysis. If a mathematical model is built on simplifying assumptions, its solution is simple, but its accuracy in practical situations is compromised. On the other hand, if we make no simplifying assumptions and build an accurate model, its solution for even the simplest of problems will be non-trivial, and the model will therefore be impractical for more complex problems.

Models that represent the physical structure of the system components (usually in the form of differential/difference equations) have been referred to as structural models [24]. These models use a collection of state variables in vector form, x(t), to represent the state of various elements of the system. The state variables are related to the dynamics of the system model by vector differential equations of the form:

$$\dot{x}(t) = f(x(t), u(t), t) \tag{1.1}$$

where $u(t)$ is an input vector, $x(0)$ gives the initial conditions, and $t$ is the independent variable. Such models have been solved using analytical methods such as transform techniques, with analog computers for more complex systems, and with digital computers to numerically solve models of large systems. Digital filters were developed to numerically implement the transfer functions that result from transform solutions to study the structural view of the system.
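To make the numerical-solution step concrete, the following sketch integrates a structural model of the form (1.1) with a forward-Euler scheme. It is only a minimal illustration: the two-state linear system, the step input, and the step size are hypothetical choices, not anything prescribed by this thesis.

```python
import numpy as np

def euler_solve(f, x0, u, t_grid):
    """Numerically integrate dx/dt = f(x(t), u(t), t) with forward Euler."""
    x = np.asarray(x0, dtype=float)
    trajectory = [x.copy()]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        x = x + (t1 - t0) * f(x, u(t0), t0)   # one Euler step
        trajectory.append(x.copy())
    return np.array(trajectory)

# Hypothetical two-state linear structural model: dx/dt = A x + B u(t).
A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])
B = np.array([0.0, 1.0])
f = lambda x, u, t: A @ x + B * u
u = lambda t: 1.0                              # unit-step input
t_grid = np.linspace(0.0, 5.0, 501)

states = euler_solve(f, [0.0, 0.0], u, t_grid)
print(states[-1])                              # approaches the steady state [0.5, 0.0]
```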
With the development of digital computers, another view of systems emerged. The models based on this view are concerned with the operations that a system performs, rather than with the structure of the system components used to perform an operation. Such system models have been termed operational models [24]. The system is observed only at certain discrete times (called events) and is assumed to be composed of a set of processes operating concurrently. The process outcome is considered a sequence of system states (discrete and continuous) and events (system state transitions). Such models have been used for discrete-event simulations of systems.

We consider a parallel program executed on a parallel computer as two entities of an intricate system whose behavior and performance are to be analyzed. Such systems have been modeled by various mathematical models such as queuing models, Petri nets, Markov models, etc. These models tend to represent a combination of what has been termed above the structural and operational views. However, these models are often so intricate, due to the complexity of parallel computer system dynamics, that an average user of the system cannot use them for routine performance and program debugging purposes. During the last few years, the practitioners of parallel systems have developed another paradigm for analyzing the behavior of these systems (parallel computers and programs). This paradigm has been termed system monitoring [27, 28, 50, 86, 89], and it views the system in a way that is closer to the operational view. Parallel systems are instrumented using specialized hardware and/or software libraries to observe the states of the system when certain events occur due to the interactions of a program with the computer system. The observed execution trace data is visualized using the state-of-the-art computer graphics facilities that are now available with almost every desktop workstation. Various performance instrumentation and visualization tools, such as ParaGraph, Pablo, and Seeplex [28, 50, 86], have been developed and are considered an integral part of parallel program development environments.
Although the system monitoring technique has contributed greatly to simplifying the analysis of the behavior of parallel systems, it has a few constraints, listed in the following:

1. The monitoring environments are usually not based on any of the well-known physical and/or programming paradigms that are often used for parallel systems. The analysis performed by visualizing the observed (instrumented) data is entirely dependent on the type of events generating the data. Although this strategy works in many simple cases, the capabilities of such analysis are limited when the computational (logical) structure of the program is complex.

2. Instrumenting a parallel system generates a very large amount of data even for an average-sized program; this volume is needed because of the generic nature of the displays used for program performance visualization. Neither instrumentation nor visualization considers any specific system and programming paradigm; therefore, it is usually not possible to decide what not to observe. Consequently, long-running real application programs are very hard to instrument using this technique.

3. When observed data are graphically presented to the user as a time-series plot or in other displays specifically used for performance visualization, the lack of scalability is perhaps the first problem the user encounters. The presentation techniques often do not adequately scale with an increasing number of processors in the system or with increasing problem size. This problem is critical for the scientific and engineering computations that are typically solved on large-scale parallel processors and that often involve large amounts of program as well as performance data.

4. Most visualization tools are restricted to presenting visual information to the user. This is a much more effective way for a human user to assimilate information than a textual presentation of the same information. However, the presentation methods used by these tools are not extensible; that is, they do not exploit the analysis potential of any particular representation of the observed trace data.

5. The lack of extensibility often makes the efficacy of the monitoring tool questionable. Program instrumentation is a costly procedure, both in the time it takes to monitor the execution of various instructions and to log event-specific data, and in the perturbation of system behavior caused by this intrusive procedure. A user expects effective analysis from the visualization tool after going through the expensive procedure of executing an instrumented program.

One motivation of the work presented in this thesis was to overcome the above limitations of the system monitoring paradigm for analyzing the behavior of parallel systems. Although the system monitoring paradigm enjoys acceptance by a wide cross-section of the user community, this does not mean that we can ignore other system analysis approaches. Most of the limitations of existing methods and tools for parallel system monitoring are due to the lack of a rigorous mathematical model that could make this paradigm work as a generic analysis tool. Therefore, we have adopted the system monitoring paradigm for its simplicity, but we are not indifferent to other analysis approaches used for parallel computer systems in particular and other systems in general.
One might conclude from the above discussion that performance analysts and practitioners have entirely different points of view regarding performance analysis: while one group considers the field a science, the other considers it an art. Consider the following excerpts, which all relate to the "art" or "science" of computer system performance evaluation:

Formation of models of physical behavior underlies science. Yet our ability to construct behavioral models of parallel and distributed programs is very limited. [1]

... the emphasis on the "art" of computer design, of system management, of programming, and so on, is antithetic to the quantitative philosophy of performance evaluation. The performance evaluation viewpoint emphasizes the scientific method, well-planned and controlled experimentation, and careful use of mathematical techniques for prediction. [34]

Designing an enlightening animation is truly a psychological and perceptual challenge... At present, creating effective dynamic visualizations of computer programs is an art, not a science. [13]

Contrary to popular belief, performance evaluation is an art. Like a work of art, successful evaluation cannot be produced mechanically. Every evaluation requires an intimate knowledge of the system being modeled and a careful selection of the methodology, workload, and tools. When first presented to an analyst, most performance problems are expressed as an abstract feeling, like a rough sketch, by the end user. Defining the real problem and converting it to a form in which established tools and techniques can be used and where time and other constraints can be met is a major part of the analyst's "art." [109]

The field, in general, has historically asked, "Is it art?", both for computer architecture [111] and for computer programming [60]. Clearly, performance evaluation based on observing a parallel system's behavior by monitoring program execution on an architecture also faces this question. This is especially true for the presentation and analysis of performance data using visual means. Parallel system monitoring can present fundamental information about program execution. While it is increasingly deemed a powerful tool for parallel program performance analysis and debugging, it is typically considered more of an art than a science. What a user sees and hears can depend on his or her perception of the graphics and sound employed in the presentation of information. Moreover, the user is often largely unassisted in the analysis of this information. The relationship between what is seen and/or heard and program behavior is based on sensory perception and knowledge of the program. While this brings invaluable insights into the exploration process, there is a need to assist the user with analysis methods that support a more rigorous approach. With this, the user gains not only more insights, but also the capability to substantiate sensory perception with analytical results. The question is: what are appropriate analysis methods that bring visualization and auralization closer to being a science? We address this question in the context of program monitoring of medium- and large-scale parallel systems. The overall objective of our approach is to develop more rigorous methods for analyzing performance data visually as well as statistically. The methods should be scalable with system and problem sizes, and extensible (so as to facilitate further analysis).
Moreover, the results should represent some physical and/or logical structure of the parallel system state. The parallel system performance analysis approach presented in this thesis is founded on the idea of considering all relevant system analysis approaches and available mathematical tools as candidate resources, to be applied appropriately to the problem at hand. As engineers, this is the approach we have been using to analyze various systems, and we are aware of its capabilities. We use the instrumentation techniques available with most monitoring tools, but for the purpose of analysis we have tried to extend the existing methods using the above approach. Specifically, this approach to performance analysis has three aspects: (1) modeling of trace data as a matrix for the purpose of analysis, (2) transformations applied to this matrix, and (3) performance analysis using multiple domains. Subsequent sections introduce the motivation for these guiding principles, which underlie the definitions and analysis presented in subsequent chapters.

1.2 A Model for Performance Trace Data

Most of the limitations of existing performance monitoring methods mentioned in section 1.1 are a direct consequence of the lack of a rigorous mathematical structure associated with the execution trace data. Most instrumentation systems log the trace data observed during the execution of a program in the form of a trace file (binary or ASCII text) whose fields are arranged according to the system's own semantics. This file often consists of trace records containing the time-stamp of an event (where the occurrence of the event results in logging a record in the trace file) along with event-specific information. These trace records are used to update visualization displays to "replay" the execution of a program. A trace file does not have any well-defined mathematical structure that could be used for the rigorous development of analysis methods for the data. Some would argue that the records of a trace file follow the structure of a relational database, and thus support relational calculus (by considering a record as a tuple). This supports the implementation of various information processing tasks and relational data analysis, but it does not explicitly support rigorous mathematical analysis. When the system size or the problem size increases, the displays of these tools cannot adequately scale to accommodate large amounts of information over longer periods of execution time. Some tools do use graphics functions such as scrolling a window with a scroll-bar or decreasing the resolution of the information, but such techniques are questionable in two respects:

• The macroscopic context of the information is lost if we scroll a display to accommodate a larger (or longer over time) display, as such viewing of information is typically not coupled with any type of analysis performed on the data.

• The resolution of the time scale of time-series plots of performance data is reduced in order to accommodate more data in a given display. Some tools (such as ParaGraph) discard the events occurring during the minimum step size to accommodate more temporal information. This effort to preserve the macroscopic perspective sacrifices precision, and this type of approximation often leads to wrong results and assumptions about program behavior.

Neither of the above techniques genuinely improves the scalability of the tool.
Our experiences with the existing visualization tools, and the methods presented in this thesis, indicate that scalability cannot be achieved unless the data structure used for representing performance data is mathematically rigorous. If performance data are modeled by such a mathematical structure, the definition of the representation does not have to change with the size of the system and/or problem. As will be discussed in subsequent chapters, various generic analysis methods can be applied successfully once the underlying data structure conforms to certain rigorous definitions. We have selected matrices to model and represent the performance data.

1.2.1 Matrix Representation

By definition, a matrix is an r × c rectangular array of numbers that are subject to certain rules, where r represents the number of rows and c the number of columns. Matrices are a very useful notation for representing numbers, and several techniques of matrix theory are used to analyze them in various contexts and disciplines. From the systems analysis perspective in particular, matrices and techniques of linear algebra are used to get a "global" view of a system and to understand its behavior [99]. The use of matrices and linear algebra is not restricted to the systems sciences; these techniques are used in almost all disciplines that deal with the analysis of large-scale systems characterized by a number of different parameters and variables. Matrices serve the dual purpose of presenting an overall "feel" for the state of the system and of allowing powerful algebraic and numerical methods to analyze the system according to specific objectives. A large-scale parallel computer system is no exception. It has a great deal of concurrent activity that affects its overall state, which is not easy to understand without powerful analysis techniques. Based on the state information (in consequence of specific events) obtained from such a system, we will introduce a technique (in chapter 3) to represent the system states (defined from event traces) in the form of a matrix, and then subject them to the applicable and relevant analysis methods to understand the behavior of the system in response to a parallel program (a sequence of inputs to the system).
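As a small, concrete sketch of this idea, the code below gathers hypothetical trace records into a rectangular array and manipulates them with row selections instead of file parsing. The record layout (time-stamp, processor, event code) is an illustrative assumption; the actual trace data matrix is defined in chapter 3.

```python
import numpy as np

# Hypothetical trace records: (time-stamp, processor id, event code).
# Real tracefiles (e.g., PICL traces) carry additional event-specific fields.
records = [
    (0.10, 0, 1), (0.12, 1, 1), (0.30, 0, 2),
    (0.35, 1, 2), (0.50, 0, 1), (0.55, 1, 1),
]

# One matrix row per record: an r x c trace data matrix.
trace_matrix = np.array(records)
print(trace_matrix.shape)                 # (6, 3): r records, c fields per record

# Matrix-style processing: per-processor event counts via row selection,
# with no re-parsing of a trace file and no format-specific code.
for p in np.unique(trace_matrix[:, 1]):
    count = int((trace_matrix[:, 1] == p).sum())
    print(f"processor {int(p)}: {count} events")
```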
1.2.2 Examples from Other Areas

First, let us briefly look at the application of matrix theory to some other systems, and at the analysis objectives in those disciplines. A few examples are presented here to appreciate the motivation for the analysis techniques that will be applied to the performance analysis of large-scale parallel computer systems.

1.2.2.1 State-Space Analysis of Linear Systems

A discrete-time, linear, time-invariant system can be represented by a set of difference equations in (state) variables that represent the internal state of the system at different locations within the system. This representation became generally known as the state-space representation of the system [96, 113]. It differs from the conventional representation of the system in terms of the relationship between input and output signals (known as a transfer function). The state-space representation has two main features:

• It represents the relationship between the input and the output of the system, as in the case of the transfer function approach.
• The state-space description is complete in the sense that it gives information about the internal states of the system as well.

The state variable representation thus characterizes the relationships among the internal signals (states) as well as the external signals. Due to these two features, a much more complete picture of the dynamic behavior of the system can be visualized. For this type of representation, the whole system can be considered a "box" with internal states represented by n state variables $v_i(k)$ at discrete time k, as shown in Figure 1-1. There are m inputs to the system, represented by $x_i(k)$, and p outputs, represented by $y_i(k)$.

Figure 1-1. State-space representation of a system.

Such a system is called an n-dimensional linear system (as there are n state variables) and can be represented by two sets of linear equations [130]. The first represents the states of the system, and the second represents the outputs of the system as a function of the state variables. These are written in matrix form due to the linear relationship between the system parameters and the state variables. For the system shown in Figure 1-1, the state-space representation is:

$$\begin{bmatrix} v_1(k+1) \\ v_2(k+1) \\ \vdots \\ v_n(k+1) \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_n & a_{n-1} & a_{n-2} & \cdots & a_1 \end{bmatrix} \begin{bmatrix} v_1(k) \\ v_2(k) \\ \vdots \\ v_n(k) \end{bmatrix} + Bx(k) \tag{1.2}$$

where the B matrix is n × m and x is an m × 1 vector representing the inputs to the system. The output equation is:

$$\begin{bmatrix} y_1(k) \\ y_2(k) \\ \vdots \\ y_p(k) \end{bmatrix} = Cv(k) + Dx(k) \tag{1.3}$$

where the matrix C is p × n and D is a p × m matrix. The matrix on the right-hand side of equation (1.2) is of dimensions n × n (call it A), and it describes the dynamics of the system under consideration. This completes the generalized state-space description of the system. Once the system is described by this type of state model, all subsequent analysis is based on the four matrices A, B, C, and D. As a matter of fact, these matrices provide a great deal of information about the behavior of the system. This is accomplished by subjecting the four matrices to various matrix operations (such as determination of eigenvalues and eigenvectors, matrix multiplication, inversion, diagonalization, rank determination, etc.) to obtain information about the stability, controllability, observability, and input-output relationships of the system. The physical realization can be changed by applying various transformations [96] to the given realization {A, B, C, D} of the system. All these analysis and design steps would be very difficult to perform without the state-space representation of the system, and even then they could not provide as much insight about the whole system. So, in the case of a generalized system, matrices let us practically "dump" a large amount of information about the state of the system, which facilitates the analysis even when the system size becomes large.
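A minimal sketch of equations (1.2) and (1.3) in code, assuming a hypothetical second-order, single-input, single-output system; the particular A, B, C, D values are illustrative only, not taken from the thesis.

```python
import numpy as np

def simulate(A, B, C, D, x, v0, steps):
    """Iterate v(k+1) = A v(k) + B x(k) and y(k) = C v(k) + D x(k)."""
    v = np.asarray(v0, dtype=float)
    outputs = []
    for k in range(steps):
        outputs.append((C @ v + D @ x(k)).item())
        v = A @ v + B @ x(k)
    return outputs

# Hypothetical realization {A, B, C, D} with n = 2 states, m = p = 1.
A = np.array([[0.0, 1.0],
              [-0.5, 1.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
x = lambda k: np.array([1.0])            # unit-step input

# Stability is read directly off the eigenvalues of A (|lambda| < 1 here).
print(np.abs(np.linalg.eigvals(A)))
print(simulate(A, B, C, D, x, [0.0, 0.0], 20)[-1])   # output settling toward 2.0
```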
1.2.2.2 Application to Digital Filter Design

In general, a digital filter can be viewed as a system with inputs, outputs, and internal states [84]. An N-point digital filter (let us consider an FIR filter here) can be represented by a difference equation showing the input-output relationship:

$$y(n) = \sum_{k=0}^{N-1} b_k\, x(n-k) \tag{1.4}$$

Digital filters are modeled using adders, multipliers, and delay elements. The realization of the digital filter specified by the above difference equation is given in Figure 1-2 (for N = 3).

Figure 1-2. Digital filter with N = 3.

There are two delay elements in this filter (denoted by $z^{-1}$); therefore, according to the conventions for modeling such systems, the number of state variables chosen to represent this system is two. An obvious choice is state variables $v_1(n)$ and $v_2(n)$ such that the overall system can be written in matrix form as $v(n+1) = Av(n) + Bx(n)$, with the output represented by the equation $y(n) = Cv(n) + Dx(n)$. Here $v(n+1)$ is a 2 × 1 vector, A and B are 2 × 2 and 2 × 1 matrices, respectively, and C and D are 1 × 2 and 1 × 1 matrices, respectively. After representing this filter by the state-space model, two important design and analysis objectives can be achieved immediately (a code sketch follows this list):

• Useful information such as the stability of the system (filter) can be obtained directly from the eigenvalues of matrix A [84, 96]. Similarly, the input/output relationship can be determined from matrices A, B, C, and D.
• More important is the capability to transform this filter realization to some other realization by applying transformation techniques to the matrices A, B, C, and D. This is very important in some practical cases. For instance, a filter can be converted to a parallel form (from a tapped-delay form) by diagonalizing the matrix A (and correspondingly changing B, C, and D as well). This is very useful for taking advantage of parallel processing. Also, the effect of round-off errors is less severe for certain types of realizations, so this type of transformation helps [84].

This was another example of using matrix theory and state-space analysis to get a global view of a system and to analyze it. When systems become large-scale, it becomes increasingly difficult to achieve the above objectives without using state-space analysis and linear algebra techniques.
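The sketch below writes the N = 3 FIR filter of equation (1.4) in this state-space form, with the natural choice $v_1(n) = x(n-1)$, $v_2(n) = x(n-2)$, and then demonstrates both objectives: reading stability off the eigenvalues of A, and applying a similarity transformation T to obtain a different realization with the same input-output behavior. The coefficient values and the matrix T are arbitrary illustrative choices.

```python
import numpy as np

# FIR filter y(n) = b0 x(n) + b1 x(n-1) + b2 x(n-2), i.e. equation (1.4) with
# N = 3, realized with state variables v1(n) = x(n-1) and v2(n) = x(n-2).
b0, b1, b2 = 0.5, 0.3, 0.2                  # illustrative coefficients
A = np.array([[0.0, 0.0],
              [1.0, 0.0]])
B = np.array([[1.0], [0.0]])
C = np.array([[b1, b2]])
D = np.array([[b0]])

# Objective 1: stability from the eigenvalues of A. For an FIR filter,
# A is nilpotent, so every eigenvalue is zero and the filter is stable.
print(np.linalg.eigvals(A))

# Objective 2: another realization {T A T^-1, T B, C T^-1, D} obtained by a
# similarity transformation; the input-output behavior is unchanged.
T = np.array([[1.0, 1.0],
              [0.0, 1.0]])
Ti = np.linalg.inv(T)
A2, B2, C2 = T @ A @ Ti, T @ B, C @ Ti

def run(A, B, C, D, xs):
    v = np.zeros((A.shape[0], 1))
    ys = []
    for xn in xs:
        ys.append((C @ v + D * xn).item())
        v = A @ v + B * xn
    return ys

xs = [1.0, 2.0, -1.0, 0.0, 3.0]
print(run(A, B, C, D, xs))                  # both realizations produce the same
print(run(A2, B2, C2, D, xs))               # output sequence (up to round-off)
```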
1.2.2.3 Simulation

Simulation is an alternative way to analyze the behavior of a system, by developing a model for it and then testing this model [64]. The system can be characterized by discrete-time states and events (i.e., changes in the states). This type of model is used to gain insight into the behavior of the system by observing the real system and then comparing the results from the model under the same types of conditions. This process not only validates the model but also provides useful information about the internal states of the real system that could not be obtained without severely disrupting the operation of the actual system. As an example of the use of matrices as a data representation structure, consider the simulations based on queuing models that often represent the number of customers waiting in a queue at a given time, the service time for a customer, and other stochastic variables of interest as vectors. Simulation may be the only realization of a system if the real system has not yet been constructed.

1.2.2.4 Representation of Visualization Data

The examples of the use of matrices and linear algebra that we have considered so far have come from systems analysis. In this section, we consider an example from graphics and visualization. Visualization data in general comprise a multidimensional data set. There is no universally accepted way to characterize the format of these data [15, 112]; almost every tool has its own data representation and storage format. There are some implicit rules that everyone wants to follow, such as representation of data in ASCII instead of binary format. This is mainly due to the self-documentation advantage of the ASCII format, despite the trade-off with performance [86]. The important aspect, however, is resolving the data semantics associated with the visualization data so that it can be used by various tools. Format conversion of visualization data, whether in raw form or rendered as two- or three-dimensional images, is a practical problem when the data needs to be displayed, analyzed, or printed. We consider only the aspect of data representation for analysis purposes. Visualization data in general require various types of analysis functions to obtain insight about the systems they come from. The choice of analysis techniques often depends on the objectives of analyzing the data. These objectives can be so varied, even conflicting in nature, that no single visualization tool can support all types of analyses in an optimal manner. The user may have to import the data into other tools to deal with such situations. This often means an intermediate step of converting the data exported by one tool to another format that the second tool can import and process, as shown in Figure 1-3.

Figure 1-3. Conversion of data for different tools.

One aspect of the matrix approach for performance data modeling needs to be mentioned specifically: despite the powerful mathematical structure that a matrix provides for the purpose of analysis, we do not advocate its use for the actual processing of data. Researchers [75] have already tried to optimize computer architectures as well as software development to use the matrix as the only data structure, but it turns out that this might not be an optimal solution. Other information processing structures, such as spreadsheets and database systems, might outperform the processing of matrices. Therefore, we model performance data as matrices only at a conceptual and functional level, such that the approach is beneficial for defining, developing, and managing the analysis methods and algorithms. The actual coding of these methods or algorithms in specific programming languages and on particular computer architectures need not be restricted to the matrix as the only data structure, and more efficient implementations should not be overlooked. The actual implementation of the analysis methods and/or algorithms will thus benefit not only from the rigorous mathematical structure of matrices, which allows the development of analysis techniques that overcome the limitations of existing tools, but also from efficient implementation strategies.

1.3 Matrix Transformations

Matrix transform techniques are often used when systems are modeled using matrices. In general, transformation methods have always been a key to solving many engineering problems that are hard to analyze in the form (or domain) in which they physically evolve. These transformations often make available mathematical tools and representations of the problem that greatly enhance its understandability, compared to what we have in the original domain where the problem actually evolved. Linear transformations have traditionally been done with matrices in the context of vector spaces [4, 101], and they often transform the shape of the problem to another shape (or format) which may have some advantages over the first.
The commonly used (linear) transformations include: the Fourier transform, to analyze the spectrum of a signal by converting it from the time domain to the frequency domain; the Laplace and z-transforms, to design continuous-time and discrete-time systems, respectively; and the similarity transformation, also used to design and analyze both continuous- and discrete-time systems by converting the state-space realization of the system from its original form to another form.

Matrices have also been used to implement various types of compiler optimizations, particularly loop transformations. [66] describes a loop transformation toolkit that uses non-singular matrices for loop restructuring. The purpose is to restructure parallel nested loops in such a way that data dependencies are eliminated, yielding coarse-grain parallelism at the outer loops. This method is based on a transformation from one iteration space (with data dependence at the outer loop) to another iteration space that does not carry the data dependence. Without such transformations, it is not trivial to determine this type of loop restructuring.

We have used matrix transformations in the context of modeling the performance data as a matrix. The purpose is the same as mentioned above, i.e., conversion of the performance data to forms that are easier to analyze. Performance data in its original form, even when modeled by a matrix, is not very useful for analysis due to its large number of dimensions. Usually, a number of parameters are recorded as fields in a record whenever some event occurs. Sometimes, more than one record must be logged in response to the occurrence of a specific event. All the fields and/or records related to an event contribute to the dimensionality of the performance data. The key to multiple-domain analysis (as will be explained in detail in the next section and in chapter 5) is to selectively hide some dimensions of this data, using appropriate transformation techniques. This process transforms the data from the "performance" subspace to another, lower-dimensional subspace (typically one- or two-dimensional) which can be analyzed by several available mathematical tools. It will be shown in later chapters that the characterization of the performance data in transformed subspaces further enhances the efficacy of the analysis.
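As a preview of these transformations, the following sketch projects a hypothetical metric-based data matrix (processors by time intervals; the real matrix is defined in chapter 3) onto one-dimensional subspaces: hiding the processor dimension yields a temporal-domain performance signal, and hiding the time dimension yields a spatial-domain profile that can be rendered as a performance image. The matrix contents here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical metric-based data matrix: rows = processors, columns = time
# intervals, entries = utilization of a processor during an interval.
P, T = 16, 256
M = rng.uniform(0.0, 1.0, size=(P, T))

# Temporal domain: hide the processor dimension -> one signal over time.
temporal_signal = M.mean(axis=0)              # length-T performance signal

# Spatial domain: hide the time dimension -> one value per processor,
# arranged on a 4 x 4 grid for display as a performance image.
spatial_image = M.mean(axis=1).reshape(4, 4)

print(temporal_signal.shape, spatial_image.shape)   # (256,) (4, 4)
```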
Consider this excerpt: The history of science shows us that a given problem can fruitfully be examined by different disciplines, partly because when it is viewed from new perspectives, implications of alternative assumptions are explored by researchers with different backgrounds or interests, and partly because new techniques developed elsewhere are imported to explore areas left untouched by the discipline in which the problem originated. [5]

The objective of multiple-domain analysis is to define performance in terms of analysis domains, where certain domains have their origin in disciplines other than parallel computer system performance evaluation. Multiple-domain analysis has been developed by:
1. defining a set of domains,
2. characterizing domains via performance images and signals (to be defined in chapter 4),
3. transforming performance data into usable forms in each domain, and
4. applying visualization and analysis techniques in each domain (using domain-specific tools).

Depending on the types of analyses to be done on the performance data, we have identified four related analysis domains, which can be characterized either by performance images [93] or performance signals [94]: spatial, temporal, event, and frequency. [109] has emphasized the use of the application domain to understand algorithm-specific behavior of programs by using application-specific displays and animations. The application domain is of interest as well, but is distinctly different from these general domains. These analysis domains derive from the multiple dimensions associated with performance data. Each analysis domain focuses on a particular dimension of the data, hiding the details contributed by the other dimensions. Analysis domains are rigorously defined in chapter 5, but they are introduced here in an intuitive manner; a small sketch of the corresponding projections follows this paragraph. The spatial domain emphasizes the range of values of a performance metric over the "space" of the system, where "space" represents some regularly-structured ensemble of processors (or processes). This domain can be characterized by performance images. The temporal domain represents the distribution of values of a performance metric during the execution time of a program. This domain can be characterized by performance signals. The event domain is similar to the temporal domain, except that a performance metric is expressed in terms of events rather than time [1]. That is, the occurrence of an event (a state transition) is significant, not the time between events (the duration of a state). The frequency domain is also related to the temporal domain. It shows the spectrum of a performance signal and concentrations of energy at particular frequencies (rates of change of a performance metric). It emphasizes variations of a performance metric and is useful for the types of analyses that will be performed on performance signals; the frequency domain is important because rapid changes in program behavior make themselves apparent as rates of change of a metric. The event and frequency domains also can be characterized by performance signals. Performance analysis via these domains takes advantage of analytical tools specific to these domains.
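The following minimal sketch illustrates these projections on synthetic data (in Python/numpy; the thesis prototypes used Matlab and AVS, and the metric U[p, n] here is hypothetical). Each domain is obtained by hiding the dimensions that the domain does not emphasize:

    import numpy as np

    # Synthetic utilization metric U[p, n]: P processors, N time steps, with an
    # oscillation at 5 cycles per trace injected to give the spectrum a peak.
    P, N = 4, 256
    rng = np.random.default_rng(0)
    t = np.arange(N)
    U = (0.5 + 0.4 * np.sin(2 * np.pi * 5 * t / N)) * np.ones((P, 1))
    U += 0.05 * rng.standard_normal((P, N))

    spatial = U.mean(axis=1)      # spatial domain: one value per processor,
                                  # time dimension hidden (a performance image row)
    temporal = U.mean(axis=0)     # temporal domain: a performance signal over n,
                                  # processor dimension hidden
    events = np.flatnonzero(np.diff((U[0] > 0.5).astype(int)))
                                  # event domain: indices of state transitions,
                                  # ignoring the durations between them
    spectrum = np.abs(np.fft.rfft(temporal - temporal.mean()))
    print(spatial.round(2), events[:4], np.argmax(spectrum))  # peak near bin 5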
The multiple-domain analysis methods are based on strategies typical of scientific visualization, as both scientific visualization and performance visualization share the objective of rendering and projecting large volumes of multidimensional data. These strategies include focusing and linking [14] and the use of image and signal processing techniques to enhance the visual analysis [125]. The main contrast between scientific visualization and performance visualization is that the former renders the natural, physical structure of the system under study, while the latter contends with a hierarchy of structures, including the higher-dimensional logical structure of the system under study. Thus there are challenges. Effectively applying image processing techniques to performance data relies on representing the performance in at most three dimensions, either rendering the physical structure of the machine or selected aspects of the logical structure of the system (program and machine), i.e., exploiting logically related system data. For signal processing, accurate results depend on appropriately characterizing the nature of the signal projected from the performance data. Hence, the key to using image or signal processing methods for performance analysis is the meaningful transformation of the performance data into an image or signal.

Performance images and signals enhance visual analysis. Images or signals become the basis of analysis and can be compared across architectures, algorithms, and problem sizes. Performance signals also enhance aural analysis, as auralization of program execution is receiving more attention with multimedia concepts becoming popular in visualization [13, 38]. For example, signal processing techniques (filtering, correlating, etc.) can be applied to aural data to assist the user in understanding what is being heard. Auditory signatures can be reinforced visually with graphs of signals, and the signals can be parameterized and classified. In short, a performance image or signal is to be used hierarchically with other, more conventional views, contributing to a macroscopic display of system state in a particular domain.

The question of interest is: are advanced methods radical departures from what is found in "normal" performance analysis tools? The answer is both "no" and "yes". "No," because tools often already support the presentation of data (e.g., multivariate or time-series plots); but "yes," because tools typically do not support the analysis of data. So the similarities provide a common denominator between conventional and advanced methods, and the differences help the user answer questions about performance. Either way, there is utility to the user. For example, views of performance images and signals generated with the standard tools AVS (Application Visualization System) [8, 9] and Matlab (from The MathWorks, Inc.) [71], respectively, have analogs in ParaGraph [50] (such as the Processor Status and Aggregate Processor Utilization displays, respectively). Unlike ParaGraph, the standard tools provide many well-known and tested functions for analyzing the data in the views. Thus, the approach is based on integrating performance tools and standard data visualization/analysis tools, using a matrix representation of performance data, in a performance analysis environment organized as a "toolkit".

1.5 Scope of this Work

In this chapter, we have introduced the motivation of this work in general terms.
This has been an inter-disciplinary effort involving diverse areas such as parallel and distributed computing, program performance monitoring and modeling, matrices and linear algebra, statistical analysis (in general) and analysis of stochastic processes (in particular), systems theory, digital signal processing, and digital image processing. Figure 1-4 shows this approach from the perspective of a user of a parallel system. Various disciplines are the source of candidate resources that are called upon to achieve the objectives of visual, statistical, and various other types of analyses that will be presented in chapter 5. An analysis objective such as performance prediction is part of the future directions of this research work.

[Figure 1-4. Multiple-domain analysis approach from a user's perspective: resources from multiple disciplines (statistics, signal processing, and others) are applied to meet objectives in multiple domains, such as exploration and prediction.]

In more specific terms, the primary objectives of this work can be divided into two categories: theoretical and practical. The theoretical objectives will be dealt with implicitly throughout this thesis, and we will explicitly point them out as they apply to various practical implementations of the analysis methods. The theoretical objectives of this work are:
1. To overcome the limitations (mentioned in section 1.1) of existing performance analysis techniques and tools.
2. To develop performance data presentation strategies that can enhance visual and statistical analysis.
3. To glean information about high-level programming abstractions from performance data that typically consists of low-level details of system behavior.
4. To recognize the observed performance data obtained through instrumenting parallel programs as a realization of a stochastic process (the characterization of whose nature remains a future direction of this work). This will lead to the determination of various statistical characteristics of this data, in order to consolidate the theoretical foundation that is the basis of many analysis techniques presented in later chapters.
5. To narrow the gap between purely analytical techniques of performance evaluation and purely application-specific techniques.

The following objectives related to the practical implementation of these theoretical objectives have been achieved, so far, by this research:
1. Modeling of performance data by using a matrix approach.
2. Transforming the performance data matrix representation to other lower-dimensional subspaces (domains).
3. Applying multiple-domain analysis methods after transforming the performance data to these domains, using statistical signal processing and digital image processing techniques on the one- or two-dimensional transformed domains. These analysis methods depend heavily on the matrix representation of the performance data and its transformations to other domains. Therefore, all the analysis methods presented in this thesis are rigorous and hence scalable and extensible.
4. Using generic, commercially-available data analysis and visualization tools to implement the above analysis approach. For the case studies appearing in this thesis, AVS and Matlab have been used for this purpose.

This chapter has presented the context and motivation of the research work presented in this thesis. A survey of the related work will be presented in chapter 2.
Chapters 3 to 5 establish the theoretical framework for the multiple-domain analysis methods. Readers who do not intend to go through the mathematical framework presented in chapters 3 and 4 can go directly to chapter 5, which discusses the multiple-domain analysis methods, without loss of continuity. Chapter 6 is devoted to case studies that establish the validity of the multiple-domain analysis approach formulated in the earlier chapters.

Chapter 2
Related Work

Performance data representation has not been a thoroughly researched area, but a number of performance tools have dealt with this problem, either explicitly or implicitly. Performance data is captured from the execution of a parallel program using the available instrumentation options, i.e., hardware or software instrumentation [104]. This is often referred to as "raw" trace data because its format is governed by the semantics defined by the instrumentation system, and it is not very useful for analysis by the user (other than brute-force textual analysis). Performance analysis tools take this raw trace data and use it to drive their displays, which are designed exclusively for the purpose of program performance visualization. Most visualization tools (with very few exceptions) use a specific instrumentation system to obtain the trace data that they can process. Therefore, the data semantics are handled within the tool.

We have classified the related work on performance analysis methods, approaches, and tools into three categories: 1. analytical methods and tools; 2. monitoring-based methods and tools; and 3. application-specific methods and tools. This classification is based on the analysis methods (or approach) that the tools use, i.e., either generic (formal and rigorous) analysis methods or application-specific methods. Figure 2-1 shows this classification in graphical form. As we move towards the left side of this figure, the analysis methods become more generic and mathematically powerful. On the other hand, as we move to the right, the analysis methods become less rigorous and more program-specific. We present the related work starting from generic methods and ending with application-specific methods. In a sense, generic and application-specific methods take entirely different approaches to the problem but share a common objective, i.e., the performance analysis of parallel computer systems. Each approach has certain advantages over the other. Parallel system monitoring-based methods represent a compromise between the two. Monitoring-based methods depend on measurements of the system parameters when the system already exists, while analytical methods are important for answering "what if" types of questions at the design stage of a system. Application-specific methods are most useful when a programmer can portray an informative picture of an algorithm. There are other classifications of performance analysis methods, based on criteria different from those considered here. For instance, [69] classifies performance analysis methods into three areas: monitoring, benchmarking, and prototyping; there, the objective is the design of a parallel system and not the performance evaluation of a program on an existing system. Similarly, several other classifications are possible. We discuss the methods and give an overview of the work being done in each of these categories. For each, the first part discusses the approach, and the second part discusses the tools that are being used to implement the approach.
[Figure 2-1. Classification of performance analysis methods and approaches, from generic to application-specific: analytical methods (Markov models, queuing models, Petri nets, simulation models), monitoring-based methods (conventional and advanced), and application-specific methods (algorithm animation, application-domain visualization).]

2.1 Analytical Methods

Performance analysts believe that the pace of developments in computer system design has always overwhelmed the development of adequate and unified theoretical characterizations of these systems [33, 56, 69]. The importance of appropriate analysis methods becomes obvious as system complexity increases. Analytical methods are based on building appropriate mathematical models of the computer system (and computation) to better understand the system and provide insight to the designer. Such models are most appropriate for making various decisions at the design stage of a system, by analyzing the behavior of the system under alternative architectural choices. The complexity of analytical techniques can result from the abstract nature of the performance evaluation objective, such as optimizations applied to hardware, operating systems, compilers, networks, and so on. A number of commercially available parallel and distributed systems have been developed under different design and performance goals. Their relative merits are not yet understood well enough to enable the system designer to use appropriate models to analyze and predict performance and to compare various design alternatives. Therefore, the evaluation of performance is still an actively researched problem. We will survey the discipline of analytical modeling in terms of the models/methods that are often applied to parallel systems.

[33] divides an analytic study of a computer system into three parts: (1) model formulation; (2) model solution (or simulation); and (3) model validation. Analytical models are solved either symbolically or numerically in order to obtain the desired results. A model of a parallel system consists of two parts [69]: 1. a description of the architecture, and 2. a definition of the workload under which performance predictions are to be obtained. Figure 2-2 represents a general framework for evaluating analytical models according to the steps described above. It should be noted that various architectural parameters are expected to behave in a non-deterministic (or random) manner under a given workload. Therefore, a realistic analytical model has to be stochastic, in general.

[Figure 2-2. Phases of an analytical study of a parallel system: an architecture and workload description feeds model formulation; model solution or simulation then yields a performance prediction.]

In the following sections, we will discuss four generic parallel computer system modeling techniques: Markov models, queuing models, Petri nets, and simulation models; and we will survey a few tools based on these modeling methods.

2.1.1 Markov Models

Markov models (also known as Markov processes) are considered one of the best modeling techniques for stochastic processes in terms of their mathematical tractability. Markov models are a compromise between the real-world dependencies involved among various physical phenomena and the theoretical requirement of "independent" phenomena that makes the calculations possible [88]. The class of Markov processes that is often used for computer system modeling is called Markov chain processes.
Markov chains are built by considering the stochastic process (e.g., a parallel computation) as consisting of a set of states that are visited once or repeatedly over time. This set of states is termed the state-space of the system. A Markov chain process is defined as one whose transition to a future state from its present state depends only on the present state, and not on the complete past history of its states up to the present state. Therefore, dependencies for predicting a future state are restricted to the present state only, which is independent of the other states. This type of dependency structure proves to be a very powerful tool for building solvable models of fairly complex real-world phenomena. There are three issues involved in modeling a computer system as a Markov chain process, given by [100] as:
1. defining the process as a Markov chain according to the definition given above, by analyzing the dependencies of the various states of the process;
2. mapping computer system models to Markov processes; and
3. solving Markov processes.

Since a multicomputer system exhibits very dynamic behavior that strongly depends on contention for and sharing of system resources at a given time, a model of such a system relies on representing the internal states of the system. A detailed representation of internal states (i.e., of the dependencies among them) is sufficiently complex that it leaves us without any method of solving the model other than simulation. This happens when we want to analyze the states of the system in response to the instruction and data streams of individual programs. The representation of current and past states provides a "memory" of the system's history. If this memory includes very few details, then it will be difficult to predict future states accurately; however, inclusion of a large number of states makes the numerical solution of the model impractical. As mentioned above, Markov processes are considered a compromise between these two extremes. We can construct a Markov model of a given computer system as described (in words) in the following [100]: Assume that we represent all possible states of the system by a set of mutually exclusive and collectively exhaustive states. Also assume that the future behavior of the system depends only on the current state of the system and is independent of previous states of the system. The times between corresponding entrances to and departures from states ("holding times" or "transition times") are independent and identically (exponentially) distributed. Then the states of this system, together with a set of transition probabilities between these states, correspond to a Markov chain process.

After setting up this model, the next step is usually to represent the state transitions in matrix form. This representation reduces various types of calculations regarding the system model to simple matrix manipulations [4, 88] (a small numerical sketch of such manipulations follows the figure). The transition probabilities represented by the matrix can also be represented graphically, as shown in Figure 2-3. The circles show the states of the system during the execution of a program, and p_ij is the probability of transition from state i to state j.

[Figure 2-3. Markov chain representation of system-program behavior: states shown as circles, with transition probabilities such as p12, p13, p21, p23, and p32 labeling the arcs.]
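The following minimal sketch (in Python/numpy, with illustrative transition probabilities that are not taken from the thesis) shows the two most common matrix manipulations on such a model: evolving the state distribution over time, and finding the stationary distribution:

    import numpy as np

    # A three-state program model: row i of Pm is the distribution over next
    # states from state i, so each row must sum to 1.
    Pm = np.array([[0.2, 0.5, 0.3],
                   [0.6, 0.1, 0.3],
                   [0.4, 0.4, 0.2]])

    pi0 = np.array([1.0, 0.0, 0.0])             # start in state 0
    pi = pi0 @ np.linalg.matrix_power(Pm, 50)   # distribution after 50 transitions

    # For an ergodic chain this converges to the stationary distribution,
    # obtainable directly as the left eigenvector of Pm for eigenvalue 1.
    w, v = np.linalg.eig(Pm.T)
    stationary = np.real(v[:, np.argmax(np.real(w))])
    stationary /= stationary.sum()
    assert np.allclose(pi, stationary, atol=1e-6)
    print(stationary)   # long-run fraction of time spent in each state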
The solution of this model for more complex structures has also been an active area of research. [55] describe a technique for solving these models that uses fully symbolic, computer-algebra based methods at the earlier stages of Markovian process analysis and numerical methods at the later stages. [4] provides some examples of solving these models by purely numerical techniques, which work well for predicting future states of the system with simple models. Markov models have been used to model various aspects of parallel and distributed computer system behavior and performance. [2] analyze the performance of the interprocessor communication architecture of the CM-2 by developing a discrete-time Markov chain model of its network architecture. The model shows the use of simplifying assumptions in a practical situation, and the predicted results compare favorably with the results revealed by a simulation study. [1] analyzes the time-dependent behavior of programs by using the program execution sequence to build a homogeneous semi-Markov chain model. This model can predict system performance with program parameters different from those used in the input program execution sequence.

2.1.2 Queuing Models

Queuing models are useful analytical tools for the analysis of systems in which conflicts develop when several entities try to access the same resource simultaneously, creating a loss of time for the system as a whole to perform a job. Although queuing models can be taken as a special case of Markov chain models, their widespread use to model multicomputer network communication and contention for resources among various jobs makes them worth discussing independently of Markov models. The study of queuing models is a discipline in its own right, known as queuing theory. Queuing models are based on two abstractions: servers and customers. Servers are the resources of the system that have to be shared or utilized by the customers. A queue is another abstraction of this model, where customers arrive and wait for service while the servers are busy serving other customers. This basic setup of the model is represented by Figure 2-4, which shows the servers as rectangles and customers as circles; a queue is shown as a line of customers with vertical bars on both sides. Queuing models consisting of more than one server are referred to as queuing network models.

[Figure 2-4. Ingredients of a basic queuing model: new customers arrive and wait in a queue, are served by one or more servers, and exit after completion of service.]

There are three stochastic processes that have to be defined to parameterize this model. These processes (or random variables) are: (1) the inter-arrival times of customers (the input process), (2) the service time of each customer, and (3) the number of customers waiting in the queue at a given time.
For modeling a computer system, often three types of devices are encountered [58]. These devices are: . Devices that have a single server, whose service time does not depend on the number of jobs in the device. Such devices are called fixed-capacity service centers. For instance, CPUs in a system may be modeled as fixed-capacity service centers. . Devices that do not exhibit “queuing” behavior, and jobs spend the same amount of time in the device regardless of the number of jobs in them. These devices can be mod- eled as a service center with infinite number of servers and are called delay centers. A group of dedicated terminals is usually modelled as a delay center. - Devices whose service rates may depend on the load (i.e., number of jobs in the device) are called load-dependent service centers. An interconnecting network between the nodes of a parallel computer system is an example of a load-dependent service center. It is common practice to consider the service times of all servers in a computer system as exponentially distributed [58, 69]. However, there are various cases listed by [58] where queuing network models are not very accurate. [115] use a queuing networks to model a commercial multiprocessor bus. Important characteristics of the bus such as asynchronous memory write operations, in-order delivery of responses to processor read requests, priority scheduling of memory responses, upper bound on the number of outstanding processor requests, and so on, can be modeled accurately. 30 2.1.3 Petri Nets A petri net is a graphical modeling tool for description and analysis of concurrency and synchronization in parallel systems. Petri nets were introduced by C. A. Petri in 1962 and are widely used to model asynchronous systems and concurrent processes. The theoretical problems associated with petri nets have been thoroughly investigated and therefore, it has sufficient mathematical structure to support formal analysis of parallel systems. The success of petri nets is mainly due to the simplicity of the model that works well to depict the complexity of large-scale systems. Many extensions have been added to the basic petri net model to facilitate their use for different application fields. These extensions include timed petri nets for quantitative performance analysis of systems, stochastic petri nets which uses random variables to specify the behavior of the model with time, as well as others [69]. Stochastic petri nets are considered more attractive as a modeling tool for analysis of multiprocessor systems. The structure of a standard petri net is a graph that consists of places, a set of transitions, and a set of directed arcs. A place represents some conditions and a transition represents an event. Arcs connect places and transitions to each other. A place is an input to a transition if an arc exists from the place to the transition, and is an output of a transition if an arc exists from the transition to the place. Therefore, the set of arcs can be partitioned into a set of transition input arcs and a set of transition output arcs. Figure 2-5 is a graphical representation of an example petri net, taken from [69]. Circles (or nodes) represent places and bars represent transitions. Tokens are placed on place nodes to represent that a certain condition is held on that node. A transition bar (an event) can fire (occur) if all the nodes (conditions) that input to that transition bar have tokens (i.e., are holding conditions). 
2.1.3 Petri Nets

A Petri net is a graphical modeling tool for the description and analysis of concurrency and synchronization in parallel systems. Petri nets were introduced by C. A. Petri in 1962 and are widely used to model asynchronous systems and concurrent processes. The theoretical problems associated with Petri nets have been thoroughly investigated; therefore, they have sufficient mathematical structure to support the formal analysis of parallel systems. The success of Petri nets is mainly due to the simplicity of the model, which works well to depict the complexity of large-scale systems. Many extensions have been added to the basic Petri net model to facilitate its use in different application fields. These extensions include timed Petri nets, for quantitative performance analysis of systems, and stochastic Petri nets, which use random variables to specify the behavior of the model over time, as well as others [69]. Stochastic Petri nets are considered the more attractive modeling tool for the analysis of multiprocessor systems.

The structure of a standard Petri net is a graph that consists of places, a set of transitions, and a set of directed arcs. A place represents a condition, and a transition represents an event. Arcs connect places and transitions to each other. A place is an input to a transition if an arc exists from the place to the transition, and is an output of a transition if an arc exists from the transition to the place. Therefore, the set of arcs can be partitioned into a set of transition input arcs and a set of transition output arcs. Figure 2-5 is a graphical representation of an example Petri net, taken from [69]. Circles (or nodes) represent places and bars represent transitions. Tokens are placed on place nodes to represent that a certain condition holds at that node. A transition bar (an event) can fire (occur) if all the nodes (conditions) that input to that transition bar have tokens (i.e., are holding conditions). When a transition bar fires, it removes one token from each of its connecting input nodes and places one token on each of its connecting output nodes [22] (a small sketch of this firing rule appears at the end of this subsection). This firing of transitions is called the execution of the Petri net. For instance, if P1 in Figure 2-5 contains a token, then transition t1 will be enabled. The states of a Petri net can be determined by observing the collection of names of the nodes that hold tokens; the number of tokens a node holds is equal to the number of instances of the node's name in the state. This procedure is called the marking of the Petri net.

[Figure 2-5. Example of a Petri net, with places P and transitions t connected by directed arcs.]

Petri nets and their variations are graphical, mathematical tools that can model the following characteristics of a concurrent computer system [22]:
- representation of concurrent execution of multiple processes;
- representation of nondeterministic and asynchronous executions;
- decomposition and composition into many graphs; and
- representation of the model's structure and dynamic behavior.

Petri nets are also used to model the behavior of a processing unit that needs a resource to perform some action [69], the behavior of file systems [22], etc.
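The token-based firing rule described above is easy to state operationally. The following minimal sketch (in Python; the two-transition net and its place names are illustrative only, not taken from [69]) implements enabling and firing directly:

    # Marking: number of tokens currently held by each place.
    marking = {"P1": 1, "P2": 0, "P3": 0}

    # Each transition lists its input places (all must hold a token for the
    # transition to be enabled) and its output places (each receives a token).
    transitions = {
        "t1": {"in": ["P1"], "out": ["P2", "P3"]},   # fork into two conditions
        "t2": {"in": ["P2", "P3"], "out": ["P1"]},   # join: a synchronization point
    }

    def enabled(t):
        return all(marking[p] >= 1 for p in transitions[t]["in"])

    def fire(t):
        assert enabled(t)
        for p in transitions[t]["in"]:
            marking[p] -= 1              # remove one token from each input place
        for p in transitions[t]["out"]:
            marking[p] += 1              # deposit one token on each output place

    fire("t1")                           # t1 is enabled because P1 holds a token
    print(marking, enabled("t2"))        # {'P1': 0, 'P2': 1, 'P3': 1} True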
2.1.4 Simulation Modeling

Simulation models are used to model the operation of a system, rather than the structural components of the system [64]. Simulation is a very useful technique for solving an analytical model of a complex stochastic system that is very difficult to solve numerically or symbolically. Usually, simulation is used in conjunction with analytical modeling of a system to verify the results. Simulation models also have their own advantages, beyond being a tool for verifying analytical models, which are discussed in [58, 63, 64]. Simulation allows a much more detailed study of a system than analytical models; in this case, a more detailed model is considered a better model, as it makes fewer assumptions. Simulation studies are useful for the analysis of systems even after they are realized, because these models do not perturb the behavior of the system being analyzed. There are several types of simulations used for computer system performance analysis, including emulation, Monte Carlo simulation, trace-driven simulation, and discrete-event simulation. Simulation models are translated into simulation programs that generate statistical information regarding the model, and this simulation data is analyzed by various statistical techniques [58].

2.1.5 Analytical Modeling Tools

In this section, we briefly describe some typical analytical modeling tools for analyzing parallel system performance. The description of these tools stays within the scope of the theoretical details of the analytical models discussed so far.

Abrams, Doraswamy, and Mathur describe a tool called Chitra that produces a semi-Markov chain model of the program execution sequence (PES) observed during the execution of a program [1]. The design of Chitra has two aspects: (1) building a semi-Markov chain model to represent the program behavior, and (2) using methods often applied in signal processing and in post-processing of simulation output, in addition to conventional visualization methods. Program behavior is modeled by considering the various states of a program (the state-space of the Markov chain). Transitions between these states might occur due to conditional branches in the program. A semi-Markov chain process models general state occupancy time distributions, which is considered appropriate for software. The information obtained from the PES is used to determine the transition probabilities among the different program states.

Chang et al. [22] have developed an analytical modeling tool-set consisting of a modeling tool, TPQN; a textual specification language, TPQL; and a simulator, TPQS. This modeling tool integrates two popular analytical modeling approaches for parallel computer systems, timed Petri nets and queuing networks, to model complex real-time systems. Existing modeling methods can model queuing behavior with queuing networks, or the asynchronous behavior of concurrent processes with timed Petri nets; used individually, these methods are inadequate for modeling complex systems (including multiprocessor operating systems and distributed systems). Complex systems require an integrated model comprising both modeling approaches, and this integrated approach has been applied to real-time multiprocessing systems. The model is defined by the user in the specification language TPQL, which provides a textual specification of the graphical entities of a TPQN model. TPQL is used as the input to the simulation system TPQS.

Funka-Lea et al. [40] describe an analytical modeling tool called Q+, developed by AT&T Bell Laboratories. Q+ is an interactive graphical tool for performance modeling that allows the user to specify the model graphically and also represents the simulation output graphically. It is made up of six components: the graphics editor, text editor, Monte Carlo simulator, language interpreter, browser, and utility set. A user can specify the system as a queuing network that consists of two entities: nodes, which are graphical sites in the network, and transactions (jobs, customers, etc.) circulating among the nodes. A node has a queue associated with it that can hold transactions; a queue's capacity can be positive and finite, or infinite. These model components are placed and connected together with the graphics editor, and the model is parameterized using the text editor. Q+ uses Monte Carlo simulation to analyze the queuing model of the system and allows restructuring of models through calls to a C library interface. Q+ supports a hierarchical approach to system modeling by allowing user-defined subnetworks that can be incorporated into other models. Q+ has been used to model many systems, ranging from large national telecommunication networks to microprocessor internals, with model sizes ranging from one to several thousand nodes. Perhaps the greatest advantage of Q+ is that it allows system designers to build their own models without requiring any analytical modeling expertise. Therefore, system designers can simulate the model to decide among alternative design options without having to explain the system to a modeling specialist. Q+ also provides interfaces to two analytical packages for solving queuing networks.

2.2 Monitoring-Based Methods

Monitoring-based methods are effective for the performance analysis of programs on existing systems. The behavior of a parallel computer system is monitored by observing various system and/or program parameters during the execution of a program, through hardware or software means. These methods have been developed to provide the user with feedback about the behavior of a program executed on a parallel computer system. The objectives of analyzing an existing parallel system are usually different from those of a system in its various design stages.
Design choices for the system hardware and interconnection network topology have already been made, and the performance of the system depends heavily on the programs that are executed on it. Problems solved on a parallel system are expected to achieve significant speed-ups to justify the cost of the hardware as well as the development of a parallel program for a given problem. Speed-up alone, however, is not enough to represent all aspects of performance analysis, because the performance of a typical parallel system is a complex, multidimensional assessment that depends on various dynamic aspects of the system hardware as well as software, which might not have been specified prior to the execution of the program. From a user's perspective, the following questions can be asked about the behavior of a program executed on a parallel system:
1. Is the program running correctly?
2. How well is it utilizing the available resources of the system to get the maximum benefit of concurrent processing?
3. How can the performance of the program be improved?

The first question concerns issues related to program debugging, while the second and third questions are typically related to program performance evaluation and optimization. Casavant [20] describes two of the most "ominous" characteristics of parallel programs that make them difficult to understand with the traditional techniques of parallel algorithm specialists, formal language experts, and operating system researchers. These characteristics are topological mapping and synchronization. The topological mapping issue demands that the spatial relationships between program data and processors be understood. The synchronization issue demands correlating the events occurring at different times within a processor ensemble. When a program using even only a few processors is monitored, it typically generates a volume of data too huge to be assimilated by the human senses. The idea of program visualization was introduced to attack this problem by graphically representing the numerical data. Visualization exploits the visual perception capabilities of a human user, which are much more powerful than the textual perception capabilities [116, 117]. Several program visualization tools, such as Seecube, GIST, ParaGraph, and SHMAP, are considered pioneers in this field [23, 27, 50].

The use of program monitoring with visualization for the purpose of performance analysis and optimization can be illustrated as shown in Figure 2-6. The process starts with the programmer developing a program to solve a problem on a parallel system. This program is instrumented, usually by instrumentation libraries linked with the program. When the program is executed, it generates observed information about the occurrence of various events during the execution, in addition to the required results of the program. The observed trace data is typically a multi-dimensional dataset, which is analyzed by using the various displays of the available visualization tool to represent several perspectives of the program behavior. The programmer has to develop a "mental bridge" [95] to link the behavior and performance of the program, as represented by the visualization displays, to the actual program.

[Figure 2-6. Monitoring-based methods for performance analysis: program development and instrumented execution produce both program results and trace data; visualization of the trace data provides performance feedback, which the programmer compares against the program.]
Establishing this mental bridge is necessary both for debugging the program for correctness and for optimizing its performance. The program monitoring approach is not as general and mathematically rigorous as the analytical modeling approach. This lack of generality and rigor has advantages as well as disadvantages. It is beneficial for the purpose of performance analysis because:
- It is the most direct approach to analyzing a particular program on a real parallel system, without having to make any simplifying assumptions about the behavior of either.
- The behavior of various system and program related parameters, such as processor utilization, message traffic on various communication channels, memory references, cache hits or misses, lengths of system queues, routing protocols, etc., can be observed, which contributes directly to the programmer's understanding of the program's activities during its execution.
- A programmer does not have to build a model of the program, which would require specialized model-building expertise.
- The cost of instrumenting the program, in terms of the programmer's time needed to instrument it, is usually insignificant.

The program monitoring approach is not free from limitations. Its major limitations include:
- The lack of rigor in the approach typically manifests itself as a lack of scalability of monitoring-based approaches for large-scale parallel systems or for the large volumes of trace data generated by large numbers of events.
- A lack of extensibility, due to which it is difficult to perform analyses that are not supported or provided by a visualization tool. Often the trace data semantics are such that the data cannot easily be used with any other analysis tool.
- An analytical approach can help decide on an optimal design alternative because different options can be compared against the same model. This type of comparison among various parallel implementations of an algorithm, which could help the programmer improve performance, is usually non-existent in monitoring-based program visualization tools.
- It is an intrusive approach and perturbs the normal execution of the program.

As mentioned in Chapter 1, monitoring-based methods were introduced to cater to the needs of users who could not benefit adequately from existing analytical techniques. The limitations of the monitoring-based methods have, in turn, resulted in improvements to the approach behind these methods; for example, researchers are developing new representation and analysis paradigms. To put these efforts in an appropriate perspective, this section has been divided into two parts: conventional and advanced. Program monitoring tools will be discussed together with the analysis approaches they use, because almost every tool represents the realization of a specific approach. Therefore, these approaches have been categorized as conventional and advanced, with certain aspects discussed in the following subsections. Performance data instrumentation techniques will not be discussed, as instrumentation systems are beyond the scope of this thesis; the reader is referred to [103, 109] for more details.

2.2.1 Conventional Methods

We refer to performance analysis methods and tools that do not process, analyze, and present (visualize) the performance data by rigorous means as conventional methods. Most visualization tools that replay the program execution based on execution trace data fall into this category.
Information is presented in typical program visualization displays along with some single-value metrics, such as the average system utilization, the cumulative busy or idle time of each processor, the number of messages sent or received, etc. Typical graphical displays include: time-series displays (strip charts), space-time diagrams, Gantt charts, Kiviat diagrams, bar graphs, dials, LED displays, x-y plots, matrix displays, pie charts, contour plots, three-dimensional scatter plots, program call graphs, architecture topology graphs, meters, vector plots, etc. [103]. These are exemplified in the specific tools that we discuss here. There are three common points among all of the above typical visualization displays, and hence among the tools that use them:
1. These displays show the temporal and/or spatial activity of the program on a parallel system.
2. The displays are driven (or updated at each step) by the event information obtained from the trace data, which is ordered according to the event occurrence times, by applying some rule.
3. The learning time for these tools is relatively small, as the displays tend to be intuitive.

The rule for updating a display is typically not rigorous and depends heavily on the data semantics defined by the instrumentation system. Analyzing the graphical displays and relating them back to the program has to be done "manually" by the user. This approach might seem very rudimentary, but according to several surveys providing user feedback about these tools [23, 61, 77, 92], the number of users employing these tools as part of their program development environment is much larger than the number using performance analysis tools of any other type, probably due to their simplicity. Therefore, despite their limitations, these tools cannot be overlooked, because various advanced tools (discussed in section 2.2.2) have been developed based on the experiences with them.

Conventional analysis methods are mainly of the following types: visual, query-based, debugger-based, heuristic-based, statistical, and textual. Conventional analysis methods and tools will be discussed according to this classification. Visual analysis involves graphical displays of information and the ability of the user to visually perceive and discern features, trends, patterns, etc. in the displays. ParaGraph can be considered a representative tool of this category [50]. There are very few program visualization tools that can match the variety of displays provided by ParaGraph. It uses the trace data obtained by instrumenting programs with the Portable Instrumented Communication Library (PICL) [44]. Several of its displays that contribute to visual analysis include the processor count, Gantt chart, space-time diagram, concurrency profile, utilization meter, Kiviat diagram, message queues, and task graphs. The SHMAP (Shared Memory Access Patterns) tool is designed to aid in the study of memory access patterns, cache strategies, and processor assignments for matrix algorithms in shared-memory systems. There are two main displays, one showing loads from main memory and the other showing stores to main memory; each memory access results in the illumination of the corresponding memory element. A number of general approaches to envisioning information are described by Tufte [116, 117]; one is "analysis by small multiples" for detecting changes between views (which we routinely apply without naming it per se).
Another approach that is routinely used is "analysis via multiple views," stated more formally in [65], whereby multiple views assist us in navigating through the space of processor/process states, interactions, and time. Another notable example of visual analysis is "projection pursuit," whereby the user refines hypotheses about performance by following a path of selective projections of multi-dimensional data [28].

Query-based analysis involves techniques typical of database or spreadsheet processing to extract information from the performance data. The data is viewed as some form of relational database and can be selectively accessed via queries [105]. "What if" questions can also be supported to enable prediction [41]. SIEVE is a program debugging environment [41] that treats the trace data as a temporal, relational database; its structure allows the user to access the source code structure and event histories as a relational database.

In debugger-based analysis, an interactive debugger supports performance analysis by coupling source-code-level information with performance visualizations. Performance is then more closely related to the programmer's viewpoint, and visualizations may be dependent on the programming model. MPPE (MasPar Programming Environment) is an integrated graphical debugger, performance profiler, and visualizer, with a client-server architecture for remote debugging [23, 70]. It allows source-level debugging and static analysis of parallel programs on the MasPar family of data-parallel computers. PRISM is another example of a source-code-level debugger and graphical performance analysis tool, used with the CM-5. It uses regular field visualization techniques to examine large arrays [23].

Heuristic-based analysis may be done automatically or interactively (with user input) based on static and/or dynamic information. It is applied, for example, to optimize the mapping of the program or the restructuring of code (for parallelization). These methods are also used to detect non-determinacy due to the asynchronous nature of the processes, and race conditions due to shared variables in parallel programs. Callahan and Subhlok describe the PTOOL tool [16] for detecting and analyzing asynchronous parallel loops in a program.

Statistical analysis involves calculations on the basic metrics to derive (single-value) characterizations such as the mean, median, skew, standard deviation, maxima, minima, frequency distribution, etc. Statistical calculations are common in many tools; tools such as ParaGraph [50] and Seeplex [28] display statistical information in addition to the graphical information.

Finally, textual analysis is the brute-force inspection of the program or trace data. This method has to be used when the performance visualization tool does not adequately provide the information that a programmer needs, either to debug the program or to understand its behavior and performance bottlenecks. Various tools, such as ParaGraph, recognize this fact and provide a means to browse through the trace data.

Thus, parallel program performance analysis in support of verification, optimization, and prediction is an active area that has not yet settled upon definitive methodologies, especially for large-scale systems. It will benefit from continued basic and experimental research, which is the motivation for this work.
2.2.2 Advanced Methods

Performance visualization and analysis methods and tools that try to overcome the limitations of the conventional tools are categorized here as advanced methods. These methods usually build their approach on well-known, rigorous analysis methodologies from disciplines such as statistical analysis, numerical analysis, and signal processing. Some tools that do not use a rigorous approach, but whose approach nevertheless alleviates certain limitations of the conventional tools described in section 2.2.1, are also discussed in this section. Advanced methods of analysis include: pattern analysis methods, data transformation methods, numerical methods, automatic analysis, hierarchical analysis methods, and auralization. These methods, and the tools that use them, are discussed next.

Pattern analysis methods are perhaps among the first techniques that were explored upon recognizing the limited effectiveness of conventional methods. [89] suggests that the analysis of patterns of programs should be a potentially useful technique for performance analysis and debugging. Belvedere [51] is perhaps the only debugger that was designed explicitly to analyze patterns. Hough and Cuny [52] point out that the patterns generated by communication calls in a parallel program can help identify communication-related bugs in the program. "Perspective views" are supported by the Belvedere tool, which depicts abstract events in a space-time diagram (via a reordering transformation) so as to emphasize behavior patterns from the programmer's viewpoint [51].

Data transformation methods, applied to program trace data to visualize it from various perspectives, can also be considered advanced methods. The development of tools based on these methods gives the user (or analyst) the freedom to deal with any type of performance data. The user has to choose an appropriate transformation technique (according to the rules set by the tool) to select a particular dimension of the data for visualization. Pablo is a performance evaluation environment that uses this approach [86]. It defines a standard data description format (SDDF) for the representation of trace data, which makes it easier to share the performance data among multiple tools. Pablo uses a data-flow model of visualization in which the user manipulates and interconnects various data transformation modules in the form of an acyclic graph to transform the data and display it. The analysis methods discussed in this thesis are also based on transformation of the performance data. However, we model the performance data explicitly as a matrix and then apply transformations to this matrix in a rigorous fashion. In addition to transforming the data to lower-dimensional data, we transform this lower-dimensional data into other forms for analysis.

Numerical methods have been applied to performance data to obtain information about specific aspects of the program. Such methods are handy when the performance data is already in a form on which such rigorous analysis can be carried out. [17] has shown the analysis of time-series plots of processor utilization by smoothing, rounding, and piecewise polynomial fitting techniques in order to identify the "phases" of a program and predict their scalability characteristics.

Automatic analysis techniques are concerned with providing program-specific information without explicit user input about the program. These techniques rely completely on low-level performance data to extract knowledge of the high-level abstractions of the program. They do not require explicit user input because the performance data that they use is already in a form that can be processed by standard data analysis methods often used in statistical analysis and signal processing. IPS-2 [73] supports the analysis of phases by applying smoothing, segmentation, and combining techniques to a curve representing a performance metric. However, in certain cases, this type of automatic analysis might not be as effective as analysis performed with some amount of user input [74].
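As an illustration of this family of techniques, the following minimal sketch (in Python/numpy; it is our own simplified version, and the actual algorithms of [17] and IPS-2 differ in detail) smooths a synthetic utilization signal and segments it into phases at level crossings:

    import numpy as np

    # Synthetic utilization signal with three "phases" (levels 0.9, 0.3, 0.8).
    rng = np.random.default_rng(1)
    signal = np.concatenate([np.full(100, 0.9), np.full(80, 0.3), np.full(120, 0.8)])
    signal += 0.05 * rng.standard_normal(signal.size)

    # Moving-average smoothing; edge padding avoids spurious end effects.
    window = 15
    padded = np.pad(signal, window // 2, mode="edge")
    smoothed = np.convolve(padded, np.ones(window) / window, mode="valid")

    # Two-level "rounding" followed by segmentation at the level crossings.
    high = smoothed > 0.55
    boundaries = np.flatnonzero(np.diff(high.astype(int))) + 1
    print(boundaries)   # phase changes detected near samples 100 and 180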
These techniques rely completely on low- level performance data to extract the knowledge of high-level abstractions of the program. These methods do not require explicit user input because the performance data that they use is already in a form that can be processed by standard data analysis methods that are often used in statistical analysis and signal processing. IPS-2 [73] supports analysis of phases by applying smoothing, segmentation, and combining techniques on a curve representing a performance metric. However, in certain cases, this type of automatic analysis might not be as effective as the analysis performed with some amount of user input [74]. Hierarchical analysis methods have been proposed to analyze the performance of programs executed on large-scale parallel systems [29]. Users can hierarchically navigate through views in a global context showing only the macroscopic information and can selectively see parts of it in more detail. Processors can be categorized according to their 43 behavior, and statistics corresponding to that category can be displayed. The execution visualization tool Seeplex [28] implements this type of categorical views and is scalable. Auralization, that is, representing the program behavior as sound, is a comparatively new addition to the performance visualization area. Program auralization can be considered as a complement to visualization using the sound capabilities, in addition to the high-quality graphics capabilities, that are available in today’s workstations. Auralization is also another example of transforming the performance data from its original form (or domain) to another form which contributes towards the efficacy of the analysis of the original problem. [39] uses a prototype sound tool to represent the message-passing behavior of the program which can be used with a conventional visualization tool to enhance the user’s understanding of the behavior of the program. Pablo [86] has also adopted the idea of sound in addition to graphics to represent the program activity. Advanced methods and tools have attempted to address the problems and limitations of conventional tools. Most of the tools mentioned in this section are still in various stages of development. While advanced methods have advantages over conventional tools, they are typically not easy to apply. Usually a user faces a very steep learning curve to use a new more sophisticated tool. Therefore, the user-base of these tools is expected to be smaller compared to that of conventional tools. This factor offsets the advantages of these tools to some extent, because the methods were developed to be used! The practical benefit of these methods and tools, therefore, remains essentially unknown unless there is wider usage and user feedback becomes available. 2.3 Application-Specific Methods Application-specific methods of performance analysis and debugging can be considered as a version of program monitoring that is free from any restrictions of generality of approach and reusability of displays. These methods can be considered as a reaction to the lack of information given by performance visualization methods and tools which provide the user with a direct understanding of a program. Application-specific 44 visualizations enable the user to create the displays that directly and most effectively show the behavior of the program. Performance visualization presents the behavior of a program in an indirect manner by representing various performance metrics graphically. 
The user must develop a link between the performance of the program and program-specific information such as program data accesses and data values. Application-specific methods, on the other hand, take a direct approach to this problem and use displays that represent a particular program and its data values. Figure 2-7 shows the process of creating an application-specific visualization with an application program. A programmer develops a display in addition to the program.

[Figure 2-7. Application-specific methods for performance analysis: the programmer develops both the program and an application-specific display; the program executes on a platform with graphics support, producing its results and driving the visualization.]

The display is dynamically updated by using the available graphics libraries of the system where the actual program is executed. Both the program code and the visualization code are executed by the same system, with the display code using the graphics library support. This process generates the actual program output as well as updating the visualization display. Therefore, the visualization is a part of the actual program and can represent the program data.

We have categorized application-specific visualization methods into two types: algorithm animation and application-domain visualization. The only difference lies in the way various tools have used them, as discussed in the following.

2.3.1 Animation

Animation has a special significance in computer graphics and visualization. Animation is used to show the passage of time by updating a display according to the change during an incremental instant of time. Animation has been used by generic performance visualization tools (discussed in section 2.2) in addition to application-specific tools. For example, ParaGraph uses animation displays to show the state of the processors and their message-passing activities. ParaGraph also uses animation in application-specific displays that must be programmed by the user, such as to present a program data object and its values. However, these animations (known as algorithm animations) are not re-usable, i.e., they have special features that are appropriate for a particular program only and might not be usable with any other. Zeus is an algorithm animation system that uses multiple views to animate various aspects of a particular algorithm [13]. Commonly used algorithms such as Quick Sort, multilevel adaptive hashing, topological sweeplines, and others have been shown effectively by using animation and sound. Zeus has been used for performance analysis and algorithm tuning purposes.

2.3.2 Application-Domain Visualizations

Application-domain visualizations are also program-dependent and not re-usable. In contrast to algorithm animation, these visualizations are used not for performance analysis but exclusively for debugging and correctness checking of a program. These displays are developed along with the application program to be visualized; therefore, they represent the application domain directly, in addition to its semantics and fundamental behavior. POLKA (Parallel program-focused Object-oriented Low Key Animation) is an application-specific visualization system [109]. It uses high-level graphical objects and motion primitives. A POLKA animator object is instantiated in the program to develop the animation. Although application-specific visualizations have been used successfully by various researchers, their development remains an obscure art.
Once developed, these visualizations are easy for a user to use, but it is hard to develop a good application-specific visualization display that portrays the appropriate aspects of program activity. Moreover, a user must be good at writing the application program as well as developing a powerful visualization in order to benefit from this approach. The user feedback on these tools is even scarcer than that on advanced visualization methods and tools or analytical tools.

This concludes the survey of related work on the performance evaluation of parallel systems. It is clear that all three approaches discussed here have distinct advantages, yet none of them is free from constraints. We believe that approaches based on mathematically rigorous and conceptually precise foundations are more likely to be effective, but to be so, they must be implemented in a tool that caters to the needs and capabilities of its expected users. Starting in the next chapter, we introduce a methodology for modeling performance data and transforming it to other forms (or domains) where more effective analysis methods can be applied, and we discuss the analysis methods in these multiple domains.

Chapter 3 Performance Metrics and Trace Data Modeling

As discussed in Chapters 1 and 2, there are various techniques to evaluate the performance of a parallel program executed on a parallel computer system. Two types of techniques are analytical modeling and monitoring [89]. Analytical modeling is concerned with modeling the behavior of a program on a particular parallel architecture by developing various simplifying assumptions to devise a feasible model of the system and program. Program monitoring (hardware or software) is concerned with logging run-time data from the system or program. Program monitoring is considered a more effective technique than analytical modeling for complex systems because the actual behavior of the program can be observed and analyzed without any simplifying assumptions. Program monitoring is also used to verify the results obtained by analytical modeling techniques. Therefore, researchers working in diverse areas of parallel program development, modeling and analysis, and optimization have to rely on run-time trace data. In a practical situation, however, it might not be possible for these researchers to share the same trace data without appropriately converting it to a suitable structure. In spite of this, trace data is the only common ground for everyone working in the area of performance evaluation. Therefore, the analysis methods and tools developed in one area could be useful in others if a common trace data structure could be formulated. In this chapter, we specify the performance metrics and trace data with the ultimate objective of this type of standardization at a "functional level."

3.1 Performance Metrics and Preliminary Definitions

The performance analysis of a concurrent system is an intricate problem, at both quantitative and qualitative levels. It is difficult to decide upon a sufficiently generic set of objectives for analyzing the performance of a parallel program (or system) because the criteria of desirable performance might differ in different practical situations. A program can be evaluated in terms of its characteristics of accuracy, reliability, understandability, testability, expandability, design cost, running cost, etc. [33]. All of the above characteristics are non-numerical, descriptive variables.
Therefore, they are not suitable for analysis. Hence, performance analysis tools have adopted the convention of characterizing the performance of a program (or system) using various suitable numerical parameters. These parameters give some measure of specific aspects of the performance and are referred to as performance metrics. Performance metrics change their values with respect to two independent variables: space and time. These are the two basic variables that specify the interaction of a parallel program with a parallel system. All other events and performance metrics are defined in terms of these two independent variables. For all the definitions considered in this chapter and elsewhere in this thesis, we assume that there are P processors being used for executing a program that is traced for N clock cycles.

Definition (Time) Execution time of a parallel program is a discrete integer, denoted by n, such that n ∈ {0,..., N-1}, where n = 0 specifies the instant of time when tracing of the program began and n = N-1 specifies the time when tracing was stopped. ◊

The range of the time variable, n = 0, 1,..., N-1, defines the global time scale for the program executing on a given system. All other events, and hence the performance metrics, are defined with reference to this time scale. We will treat the time scale (or index) as a vector quantity, such that:

\mathbf{n} = [\,0 \;\; 1 \;\; \cdots \;\; N-1\,]^{T} \quad (3.1)

where n = 0 is arbitrarily chosen by the programmer to selectively start instrumenting a part of the whole code. The length of the time scale (N) is determined by the time taken to execute the instrumented portion of the code.

Definition (Space) Space of the program (or system) is specified by a discrete number, denoted by p, such that p ∈ {0,..., P-1}, where P is the total number of processing nodes allocated for the program, and where a node includes a processor, local memory, routers, and other components necessary for localized computation in a distributed system. ◊

The space of the system is defined by the logical processor number assigned to each of the processors. The events in a distributed system are identified in terms of the spatial locations of their occurrences, in addition to the global time-stamps associated with them.

Definition (Performance Metric) A performance metric is a discrete number specifying a certain aspect of the behavior of a program and/or its interaction with the system using numerical values. A performance metric will be denoted by m. ◊

All performance metrics are functions of the time and space variables defined above. There are several types of performance metrics, such as response time, processor utilization, processor state, operation count, computational load, execution rate (throughput), message count, message volume, communication time, communication overhead, I/O traffic, etc. [89]. Generally, these metrics are sensitive to the occurrence of specific types of events, usually identified by an event identifier. A trace data log contains some event-specific information in addition to the temporal and spatial location of an event. This information is used to update the values of affected metrics at a given temporal and spatial location. Performance metrics are useful to define the system states and the dynamic behavior of the system. Frequently, "system state" will be used to represent the value(s) of a performance metric at a particular instant of time.
Since the behavior of the system evolves over (execution) time, we need to define the dynamic behavior of the parallel system.

Definition (System State) The state of a parallel computer system while executing a program is a numerical representation of the activity of each of the processors of the system at a discrete instant of execution time, where the activities of a processor might include computation, communication, I/O, etc. The numerical values of system states can vary with time due to the occurrence of specific events. ◊

In the context of this work, the activities of a processor in the system will be restricted to two broad categories: (1) computation and (2) communication. A processor can be considered "busy computing" if it is not participating in any message-passing activity, according to the definition used by [45]. The dynamic (instantaneous) system state will be represented as a vector, called a state vector, defined in the following.

Definition (State Vector) A state vector, denoted by v(n), consists of P elements, denoted by v_0, v_1,..., v_{P-1}, that hold the states of the processors 0, 1,..., P-1, respectively, at each instant of program execution time n. ◊

The state vector is initialized to an appropriate value at the start of tracing, depending on the nature of the selected performance metric. This vector will play a key role in the transformation techniques presented in subsequent chapters. We can numerically denote the busy state of a processor by 1 and the idle state by 0. Whenever we need to reduce the state of the whole system at a particular time to a single numerical value, we will use the common reduction technique of determining the ratio of the processors that are busy computing to all the processors allocated for the program, utilizing the dynamic information about the state of the entire system held by v(n). In some cases (especially for spatial domain analysis, as discussed in Chapter 5), we will use cumulative system states as well, where we do not reduce the whole system state to a single numerical value, but instead reduce the state of each processor to the ratio of its busy time to the total execution time up to the observation instant.

Definition (Dynamic Behavior of the System) Dynamic behavior of a concurrent computer system during the execution of a program is represented by some performance metric that varies with time as well as with space. The observation of dynamic behavior generates time-series data. ◊

Dynamic behavior of the system will be characterized by system states at each instant of time (or with each occurrence of events of interest) during the execution of a program. We will need to reduce the state of the whole system to a single numerical value for the characterization of the dynamic behavior that we define and use in subsequent chapters.
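To make these definitions concrete, the following is a minimal NumPy sketch of a state vector and its busy-ratio reduction. It is an illustration only, not part of the thesis prototype; the event tuples, processor count, and function names are hypothetical.

```python
import numpy as np

P = 16                      # processors allocated to the program (assumed)
v = np.ones(P, dtype=int)   # state vector v(n): 1 = busy computing, 0 = idle

# Hypothetical event records: (time-stamp, processor, new state)
events = [(3, 2, 0), (5, 2, 1), (7, 9, 0)]

def apply_event(v, event):
    """Update the state vector on occurrence of an event of interest."""
    _, p, state = event
    v[p] = state
    return v

def system_utilization(v):
    """Reduce the whole-system state to one number: fraction of busy processors."""
    return v.sum() / len(v)

for e in events:
    v = apply_event(v, e)
    print(f"t={e[0]}: utilization = {system_utilization(v):.3f}")
```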
3.2 Execution Trace Data and Matrix Representation

Performance visualization and analysis tools use time and space as independent variables to depict the occurrences of various types of events during the execution of a program. Currently, there is no standard format for representing trace data. Every instrumentation and visualization tool sets its own data structure and semantics. If generalized performance analysis methods are proposed, then data semantics will have to be resolved to make them work. This fact points to the need for some generic data representation format that could be helpful for performance analysis at the functional level, i.e., the analysis methods must be independent of a particular tool's data semantics. This idea is known to have worked in several other disciplines and applications. For instance, in computer graphics, data is handled in terms of vectors and matrices to represent various geometric shapes and functions. However, it is not yet practical to have a common data logging and representation strategy for all instrumentation systems. So, a practical alternative is to have a preprocessing filter (a program) to convert the trace data to a format suitable for analysis. We propose a matrix representation as the general performance data structure. This convention has been adopted from statistical analysis techniques that use a "primary data matrix" to represent the data [26]. This primary matrix is transformed to other forms for better analysis. We have adopted the same terminology and create the primary matrix by preprocessing the "raw" trace data. The primary matrix is then transformed to other domains (or subspaces). Performance data in its raw form is ordered according to the time-stamps of events of interest. At each time-stamp, there might be several events occurring at various processors concurrently. Such events are distinguished by their processor number and event identifier. For the purpose of generality, we represent this data by a trace data matrix T, although we are not going to use this matrix for subsequent analysis.

Definition (Trace Data Matrix) The trace data matrix is represented by a block matrix given as:

T = \begin{bmatrix} E(n_0, p_i) \\ \vdots \\ E(n_{N-1}, p_i) \end{bmatrix} \quad (3.2)

where the E(n_0, p_i) are block matrices showing concurrent events at an observation instant n_0 and at processor p_i, where i ∈ {0, 1,..., P-1}, given by:

E = \begin{bmatrix} e_0 & n_0 & p_i & C_0 \\ e_1 & n_0 & p_i & C_1 \\ \vdots \\ e_k & n_0 & p_i & C_k \end{bmatrix} \quad (3.3)

where e_0, e_1,..., e_k are the event identifiers of k+1 concurrent events that occurred at k+1 (out of P) processors at some observation instant (time-stamp) n_0. Blocks C_0, C_1,..., C_k represent the event-specific information, where each of these blocks consists of one row and a number of columns determined by the event type having the maximum number of event-specific fields of information. Therefore, some event types will have redundant columns to make the matrix notation possible. ◊

The trace data matrix T includes event information obtained by instrumenting the program and ordering the records according to global time-stamps. It contains more information than would be necessary to carry out analysis with specific, well-defined objectives. Hence, preprocessing is done on this matrix.

3.3 Metric-Based Representation

A better and more useful representation of the trace data is in terms of a performance metric of interest. This is essential to reduce the dimensions of the data to facilitate its analysis and visualization. We represent this data by a matrix M, called the "primary metric-based performance data matrix."

Definition (Metric-Based Performance Data Matrix) The metric-based performance data matrix M ∈ R^{e(n)×3} is represented in block form as rows M(n_0, p_j), one row per event of interest, each consisting of the time-stamp of the event, the number of the processor where it occurred, and the updated value of the selected performance metric, where e(n) is the number of events of interest recorded up to time n. ◊

The transformations that map M to the analysis domains are developed in Chapter 4; the transformation to the spatial domain is defined as follows.

Definition (Transformation to Spatial Domain) Let T_1: R^{e(n)×3} → R^{P×2} transform M ∈ R^{e(n)×3} to the spatial domain at a specific observation time n_0, such that:

T_1(M(n,p))\big|_{n=n_0} = S(p) = [\,p_k \;\; m_k(n_0)\,] \quad (4.1)

for k = 0, 1,..., P-1, where P is the number of processors allocated for the program and e(n_0) is the value of the event counter at time n_0. ◊

Matrix S, of dimensions P × 2, represents the spatial domain.
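Before turning to the properties of S, a minimal sketch of T_1 is given below, assuming the three-column form of M described above (time-stamp, processor, metric value). The rows of M here are illustrative, not from an actual trace; processors without any event at or before n_0 are assumed to keep a metric value of zero.

```python
import numpy as np

# Primary metric-based matrix M: rows of (time-stamp, processor, metric value).
M = np.array([
    [0, 0, 1.0],
    [2, 1, 0.0],
    [5, 0, 0.0],
    [7, 1, 1.0],
])

def to_spatial_domain(M, n0, P):
    """T1: keep each processor's latest metric value at or before time n0,
    producing the P x 2 spatial-domain matrix S = [p_k, m_k(n0)]."""
    S = np.zeros((P, 2))
    S[:, 0] = np.arange(P)          # first column: logical processor numbers
    for t, p, m in M[M[:, 0] <= n0]:
        S[int(p), 1] = m            # later rows overwrite earlier ones
    return S

print(to_spatial_domain(M, n0=5, P=2))
```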
P is typically much smaller than e(n_0), so the size of the data set is smaller. Also, P is a constant over time, so the size of the data set in the spatial domain also remains constant over time. Matrix S represents the (logical) processor numbers in the first column and the states of the corresponding processors in the second column, at the observation instant n_0. The states of all the processors at time n_0 can be obtained from the state vector v(n), which is updated for all n ≤ n_0, in order to make this transformation work in practice. Matrix S does not take into account any physical or logical interactions that might be present among the processors during the program execution. In order to account for the spatial structure (physical or logical) of the system, we transform the matrix S into a digital image called a performance image. This transformation is performed by a linear operation commonly known in the visualization discipline as an image rendering operation.

Definition (Performance Image) A performance image is a representation of the state of all the processors in the system in a two-dimensional plane, at an observation time n_0. It is a digital image obtained by a rendering transformation applied to S to form a bound matrix G of dimensions r × c, where the total number of processors P = r × c. The rendering transform is defined as:

S(p)\big|_{n_0} \to G(n_0) \quad (4.2)

where the elements of G are color-table indices such that:

G(n_0) = \{\, g_{ij} \mid g_{ij} \in R,\; g_{ij} = \alpha\, m_k(n_0) \,\} \quad (4.3)

where i = 1, 2,..., r; j = 1, 2,..., c; k = 0, 1, 2,..., P-1; and α is the ratio of the ranges of the available color index and the performance metric used for representing system state, specified by:

\alpha = \frac{g_{max} - g_{min}}{m_{max} - m_{min}} \quad (4.4)

Determination of the spatial location (i, j) in the performance image corresponding to the metric value of a particular processor k at the observation time n_0, given by m_k(n_0), depends on the type of image rendering used (e.g., natural or Gray-coded ordering of processors). The parameters r and c can be defined as r = 2^{(d+1)/2} and c = 2^{(d-1)/2} under the constraint that P = 2^d. ◊

In the case studies presented in Chapter 6, we have used natural ordering of the processors, as it depicted an aspect of the logical structure of the programs (data distributions). Various rendering paradigms can be selected by the user to suit the needs of a particular problem. The rendering operation is perhaps the most important link between the spatial image characterization and the logical structure of the program.
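A minimal NumPy sketch of the rendering operation of equations (4.2) through (4.4) follows, assuming natural (row-major) ordering of processors; the handling of the metric offset is one reasonable reading of the scaling in (4.3), and the sample metric values are hypothetical.

```python
import numpy as np

def render_performance_image(S, r, c, g_min=0, g_max=255):
    """Render the spatial-domain matrix S (columns: processor, metric value)
    into an r x c performance image of color-table indices."""
    m = S[:, 1]
    m_min, m_max = m.min(), m.max()
    if m_max == m_min:                          # degenerate metric range
        return np.full((r, c), g_min)
    alpha = (g_max - g_min) / (m_max - m_min)   # equation (4.4)
    g = g_min + alpha * (m - m_min)             # scale metric to color indices
    return g.reshape(r, c)                      # natural ordering: p = i*c + j

# Example: P = 2^3 processors, r = 4, c = 2, cumulative busy-time ratios
S = np.column_stack([np.arange(8), [0.9, 0.4, 0.7, 0.1, 0.8, 0.5, 0.3, 0.6]])
G = render_performance_image(S, r=4, c=2)
```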
Theorem 4.1 A color in the performance image can be used to uniquely represent the numerical value of a performance metric if and only if the color table has an infinite length.

Proof (if part): According to the definition of a performance image, G(n_0) is obtained by a linear mapping from the spatial domain matrix S(p) evaluated at n_0. The spatial domain matrix S(p) at time n_0 consists of P rows, where P is the number of processors in the system. There are two columns in the matrix. The first column lists the logical number p of the processor, where p ∈ {0, 1,..., P-1}. The second column contains the value of an arbitrarily selected metric m at the processor whose logical number appears in the first column of the same row, at time n_0, where n_0 ∈ [0, n_{N-1}]. Since the color table is indexed depending on the dynamic range of the performance metric (and the dynamic range of the color table itself), a performance metric value will be mapped to a particular color, as indicated by the definition of a performance image. If the color table has an infinite number of entries, it can represent each value of the metric (a real number) by a unique color.

Proof (only if part): Suppose that the color table does not have an infinite number of entries, i.e., (g_{max} - g_{min}) < ∞, where each color table index g ∈ Z. Suppose that g_1 and g_2 are two consecutive color table indices. Let m_1 and m_2 be two metric values which, when mapped to a performance image according to the transformation specified by equation (4.2), get mapped to a continuous interval within [g_1, g_2]. Then, depending on the rounding technique being used (truncation, rounding up, rounding down, etc.), both m_1 and m_2 will get mapped to either g_1 or g_2. Hence, the corresponding color in the performance image will not represent the value of the metric uniquely if the length of the color table is not infinite. QED. ■

Corollary If the color table does not have an infinite number of entries, then a range r of the metric values will be mapped to one particular color. In the limiting case, when r → 0, it is possible to approximate the value of the metric by a color precisely (but not uniquely).

In a practical case, the color table cannot have infinite length. However, if it is large enough compared to the dynamic range of a metric, it can represent the values of a metric to a fair degree of accuracy. The definition of performance image rendering works well with concurrent systems using hypercube and mesh interconnection topologies. Other types of renderings can be used for other topologies or to emphasize specific aspects of the logical interactions among the processors. The performance image was defined for a generic performance metric m, which is a function of the time and space variables. To be specific, if we suppose that m represents the cumulative busy time of a processor at a given observation instant n_0, then it can be defined as:

m_p(n_0) = \frac{1}{n_0} \left\{ n_0 - \sum_{k} \Delta n_k(p) \right\} \quad (4.5)

for p ∈ {0, 1,..., P-1}, where Δn_k(p) is the time taken by the k-th message-passing event at the p-th processor and the sum runs over the message-passing events up to n_0. There are other metrics that can be used for generating performance images, such as instantaneous processor states (at an observation time), cumulative communication volume and count, etc. Transformation to the spatial domain converts the system state "snapshot" at a certain time to the performance image (matrix). Several image processing and analysis techniques can be used to glean more (high-level) information regarding the system state and program behavior, as well as their variations over time. These techniques will be discussed in a subsequent chapter when we present the analysis techniques within specific domains.

4.3 Temporal Analysis Domain

Time is another important dimension of parallel program characterization, in addition to space. This dimension manifests itself differently at different levels of abstraction. For a programmer, the wall-clock execution time of a program is a means to characterize its performance. For a system analyst, the response time experienced by users during the execution of their jobs might help evaluate the efficiency of system software. Within a parallel program, the time taken by a processor to send or receive a message is associated with performance.
Clock pulses from the local clock of a processor provide some notion of time. Therefore, regardless of the level of abstraction, the performance of a parallel system cannot be characterized without developing a time scale to temporally distinguish among the various events during the execution of a parallel program. The temporal dimension reflects the random nature of the occurrences of events during the execution of a parallel program. Although all the processors in a distributed system may have independent clocks, for our purposes we leave the issues related to synchronization and determination of global time-stamps to the instrumentation system. This is a classical problem in distributed processing, and instrumentation systems resolve this issue according to [51] to keep "causality" intact among the message-passing events. Therefore, observations of a performance metric at equally spaced intervals of time can be obtained from the trace data matrix M by transforming it from the performance data domain to the temporal domain. The transformation to the temporal domain is accomplished by eliminating the spatial dimension from matrix M. The elimination of the spatial dimension results in a matrix with two columns: the first column represents the time indices of events of interest, which are not equally spaced, and the second column represents a time-series. We refer to this transformation as an event domain transformation, as the resulting two-column matrix indicates the times of occurrence of the events that cause changes in the value of the performance metric. The event domain will be discussed in more detail in the next section; matrix M is transformed to the event domain here as an intermediate step in the transformation to the temporal domain. This transformation is accomplished with the help of the state vector v(n), as in the case of the spatial domain transformation.

Definition (Transformation to Event Domain) Let T_2: R^{e(n)×3} → R^{e(n)×2} be the event domain transformation, such that:

T_2(M(n,p)) = M_e(n) = \begin{bmatrix} n_0 & x(n_0) \\ n_1 & x(n_1) \\ \vdots \\ n_{N-1} & x(n_{N-1}) \end{bmatrix} \quad (4.6)

where x(n) represents the values of the performance metric m(n,p) obtained from the state vector v(n) after the elimination of the spatial dimension by an appropriate reduction technique. ◊

The elimination of the spatial dimension using reduction depends mainly on the nature of the performance metric being traced by the state vector v(n) as matrix M is processed. For instance, if x(n_i) represents the fraction of the system being utilized (busy) at an instant of time n_i, then it can be specified by:

x(n_i) = \frac{1}{P} [\,1 \;\; 1 \;\; \cdots \;\; 1\,] \begin{bmatrix} v_0 \\ v_1 \\ \vdots \\ v_{P-1} \end{bmatrix} = \frac{1}{P} \sum_{k=0}^{P-1} v_k(n_i) \quad (4.7)

for i = 0, 1,..., N-1. This equation holds when the metric m(n,p) is used to represent the status of processor p at time n by the state vector v(n), where an element v_k = 1 means that the processor is busy and v_k = 0 means that the processor is idle, for k = 0, 1,..., P-1. The sequence {x(n)} constitutes a time-series whose indices n are event numbers (and not time). These event numbers correspond to the time-stamps given by the first column of matrix M_e(n). These time-stamps are not equally spaced, as they correspond to the times of occurrence of specific events. The size of the dataset has been reduced from an e(n) × 3 dimensional matrix in the performance data domain to an e(n) × 2 dimensional matrix in the event domain, or to e(n) dimensions if only the event domain time-series x(n) is considered.
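To make the reduction of equations (4.6) and (4.7) concrete, here is a minimal NumPy sketch of T_2 for the busy/idle state metric. The trace rows are illustrative, not from an actual program, and the assumption that all processors start busy is ours.

```python
import numpy as np

def to_event_domain(M, P):
    """T2: replay the rows of M (time-stamp, processor, busy/idle state),
    maintaining the state vector v(n) and reducing it to the busy fraction
    x(n) after each event."""
    v = np.ones(P)                      # assume all processors start busy
    rows = []
    for t, p, state in M:
        v[int(p)] = state               # update state vector on this event
        rows.append((t, v.sum() / P))   # reduction: fraction of busy processors
    return np.array(rows)               # columns: time-stamp, x(n)

# Illustrative trace rows (time-stamp, processor, new state)
M = np.array([[2, 1, 0], [5, 1, 1], [6, 0, 0], [9, 0, 1]])
Me = to_event_domain(M, P=2)
```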
Although the size of x(n) increases over time (i.e., it is time-dependent), the transformation process has resulted in a mathematical form that is more convenient and powerful for analysis, i.e., a time-series. We will go one step further and model this time-series as a discrete-time stochastic signal, as it conforms to the definition of a discrete-time stochastic signal: a physical quantity that varies with time, space, or any other independent variable, whose past, present, and future values cannot be determined precisely using a mathematical expression [84]. This signal will be termed an event domain performance signal. This model enhances our ability to process the time-series observations using several well-known digital signal processing (DSP) techniques that will be discussed in the next chapter. For the purpose of temporal domain analysis, the time scale n has to be equally spaced. Therefore, we transform the matrix M_e(n) from the event domain to the temporal domain, and the resulting time-series will be denoted by w(n) to distinguish it from the event domain representation x(n).

Definition (Transformation to Temporal Domain) Let T_3: R^{e(n)×2} → R^{n_N} be the temporal domain transformation, such that:

T_3(M_e(n)) = w(n) = \begin{bmatrix} A(n_0) \\ A(n_1) \\ \vdots \\ A(n_{N-1}) \end{bmatrix}_{n_N \times 1} \quad (4.8)

where the A(n_i) are block matrices given by:

A(n_i) = \begin{bmatrix} x(n_i) \\ x(n_i) \\ \vdots \\ x(n_i) \end{bmatrix}_{(n_{i+1} - n_i) \times 1} \quad (4.9)

for i = 0, 1,..., N-2, and A(n_{N-1}) = x(n_{N-1}). ◊

The sequence {w(n)} is a time-series and again will be treated as a discrete-time stochastic signal, called a temporal domain performance signal, so that we can analyze it using DSP techniques.

Definition (Performance Signal) An event domain or time domain performance signal is a discrete-time stochastic signal representing the values of a certain performance metric at discrete instants of time during the execution of an instrumented program. A performance signal will be denoted by w(n) when it is a time-series obtained by the temporal domain transformation and by x(n) when it is a time-series obtained by the event domain transformation. ◊

The transformation to the time domain via the event domain has certain practical advantages. As described in the previous chapter, the instrumentation technique relevant to this work is event-driven tracing, i.e., the data logged to the trace file depend on the occurrence of events of interest. This strategy reduces the amount of data to be logged and stored during the program execution (in some cases) compared with the sampling technique, where data are logged at equally spaced intervals regardless of the occurrence of events of interest. The time of event occurrence and the metric value are later used (as defined above) to transform the data to the temporal domain only when analysis is to be carried out.
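Under the same assumptions as the previous sketch, the holding-time expansion of equations (4.8) and (4.9) might look as follows; the sketch starts the temporal scale at the first recorded time-stamp, which is one possible convention.

```python
import numpy as np

def to_temporal_domain(Me):
    """T3: expand the event-domain matrix Me (time-stamp, x value) into an
    equally spaced time-series w(n) by holding each value x(n_i) for
    n_{i+1} - n_i samples."""
    stamps = Me[:, 0].astype(int)
    x = Me[:, 1]
    w = []
    for i in range(len(stamps) - 1):
        w.extend([x[i]] * (stamps[i + 1] - stamps[i]))  # holding-time window
    w.append(x[-1])                                     # A(n_{N-1}) = x(n_{N-1})
    return np.array(w)

# Using the event-domain matrix from the previous sketch
Me = np.array([[2, 0.5], [5, 1.0], [6, 0.5], [9, 1.0]])
w = to_temporal_domain(Me)   # len(w) = 9 - 2 + 1 = 8 samples
```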
Theorem 4.2 A performance signal can be used to represent the dynamic behavior of a program being executed on a parallel computer system.

Proof: The proof follows directly from the definitions of dynamic behavior and performance signal. The transformation to the performance signal through the event domain involves elimination of the spatial dimension using an appropriate reduction technique on the time-series data, which is spatially distributed. This gives rise to another time-series that represents the observations regarding the state of the whole system. By definition, this time-series represents the dynamic behavior of the program on the given parallel system. QED. ■

Various signal processing techniques are applied to the time-series w(n) for its analysis. These techniques are discussed in the next chapter.

4.4 Event Analysis Domain

Performance data are generated in response to the occurrence of specific types of events during the execution of an instrumented program. The occurrence of events and the consequent changes of system states can be analyzed more appropriately in what has been referred to as the event domain [1, 68]. The aspects of program and system behavior that are related to global state transitions are best explained when characterized without any dependence on the spatial dimension and with minimal dependence on the temporal dimension. The system states may include at least the busy and idle states of the system, and the event domain can characterize the transitions between these states. The transformation to the event domain, denoted by T_2, was defined in the previous section. The spatial domain has to be eliminated completely so that we can look at the system from a global perspective. The dependence on the temporal dimension is minimal. The first column of M_e(n) lists the time indices (time-stamps) corresponding to the occurrences of events that change the system state. The second column, denoted by x(n), is a discrete-time stochastic signal that characterizes the event domain. The only difference between the event domain representation x(n) and the temporal domain representation w(n) is that the event domain represents the state transitions and not the transition (or holding) times between the states, as shown by the temporal domain representation. We analytically formulate the difference between the two types of performance signals with the following theorem.

Theorem 4.3 An event domain performance signal is equivalent to a temporal domain performance signal except for the state holding times, which are eliminated in the event domain performance signal.

Proof: Consider an event domain performance signal x(n) obtained from the matrix M_e(n), defined as an intermediate step in the transformation to the temporal domain. Suppose that x(n) consists of N samples {0, 1, 2,..., N-1} that correspond to N time-stamps {n_0, n_1, n_2,..., n_{N-1}}, respectively, where events of interest occurred. Event and temporal domain performance signals can be obtained by working through the transformation from M_e(n) to x(n) and then to w(n).

[Figure 4-1. Differences between Event and Temporal Domain Performance Signals. The event domain signal x(n) plots the amplitudes a_0, a_1,..., a_{N-1} against consecutive sample indices 0, 1,..., N-1, while the temporal domain signal w(n) holds each amplitude a_i over the interval from n_i to n_{i+1}.]

Let δ(n) be the discrete-time unit impulse function and u(n) be the discrete-time unit step function. Then the temporal domain performance signal can be represented as:

w(n) = a_0\{\delta(n) + u(n-1) - u(n-n_1+1)\} + a_1\{\delta(n-n_1) + u(n-n_1-1) - u(n-n_2+1)\}
     + a_2\{\delta(n-n_2) + u(n-n_2-1) - u(n-n_3+1)\} + a_3\{\delta(n-n_3) + u(n-n_3-1) - u(n-n_4+1)\} + \cdots
     + a_{N-2}\{\delta(n-n_{N-2}) + u(n-n_{N-2}-1) - u(n-n_{N-1}+1)\} + a_{N-1}\,\delta(n-n_{N-1})

Collecting the impulse terms into x(n):

w(n) = x(n) + a_0\{u(n-1) - u(n-n_1+1)\} + a_1\{u(n-n_1-1) - u(n-n_2+1)\}
     + a_2\{u(n-n_2-1) - u(n-n_3+1)\} + a_3\{u(n-n_3-1) - u(n-n_4+1)\} + \cdots
     + a_{N-2}\{u(n-n_{N-2}-1) - u(n-n_{N-1}+1)\}

w(n) = x(n) + \sum_{i=0}^{N-2} a_i\{u(n-n_i-1) - u(n-n_{i+1}+1)\}

where n_0 = 0, as shown in Figure 4-1.
Therefore, we can express the temporal domain performance signal as:

w(n) = x(n) + \Pi(n) \quad (4.10)

where Π(n) is the collection of N-1 "windows" of state holding times, each of length n_{i+1} - n_i - 1 for i = 0, 1, 2,..., N-2. This is exactly what is missing from the event domain signal. These windows contribute to the expansion of the time scale in the case of the temporal domain performance signal, whereas the event domain performance signal consists of the "jumps" of this temporal signal. QED. ■

This theorem contributes to our understanding of the difference between temporal and event domain performance signals. This is useful information because the event domain performance signal corresponding to a given temporal domain performance signal consists of a set of samples that is orders of magnitude smaller. However, the dynamic behavior is still shown by the event domain signal even though the state holding times are missing.

Theorem 4.4 Temporal domain information can be recovered from the event domain matrix M_e(n).

Proof: The proof follows from the definition of the event domain transformation of the metric-based performance matrix M(n,p) to M_e(n). Let x(n_0) and x(n_0+1) be two consecutive samples of an event domain performance signal. In order to determine the relative distance between the two samples on a "true" time scale, we use n_0 and n_0+1 as indices into the first column of the matrix M_e(n) and determine the corresponding time-stamps. The size of the holding time window between these two samples is equal to the difference of these time-stamps minus one, with an amplitude equal to x(n_0). Therefore, all the temporal domain information has been recovered. Similarly, for all other event domain performance signal samples, we can use the same conversion procedure to retrieve the temporal information. Hence, the temporal domain performance signal can be recovered from the event domain matrix M_e(n). QED. ■

4.5 Frequency Analysis Domain

When the temporal and event domains are used to analyze the dynamic behavior of a program with the help of a number of signal processing techniques, it is only natural to encounter the frequency domain. Since we have clearly shown the difference between the temporal and event domains, we will mention only the temporal domain in this section (to be consistent with the notion of time and frequency domains found in the signal processing literature). The discussion applies equally to event domain signals. The time and frequency domains complement each other in the analysis of discrete-time signals. As we have characterized the temporal (and event) domain using discrete-time performance signals, we need to design various systems (filters) to process and analyze them. These systems are easier to design in the frequency domain under various circumstances. Therefore, the frequency domain is useful for the multiple-domain analysis methods that will be discussed in the next chapter. The idea of specifying the temporal behavior of a program in the frequency domain has already been used [1]. Besides the advantages in filter design, a frequency domain representation provides a "graphical signature" of the types and extent of variations present in the performance signal, known as the spectrum of the signal. A user might not be able to extract this information merely by looking at a performance signal. The notion of "frequency" for a performance signal needs to be clarified, as it is not a signal generated by a "physical" phenomenon that consists of a range of frequencies and is sampled at an appropriate frequency to convert it to a discrete-time signal. A performance signal is an inherently discrete-time signal, and the resolution of the event logs (time index) is determined by the resolution of the instrumentation system. This resolution might be equal to the time period of a local clock, or it might be several multiples of that time period. In general, if T_min is the minimum possible time between two successive (instrumented) events (in seconds) that can be registered by the instrumentation system, then the maximum frequency (in radians) of the resulting performance signal is given by:

\omega_{max} = \frac{2\pi}{T_{min}} \quad (4.11)

Definition (Transformation to Frequency Domain) Let T_4: R^{n_N} → C^{n_N} be a linear operator, such that:

T_4(w(n)) = W(K) = \sum_{n=0}^{n_N - 1} w(n)\, e^{-j \frac{2\pi}{n_N} nK}, \quad K = 0, 1,..., n_N - 1 \quad (4.12)

This linear operator, which converts the real time-series (performance signal) to a complex subspace of the same dimensions, is commonly known as the discrete Fourier transform (DFT). ◊

A fast Fourier transform (FFT) algorithm can be used to efficiently transform the performance signal to the frequency domain. The determination of the spectrum leads to various digital filtering techniques that selectively reject a band of frequencies to aid a particular type of analysis [71, 84].
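A minimal sketch of T_4 using NumPy's FFT follows; the periodic busy/idle pattern used as input is hypothetical and serves only to show how repetitive behavior appears as spectral peaks.

```python
import numpy as np

def performance_spectrum(w):
    """T4: transform a temporal-domain performance signal w(n) to the
    frequency domain via the DFT of equation (4.12), computed with an FFT.
    Returns the magnitude spectrum |W(K)|."""
    W = np.fft.fft(w)           # complex spectrum, same length as w
    return np.abs(W)

# Illustrative signal: utilization alternating between two phases
w = np.tile([1.0] * 6 + [0.25] * 2, 16)   # hypothetical periodic pattern
spectrum = performance_spectrum(w)
# Peaks away from K = 0 indicate repetitive (periodic) program behavior.
```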
The characterization of the four analysis domains and the transformations to those domains comprise the starting point of multiple-domain performance analysis. We mentioned the application domain in Chapter 2, which is of interest but difficult to incorporate in trace-based analysis. We briefly discuss the nature of this domain in the following section.

4.6 Visualization in the Application Domain

We have formalized the analysis domains that are of interest in the context of the multiple-domain analysis approach. However, it is interesting to discuss another domain that is useful for presenting the dynamic behavior of a program on a particular system. The application domain is an abstraction rather than a finite-dimensional vector space per se. This domain is associated with the development of an application program and uses that information to depict the behavior of the program [109]. It does not rely on the trace data. Instead, the programmer must develop a display that matches his/her mental view of the program. Then information about the dynamic behavior of the program is passed to the display directly from within the program, i.e., the application domain. Therefore, in order to use the application domain, the activities of program development and visualization display development are combined. There are at least two advantages that this technique can yield:

1. Visualization displays provide the application domain information to the user, and the user need not go through the additional process of linking the information from a generic display to the behavior of the application program.

2. This technique is more suitable for debugging parallel and distributed programs than for evaluating their performance.

Visualization in the application domain is closer to being an art, as the programmer relies on his/her imagination to create effective displays for visualization. The user does not need to rely on the observed performance data to glean information about the higher-level programming abstractions.
For instance, if a sorting algorithm is visualized, the application domain visualization should depict the program as it accesses different data values, according to the programmer's mental model of the computation [13, 109]. On the other hand, the trace-based methods that are the focus of this work require the user to link the observed performance that is displayed with the behavior of a program. A reason for considering the application domain here is to appreciate the fact that information from the application domain is always useful, regardless of the performance analysis or program debugging methodology being used. If a programmer has a priori knowledge about the behavior of a program, then the analysis methods can serve the purpose of either confirming or contradicting that knowledge. In either case, this information will be helpful in tuning the performance of the program. For example, we have applied a priori knowledge of the application domain in case studies to verify estimates of loop iteration times, which will be presented in the next chapter. However, it is not trivial, if at all useful, to define and characterize the application domain as rigorously as the other analysis domains. Therefore, this subject is not pursued any further in this thesis.

The definitions and characterizations of the multiple analysis domains in this chapter provide the basis for the analysis presented in the next chapter. These domains allow various aspects of program behavior and system performance to be analyzed where the analysis can be done most effectively. For instance, system states can be analyzed effectively in the spatial domain, and dynamic behavior can be analyzed in the temporal or event domain. Information about program behavior from one domain typically supplements the information from another domain. The analysis in multiple domains eventually contributes to the programmer's understanding of the global picture of a program's performance.

Chapter 5 Multiple-Domain Analysis Methods

The transformations to the various analysis domains presented in the previous chapter are the basis of the multiple-domain analysis approaches presented in this chapter. We focus on the presentation of specific analysis techniques that are applicable to the performance data in each of the analysis domains and on the information that a domain can yield about program behavior. It is important to point out that a particular analysis method is useful in the context of program performance analysis if it can extract useful information about the higher-level abstractions of the program's behavior on a parallel computer system. The analysis techniques presented in this chapter are commonly used in the digital signal and image processing and statistical (time-series) analysis disciplines. Once performance data are transformed into a particular form, in a particular domain, it is often tempting to apply the available techniques exhaustively. Although such analysis might be valid theoretically, its contribution towards a better understanding of program behavior might be minimal. Therefore, we restrict ourselves to the methods that are appropriate in this context. In fact, identifying a set of appropriate methods and the particular situations in which to apply them has only been partially done in this thesis. Nevertheless, this approach helps meet the objective of making the performance analysis of medium- and large-scale parallel systems more rigorous, scalable, and extensible.
Analysis methods in each of the spatial, temporal, and event domains will be presented separately, starting from the domain-specific representations developed in Chapter 4. As the frequency domain analysis mainly complements the temporal and event domain analyses, we limit our discussion of the frequency domain. After separately presenting the analysis methods and their usefulness in multiple domains, we will discuss their contributions towards the analysis of the behavior of a program. We then prepare a strategy to analyze a given program by multiple-domain analysis methods.

5.1 Analysis in the Spatial Domain

The spatial domain was characterized in Chapter 4 by performance images at particular observation times during the program execution. The analysis methods used in this domain have been derived from this characterization using typical image processing techniques. The objectives of analysis in the spatial domain can be listed as follows:

- support of visual analysis of states of the system by computational methods;
- exploitation of the graphical representation of the system states by associating the analysis with the display;
- representation of data distribution in the context of a programming paradigm, such as SPMD (single-program, multiple-data);
- representation of data movements during the program; and
- detection of changes in the system states.

These objectives will be accomplished by applying appropriate image processing techniques to the performance image matrix, denoted by G(n_0) at an observation time n_0 during the execution of the program. These techniques include analysis of image statistics, image subtraction, thresholding, and various image enhancement operations. These methods and their application to the analysis of the spatial behavior of a program are explained in this section.

5.1.1 Representation of Macroscopic States

The macroscopic state of the system at a particular time during the execution of a program contributes towards an understanding of the overall behavior of the program. The following theorem states that the spatial domain characterization using performance images is a valid representation of the macroscopic system state at a particular instant of time.

Theorem 5.1 The performance image at a certain instant of time, represented by the matrix G(n_0), describes the state of the system in a two-dimensional grid where individual processor states are mapped according to the rendering operation used for this purpose.

Proof: Let G(n_0) be the performance image matrix at time n_0, which can be used to represent the corresponding performance image (proved by the following argument). As mentioned in Chapter 4, if the number of processors P = rc, then the image matrix is given by:

G(n_0) = \begin{bmatrix} g_{11} & g_{12} & \cdots & g_{1c} \\ g_{21} & g_{22} & \cdots & g_{2c} \\ \vdots \\ g_{r1} & g_{r2} & \cdots & g_{rc} \end{bmatrix} \quad (5.1)

[31] lists three conditions that must be fulfilled by a matrix to be a bound matrix, which can be transformed to a digital image. These conditions are:

1. g_{ij} ∈ R (the set of real numbers);
2. 1 ≤ i ≤ r, 1 ≤ j ≤ c; and
3. r, c ∈ Z (the set of integers).

Since G(n_0) fulfills these conditions, it is a bound matrix and can be represented as a performance image. Each of its elements g_{ij} has the following properties:

- g is an index into a color table corresponding to the value of a performance metric that has been transformed to the spatial domain by the spatial domain transformation (defined in Chapter 4), and
- the pair (i, j) specifies the spatial location of a processor in the two-dimensional performance image, as defined by the rendering transform (in Chapter 4).
When this matrix is transformed to a digital image, it follows from the definition of the system state that it represents the (macroscopic) state of the system at time n_0. QED. ■

A macroscopic display is the first step in a top-down hierarchical approach to performance analysis. This is a particularly useful approach for a large-scale parallel system having several thousands of processors. There are similar "matrix displays" in conventional analysis tools, but they are not adequately scalable. Performance images work in this case due to the image matrix structure defined in Chapter 4. Depending on the structure of a program, the representation of macroscopic states provides the following information to the user:

- Performance images can represent the data distribution of a data parallel program. This type of programming paradigm is especially suitable for computationally intensive numerical and signal processing problems. Such computations inherently use matrices to represent the data. The program distributes the data (usually elements or blocks of the data matrix) to various processors. All processors execute the same instructions on their local data values. If the manner in which the program data were distributed among the processors is captured during the rendering operation, then the pixels of the performance image represent this logical information about the data distribution.
- A series of performance images taken over the execution time of a program can represent the spatial movement of the computation. Even in the case of a well-structured program, there might be several phases of the program that are data-dependent and have to be executed only by specific processors. In various computation-intensive problems, the direction of movement of the computation can be identified. A series of performance images, for example, can show the movement of utilization "hot spots."
- When a parallel program has little, if any, "regular" structure, performance images might still be useful for understanding its behavior. However, in this case, there is no "coupling property" between a performance image and the computation (such as the data distribution or movements). The physical structure depicted by the performance images will have to be linked to the program behavior "manually" by the user. In this case, patterns in one or a series of performance images should not be regarded as representative of the logical structure of the program, because this might prove misleading.

5.1.2 Change Detection via Image Subtraction

Image subtraction is a simple yet powerful technique (in a carefully controlled imaging situation) to compare two complicated images. Two images are aligned, and the arithmetic operation of subtraction is performed on their bound matrices. The difference image is then enhanced to detect changes in a dynamic situation or to compare an image with a template. Image subtraction is used for imaging of blood vessels using X-rays; detection of missing components from a circuit board; automated inspection of printed circuit boards; security monitoring based on motion detection; monitoring growth patterns of urban areas; target detection from radar images; weather prediction from satellite pictures [57]; and so on. We have applied image subtraction to performance images for two purposes:

1. detection of changes in system states over time, and
2. matching spatial patterns of two performance images.

These techniques are explained in the following sections.
5.1.2.1 Change Detection

Changes in system states between two discrete instants of execution time of a program can be detected by subtracting the performance images at those instants and enhancing the difference image. If G(n_a) is a performance image matrix representing the macroscopic state of the system at time n_a, and G(n_b) is another image matrix (of the same dimensions) representing the system state at time n_b, such that n_b > n_a, then their difference image is given by:

\Delta(n_a, n_b) = G(n_b) - G(n_a) \quad (5.2)

where each of the elements of the difference image matrix Δ(n_a, n_b) is the difference of the corresponding elements of G(n_a) and G(n_b). The range of values of the elements δ_{ij} of the difference image matrix is given by -g_{max} ≤ δ_{ij} ≤ g_{max}, where g_{max} is the maximum value of the available color table index. Usually, the range of the elements is "stretched" by mapping it to the full range of the available color table index, as the two images being subtracted might be quite similar. This enhancement operation facilitates the visual detection of the changes between the two performance images. Detection of changes between dynamic macroscopic states of the system at two different time instants has been referred to as analysis by small multiples in the literature on graphical representation of information [116, 117]. A series of performance images can be taken over the execution time of a program, and two images can be subtracted to show the spatial difference patterns of a metric from one time to the next. When the program has a regular structure, the difference of spatial patterns shows the "flow" of computation among the processors. On the other hand, if the problem is not well-structured, a user has to exert more effort to correlate the dynamics of the spatial patterns with the program behavior.

5.1.2.2 Spatial Pattern Matching

Image subtraction can be used to match a particular image against another image having known patterns in it (i.e., a template) to compare the two. This technique is useful when the performance image has intricate patterns that are hard to detect visually. A template of known "correct" spatial patterns can be subtracted from a performance image to determine the degree of mismatch between the two. If the digital (performance) image of dimensions r × c is represented by G_1(i,j) and a template performance image is represented by G_2(i,j), then the mismatch energy provides a quantitative measure of the differences between the two, and is defined by [57] as:

\sigma^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} [G_1(i,j) - G_2(i,j)]^2 \quad (5.3)

The mismatch energy is minimum when the performance image and the template match closely. This technique is useful for both performance analysis and program debugging. If the correct behavior of a program on a system is already known and can be characterized as a performance image at a logically appropriate instant of time, then that image can be used as a template. The performance image from another version of the same program at the same instant of time should closely match the template. If there is any marked difference, it will help spatially locate the anomaly, which can then be traced further in the program. This process is particularly useful for MPP systems, where visual or textual analysis of any such anomaly is not very effective.
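A minimal NumPy sketch of equations (5.2) and (5.3) follows; the contrast-stretching step is one plausible reading of the enhancement described above, and the example images are synthetic, not actual traces.

```python
import numpy as np

def difference_image(G_a, G_b, g_max=255):
    """Equation (5.2) with contrast stretching: subtract two performance
    images and map the result onto the full color-table range [0, g_max]."""
    delta = G_b.astype(float) - G_a.astype(float)
    lo, hi = delta.min(), delta.max()
    if hi == lo:
        return np.zeros_like(delta)
    return (delta - lo) * g_max / (hi - lo)

def mismatch_energy(G1, G2):
    """Equation (5.3): sum of squared pixel differences; zero for a perfect
    match between a performance image and a template."""
    return float(np.sum((G1.astype(float) - G2.astype(float)) ** 2))

# Illustrative 4 x 4 performance images (color indices)
G_a = np.random.randint(0, 256, (4, 4))
G_b = np.roll(G_a, 1, axis=1)       # simulate computation moving one column
print(mismatch_energy(G_a, G_b))
```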
5.1.3 Thresholding

A thresholding transformation is applied to digital images to identify various features in the image. Since a performance image shows the spatial distribution of a performance metric, the thresholding technique can segment this image to emphasize the areas of high and low computation or communication activity (depending on the metric). The thresholding transformation is a point operation that is applied to each element of the performance image matrix G(n_0) to transform it to G'(n_0). If Th represents a threshold function, then the thresholding transform is specified by:

Th: G(n_0) \to G'(n_0) \quad \text{such that } g_{ij} \in G(n_0) \text{ and } g'_{ij} \in G'(n_0). \quad (5.4)

There are various ways to define thresholding. We will define the thresholding operator in two closely related ways, which provide different results.

5.1.3.1 Identification of Hot-Spots of Activity

By selecting a suitable threshold value, the higher values of the color index can be isolated from the rest of the performance image. This operation is defined in terms of the performance image matrix elements, before and after thresholding, as:

g'_{ij} = \begin{cases} g_{ij} & \text{if } g_{ij} \geq th \\ g_{min} & \text{otherwise} \end{cases} \quad (5.5)

where th is the value of the threshold. The threshold need not be a single constant. It can be defined as a range of values such that th ∈ {th_{min},..., th_{max}}, where th_{min} is the specified minimum value of the threshold and th_{max} is the maximum value. The resulting image after thresholding will have only those color indices that are equal to or greater than the specified threshold value. All other spatial locations will have the minimum value of the color table index. This technique can be used to isolate those spatial locations where the value of the metric is beyond the specified threshold. Therefore, if the metric represents the computational load of each processor, then this technique can identify the "hot spots" of computation that might be a cause of an imbalance in the work-load distribution among the processors. On the other hand, if the metric is related to the message-passing actions performed by each processor, then this technique can isolate the hot spots of communication activity to or from a certain ensemble of processors. Such information might prove valuable for the programmer to optimize the program in such a way that the computation and/or communication load on all the processors is balanced.

5.1.3.2 Identification of Types of Change

The thresholding operation can be used in conjunction with image subtraction not only to locate the changes but also to find whether the values of the metric are increasing or decreasing with time at various spatial locations. Suppose that an image matrix with elements δ_{ij} contains a difference image between two performance images taken at two different instants of time. Then the thresholding transform can be defined as:

g'_{ij} = \begin{cases} g_{max} & \text{if } \delta_{ij} \geq th \\ g_{min} & \text{otherwise} \end{cases} \quad (5.6)

We will refer to this operation as binary thresholding, as there are only two color index values in the resulting image. This operation narrows the focus of the analysis to only two types of changes over time, for example, positive changes (i.e., the value of the performance metric increases) or negative changes (the value decreases) over time. If we take a series of performance images during the execution time, we can subtract the successive images and apply binary thresholding to each difference image. The resulting series of binary thresholded images will show the decreasing or increasing activity related to computation or communication at each processor. This technique is useful for finding the movement of data and/or work when the program has a regular structure that is manifested in the performance image.
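The two thresholding operators of equations (5.5) and (5.6) might be sketched as follows; the image matrices and threshold values are hypothetical.

```python
import numpy as np

def threshold_hotspots(G, th, g_min=0):
    """Equation (5.5): keep color indices at or above th, set the rest to
    g_min, isolating hot spots of activity in a performance image."""
    return np.where(G >= th, G, g_min)

def binary_threshold(delta, th, g_min=0, g_max=255):
    """Equation (5.6): map a difference image to two levels, separating
    increasing from decreasing metric values between two observation times."""
    return np.where(delta >= th, g_max, g_min)

# Illustrative use with synthetic image matrices
G_a = np.random.randint(0, 256, (4, 4))
G_b = np.random.randint(0, 256, (4, 4))
hot = threshold_hotspots(G_b, th=200)           # hot spots in the later image
increases = binary_threshold(G_b - G_a, th=0)   # g_max where the metric grew
```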
5.1.4 Image Enhancement

Image enhancement involves operations that improve the display of an image without distorting the features of the image. When image processing operations are being used for the purpose of performance analysis, it is natural to use the usual image enhancement capabilities of the image processing tool. Enhancement operations include sharpening of image features such as edges, boundaries, or contrast; noise reduction; filtering; interpolation; and magnification (commonly known as zooming). The inherent information content (determined by the original image matrix) is not increased by an image enhancement operation. These operations only enhance the efficacy of the visual analysis by a human user. The use of image processing techniques and tools for the purpose of analyzing program performance in the spatial domain provides some additional information by default, such as image statistics. Image statistics complement the analysis performed by the image processing techniques described above. Figure 5-1 summarizes the overall strategy of spatial domain analysis using image processing techniques. Performance data is modeled by the metric-based performance data matrix M, which is then transformed to the spatial domain. The spatial domain matrix undergoes a rendering transformation to an image matrix G, which can be represented as an image. Performance analysis methods have been divided into two categories: image statistics and image analysis.

[Figure 5-1. Application of image processing to performance analysis in the spatial domain. A metric-based matrix M(n, p) for an appropriate metric (computation, communication, input-output, tasks, etc.) undergoes the spatial domain transformation, then an image matrix transformation to a performance image representation; analysis proceeds via image statistics (histogram, maxima, minima, mean, etc.) and image analysis (change detection, thresholding, enhancement).]

5.2 Analysis in the Temporal Domain

Performance analysis in the temporal domain is based on the discrete-time stochastic signal (or time-series) representation of a performance metric. Various digital signal processing (DSP) techniques can be applied to analyze the patterns in the time-series to obtain high-level behavioral information about a program. The temporal domain representation of the performance data as a time-series, denoted by w(n) in Chapter 4, can be considered a realization of a stochastic process. In general, this process is non-stationary [17], and some technique of imposing stationarity has to be adopted for further statistical analysis. This is a future direction of this research work. Even without specifying a suitable statistical model for the underlying stochastic process, important information about the temporal behavior of the high-level abstractions of a program can be obtained by applying suitable DSP techniques to the time-series.

5.2.1 Analysis of Trends and Phases

A phase of a program has different meanings for a programmer and for "automatic" analysis of the time-series. Usually, a phase of a program is a collection of code that performs a specific logical task among a set of tasks required for the whole application. These phases occur sequentially during program execution.
From an analysis point of view, a phase has been defined as a part of the time-series that exhibits stationarity [17, 73], although the overall time-series is not stationary. Therefore, such phases might not be immediately discerned, due to the presence of a large amount of variance. The objective of analysis is to remove this "noise" that hides the presence of these phases. The time-series is "smoothed" using methods such as moving averages and least-squares polynomial fitting [17, 73]. The smoothing process results in the identification of stationary phases, which are periods of time during which the smoothed time-series equals its mean (i.e., variance is removed completely from a stationary segment of the time-series). This method works well for prediction of certain temporal properties of the program [17], but even the developers of this approach accept that the automatic identification of phases does not always work and that manual detection of the phases of a program might be more accurate in some cases [74].

Despite the limits of the automatic identification approach, the smoothing of the performance signal is still useful for analyzing trends. The reduction of variance from the signal gives a "top-level" macroscopic view of the behavior of the program. This suppresses the low-level details of the temporal changes of a metric but, on the other hand, provides the overall trends of that metric. Trends may be difficult to discern visually, especially in the case of complicated patterns resulting from a large amount of message-passing among processors. A user can always return to the original performance signal (or any other low-level view) to glean more "microscopic" information as needed.

For the purpose of analyzing the trends of a performance metric, we have adopted a smoothing operation that applies a digital moving averages filter to the performance signal. This technique is no different from that used in [17, 73]. The only difference is the application of a DSP approach, with the performance data transformed to a discrete-time signal. This approach is advantageous in the sense that the processing of large amounts of data is a practical problem in DSP and is solved either by resampling the data at a lower sampling rate (decimation) or by operating on a block of data at a time (windowing). Such techniques reduce the data to be processed at a time but preserve the requisite information, and are therefore useful for analyzing long-running programs. The digital moving averages filter can be specified by its difference equation as:

y(n) = \frac{1}{N} \sum_{k=0}^{N-1} w(n-k) \qquad (5.7)

where y(n) is the moving average at discrete-time n. Figure 5-2 shows a block diagram of the implementation of this filter. This is a finite impulse response (FIR), causal filter with N taps, all having the same weight (1/N).

Figure 5-2. Smoothing of the performance signals by a digital moving averages filter. (The block diagram shows w(n) passing through the moving averages filter H(z) to produce y(n).)

There are different strategies to determine the weights, such as giving the most recent observation the maximum weight and progressively lesser weights to the past observations, which imposes a "forgetting factor" on the series [5, 126, 127]. Since we do not claim to have adopted a particular model for the underlying stochastic process, we have empirically adopted uniform weights for all the lags.
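As a concrete illustration, the following Matlab fragment applies the N-tap filter of equation (5.7) to a performance signal. The signal w here is a synthetic stand-in (a slow trend plus noise), not trace data from the case studies.

    % Smoothing a performance signal with a moving averages filter.
    n = 0:999;
    w = 50 + 40*sin(2*pi*n/200) + 25*randn(size(n));  % hypothetical signal
    N = 100;                                          % filter length (taps)
    b = ones(1, N) / N;                               % uniform weights 1/N
    y = filter(b, 1, w);                              % equation (5.7)
    % y carries the macroscopic trend; w - y holds the suppressed detail.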
This filter can be specified by its system function H(z), which is determined by evaluating the z-transform of the input-output difference equation and is given by:

H(z) = \frac{Y(z)}{W(z)} = \frac{1}{N} \sum_{k=0}^{N-1} z^{-k} = \frac{1}{N} \cdot \frac{1 - z^{-N}}{1 - z^{-1}} \qquad (5.8)

where W(z) and Y(z) denote the z-transforms of w(n) and y(n), respectively. This filter is also known as a comb filter. The design parameters of the filter are obtained from this system function. The frequency response of this filter, shown in Figure 5-3, emphasizes the low-pass characteristics of this moving averages filter, which consequently smooths the signal by reducing the variations present in it. This low-pass filtering operation eliminates the excessive fluctuations that tend to hide the trends in the data. The order of the moving averages filter is arbitrarily chosen as 100 in the case shown in Figure 5-3. The removal of irregular patterns makes it possible for the user to visually appreciate the dominant patterns. The disadvantages of using a moving averages filter include the loss of data at the starting point and the effect of extreme values on the average value. However, the filter adequately satisfies our objective of representing the trends in the performance signal.

Figure 5-3. Impulse response and frequency response of a moving averages filter of length N = 100. (The upper panel plots the impulse response against the discrete samples n; the lower panel plots the frequency response, in dB, over the normalized frequency range.)
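The low-pass behavior claimed for this comb filter can be verified numerically. The sketch below, assuming the Signal Processing Toolbox used elsewhere in this work, evaluates the magnitude response of H(z) as plotted in Figure 5-3.

    % Magnitude response of the N-tap moving averages (comb) filter.
    N = 100;
    b = ones(1, N) / N;             % FIR coefficients from equation (5.8)
    [H, wf] = freqz(b, 1, 512);     % response at 512 points on [0, pi)
    magdB = 20 * log10(abs(H));     % magnitude in dB
    % magdB decays with frequency and has nulls at multiples of 2*pi/N,
    % confirming the low-pass comb characteristic.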
5.2.2 Analysis of Repetitive Patterns

Repetitive patterns are expected in a performance signal due to the presence of iterative code structures, such as loops, in a program. These repetitive patterns can be observed as similar segments of the signal repeating over time, in response to each iteration of the loop. This periodic behavior can be observed using any conventional time-series plot available in a performance visualization tool (e.g., a space-time diagram). This periodicity, however, is not automatically obvious in all cases, due to:

- Asynchronous execution in an SPMD programming paradigm, where some processors might not be working on the instructions within a loop.
- Excessive communication activity, which makes the plot too "dense" to identify the presence of any repetitive patterns.
- A large number of data samples in the display, which also makes it difficult to visually identify the presence of any repetitive patterns.

A widely used method to determine the repetitive (periodic) behavior of a stochastic signal (time-series) is to use an estimate of the autocorrelation of that signal [84]. The exact (statistical) autocorrelation is not trivial to determine for any stochastic system; therefore, temporal observations are used to estimate the autocorrelation. There are some underlying assumptions, such as stationarity and ergodicity, which have to be assumed for such calculations [82]. It has been shown by [17] that the time-series data obtained from the dynamic behavior of a program (the performance signal) consists of stationary phases (or can be made stationary using smoothing techniques). Since we are considering a very small segment of the whole signal that represents a part of a "phase" of the program, it is a fair assumption to consider that part of the signal as stationary. The fairness of the ergodicity assumption is ensured by using the unbiased estimate of the autocorrelation, i.e., the expected value of the estimate is equal to the statistical autocorrelation and the variance of the estimate asymptotically diminishes [84]. Such an estimate of autocorrelation can reveal periodicities in a stochastic signal, even when they are hidden within non-repetitive patterns. The unbiased estimate of the autocorrelation sequence of the performance signal x(n), consisting of M samples, is denoted by \gamma_{xx}(l) and is defined by:

\gamma_{xx}(l) = \frac{1}{M-l} \sum_{n=l}^{M-1} x(n)\, x(n-l), \quad l = 0, 1, \ldots, M-1 \qquad (5.9)

where the index l is the time-shift (lag) parameter. The amplitude of the autocorrelation sequence at each l is proportional to the extent to which x(n) and x(n-l) match at that point. If both signals match, a peak will be seen at that lag, emphasizing the periodicity of the signal. The distance between peaks can be used to determine the period (i.e., the time between two similar segments) of the signal. The amplitude of the peaks emphasizes the strong or weak periodicities in the signal. The estimate of autocorrelation attains its maximum value at l = 0, where the signal matches itself perfectly.

The unbiased estimate of autocorrelation is not free from constraints. As we will notice in the examples, the variance starts to increase at larger lags, i.e., when l → M. The reason is the scaling factor in the definition of the unbiased estimate of autocorrelation given above. Therefore, the estimate of autocorrelation at large lags might have amplitude greater than \gamma_{xx}(0) and thus will not be reliable. We will be working with the correlation coefficients only at smaller lags in the case studies of Chapter 6.

The shape, specifically the peaks, of the estimated autocorrelation signal can help identify the loops in a program. The regularity or irregularity of these peaks confirms the data-independent or data-dependent control constructs used for the loops. Nested loops will usually show smaller peaks within the dominant peaks. If this observed phenomenon can be related to the program code, then the behavior of the loops can be completely specified. An indicator sequence will be introduced to determine the temporal location of loop iterations from the program for the case studies in Chapter 6, so as to validate the information obtained from the estimate of autocorrelation. Even without the availability of source code, the shape of the autocorrelation can characterize intricate communication patterns in the time domain, such as those for broadcast, which are difficult to decipher visually from a time-series plot.
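The following Matlab sketch shows how the unbiased estimate of equation (5.9) exposes a loop period. The signal x is a hypothetical loop-like pattern with noise; with real data, a stationary segment of a performance signal would take its place.

    % Loop-period detection via the unbiased autocorrelation estimate.
    M = 512;
    x = repmat([ones(1,40) zeros(1,24)], 1, 8);  % hypothetical loop pattern
    x = x(1:M) + 0.3*randn(1, M);                % add measurement "noise"
    x = x - mean(x);                             % zero-mean segment
    [r, lags] = xcorr(x, 'unbiased');            % equation (5.9) estimate
    r = r(lags >= 0);                            % non-negative lags only
    % Peaks of r near lags 64, 128, ... indicate iterations of length 64;
    % only small lags (l << M) are trusted, as discussed above.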
5.2.3 Comparison of Temporal Patterns

Temporal patterns in performance signals obtained from different phases of the same program or from different implementations of the same program can be compared visually. This comparison is often part of the performance optimization process for a program, in which its performance needs to be assessed in contrast with alternative implementations. It is not always feasible or desirable to do this visually. A common technique that supports this is to estimate the cross-correlation of two signals, which emphasizes the degree of similarity between two different signals w(n) and y(n). This technique is the time-domain analog of the detection of changes (i.e., degree of dissimilarity) between two performance images in the spatial domain.

The estimate of the cross-correlation sequence is a generalized form of the estimate of the autocorrelation sequence presented in the last section. The cross-correlation of two performance signals w(n) (consisting of N samples) and y(n) (consisting of L samples, where N > L) is obtained by shifting y(n) one step at a time and comparing it with w(n) (over all n) for all lags l, to get the magnitude of the correlation coefficients at each lag. This operation can be represented mathematically by:

\gamma_{wy}(l) = \sum_{n=0}^{N-1} w(n)\, y(n-l), \quad l = 0, 1, \ldots, N-1. \qquad (5.10)

If the two signals are similar at a certain value of the lag index l, the magnitude of the cross-correlation sequence will have a peak at that lag. Again, the amplitude of the peaks is proportional to the extent of the similarities between the two signals. It is useful to make the two signals zero-mean processes (by subtracting the time-average of each signal from each of its samples), so that strong similarities (or dependencies) between the two show up as positive peaks and strong dissimilarities show up as negative peaks [62].

The estimate of cross-correlation can help evaluate the patterns of a program as "failed", "optimal", "desirable", etc., based on a priori knowledge about the expected patterns in a program. Therefore, the estimate of cross-correlation can be used in two ways to identify the patterns:

1. To compare the template signal y(n) with the performance signal w(n) and to identify the temporal locations where strong similarities (dependences) between the two occur. Depending on the magnitude of the estimate of cross-correlation at those points, the likelihood of occurrence of the template pattern at that time can be determined visually from the plot.

2. To compare and distinguish performance signals obtained from different phases of the same program or from different implementations of the same program.
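A minimal Matlab sketch of the template-matching use of equation (5.10) follows. The long signal w and the template y are illustrative stand-ins; in practice, y would be a pattern extracted from a known phase or implementation.

    % Locating a template pattern in a performance signal.
    w = [zeros(1,100) ones(1,30) zeros(1,200) ones(1,30) zeros(1,152)];
    y = ones(1, 30);                      % hypothetical template burst
    w = w - mean(w);                      % zero-mean, so matches appear
    y = y - mean(y);                      % as positive peaks
    [c, lags] = xcorr(w, y);              % cross-correlation estimate
    c = c(lags >= 0); lags = lags(lags >= 0);
    [cmax, k] = max(c);
    % lags(k) estimates the sample at which the template occurs in w.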
5.2.4 Thresholding

Thresholding was defined for analyzing performance images. Thresholding is a useful generic function that can also be applied to performance signals to isolate any patterns that depend on the amplitudes of the signal. A threshold is a point operator that was denoted by Th for performance images and can be defined for a temporal domain performance signal w(n) as:

Th(w(n)) = \begin{cases} 1, & \text{if } w(n) \ge th \\ 0, & \text{otherwise} \end{cases} \qquad (5.11)

where th is the numerical value of the threshold. The resulting signal is a binary signal that helps to identify the occurrences of events that are related to the amplitude of the signal (i.e., the magnitude of a performance metric) during the execution of a program. The use of binary thresholding was found useful for the analysis of particular types of temporal patterns in programs. However, the thresholding operation can be defined in any other suitable manner, depending on the requirements of the analysis.

A strategy for performance analysis in the temporal domain is represented by Figure 5-4, which shows two categories, as in the case of spatial domain analysis. The analysis steps starting from the performance data are similar to those shown in Figure 5-1 for the spatial domain analysis. Performance data are transformed to the temporal domain (through the event domain) to represent them as a discrete-time signal. Signal statistics can be obtained, but they are not as useful here as they were for performance images.

Figure 5-4. Application of signal processing to performance analysis in the temporal domain. (The diagram shows performance data modeled as the metric-based matrix M(n, p), transformed to the temporal domain and represented as a performance signal; signal statistics include the histogram, maxima, and minima, while signal analysis includes pattern analysis, smoothing, and thresholding.)

5.3 Analysis in the Event Domain

The event domain was also characterized by performance signals in Chapter 4. The event domain is primarily of interest for modeling the computation on a parallel or distributed system. The performance data (domain) consists of events that identify unique temporal and spatial states of the system [68]. However, the use of the event domain for developing models of program behavior is beyond the scope of this thesis, as it is a future direction of this work. We consider the present work a basis for comparison with the results of any subsequent modeling work. Nevertheless, event domain performance signals can be used to depict the dynamic behavior of the program, in addition to the temporal domain performance signals, with the "tolerance" that was precisely specified by Theorem 4.3 in Chapter 4. State transition information can be used to develop Markov models for the program and system behavior, which were discussed in Chapter 2. We will conclude this section with an overview of the Markov modeling approaches that can be used to analyze the program and system states in the event domain.

5.3.1 Representation of State Transitions

The event domain representation can be used to analyze the sequence of state changes during a program. Various system states are identified by an appropriate choice of performance metric. For instance, if the event domain performance signal x(n) shows the system utilization values, then it can represent "full utilization" by its maximum value (which is 1.0, as system utilization is a ratio), "no utilization" by its minimum value, etc. Only the transitions among various states are depicted, without representing the transition-time between any pair of states. Representation of state transitions is useful for understanding the operation of the system while it executes a particular phase of the program. For each phase of the program, the system goes through a particular sequence of state transitions due to the various activities within that phase. An event domain performance signal represents this pattern with microscopic details, and its analysis can be useful for understanding the higher-level program abstractions.

5.3.2 Analysis of Trends and Patterns

An event domain signal x(n) is a discrete-time (performance) signal, which is mathematically similar to the temporal domain performance signal w(n). The only difference is that it does not contain information about the holding times of the states. Therefore, processing of this signal permits the analysis of higher-level abstractions, as in the case of the temporal domain performance signal. We can use the same analysis methods that were described for temporal domain analysis, provided we do not lose sight of the difference between x(n) and w(n).

A performance signal in the event domain can be smoothed to assess trends. In this case, smoothing is more important due to the absence of the transition times that "hold" the value of the metric until the subsequent state transition. If this signal is smoothed using a moving averages filter (similar to the one used in temporal domain smoothing), then the average values of the metric can be represented during the execution of a set of instructions.
Higher-level programming abstractions and control constructs have their manifestations in the event domain. A particular phase of the program will cause a peculiar pattern of state changes that will appear in the event domain signal. These patterns can be identified, depending on their repetitions or individual occurrences, using the estimate of autocorrelation. This is no different from what was done in the temporal domain and can be considered a better approach for obtaining the same results, because the amount of storage space required for an event domain performance signal is much smaller than that for the temporal domain signal. Similarly, other pattern analysis techniques, such as the estimate of cross-correlation and thresholding, are equally suitable for event domain performance signals.

So far, we have discussed the application of the types of analyses that were performed in the temporal domain. From a practical standpoint, an event domain performance signal x(n) requires less computing resources (such as CPU time, memory, floating point operations, etc.) than a temporal domain signal w(n). This was the motivation behind Theorems 4.3 and 4.4 given in Chapter 4, which establish the difference between the two signals from a purely DSP point of view.

5.3.3 Analysis of Event-Frequency Distribution

Statistical theory has led to the development of various (probability and frequency) distributions that are used in univariate data analysis. There is a duality between frequency distribution and probability distribution, because statistical models of a phenomenon can be described in terms of either probability or frequency distributions of the data [26]. In the context of program performance analysis, we have access to the execution trace data in the performance domain, which is inherently multidimensional. We can transform this performance data to univariate data corresponding to a suitable performance metric of our choice, using the transformations to multiple analysis domains defined in Chapter 4. We can use this univariate data to show the density function of the performance metric, specifying the relative frequency (or probability) of occurrence of each element of a discrete set of values. The graphical pattern of this function will have different meanings (e.g., event-frequency distribution, system load-balance, spatial clusters, etc.) depending on the context of the performance metric being used to find the frequency density function. The shape (or pattern) of this function can be used to model a whole program executed on a particular system by its statistical properties. These statistical patterns will be useful for judging the overall performance of a program.

State transitions due to message-passing activity can be characterized by a histogram. This histogram shows the frequency-density function (from which the frequency distribution can be calculated as a cumulative frequency histogram) over the range of system states (represented by the dynamic range of the performance metric). The event-frequency density function represents the number of occurrences of events contributing to each level of the performance metric (e.g., instantaneous processor utilization), which is divided into classes or clusters. A plot of the number of events along the vertical axis and the classes (or clusters) of metric values along the horizontal axis represents the event-frequency information for the performance metric, and is also known as a histogram.
The extent of the horizontal axis is equal to the range of the metric, and that of the vertical axis, to the maximum number of events belonging to a cluster. The area within the region pertaining to each cluster is proportional to the event-frequency of that cluster. If C_i is one of the L clusters of the dynamic range of a metric, then the frequency-density function of the events is given by:

f(C_i) = \text{size}\{\, x(n) \mid x(n) \in C_i \,\} \qquad (5.12)

for i = 1, 2, \ldots, L. The event frequency-density function f(C_i) obtained this way shows the (relative) probabilities of events in the program belonging to particular ranges of the metric. The event-frequency curve shows the gross characteristics of the event frequency-distribution. The curve can take on a particular shape that is distinctive for the program and that may resemble one of the standard distributions (such as normal, bimodal, skewed, etc.). This is useful information for performance characterization and optimization, especially if it can be extended to performance modeling by fitting a suitable theoretical distribution to the observed statistical data [26, 30].

The event-frequency distribution in the event domain gives an overall idea of the dominant states of the program, in terms of transitions to those states. If the histogram is drawn from the temporal domain signal w(n), it represents the frequency-density information about various system states (including state transitions and the states themselves). Such a histogram can help visualize the load-balance of the program. If the frequency curve is skewed to one side or the other, it can provide an overall feel for the system load-balance or imbalance, provided that the metric being used represents system utilization.
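The clustering of equation (5.12) amounts to a histogram computation. The sketch below, with a hypothetical event-domain utilization sequence standing in for x(n), counts the events falling into L classes of the metric's dynamic range.

    % Event-frequency density over L clusters of the metric range.
    x = [100 100 25 0 25 0 25 0 25 50 25 0 25 0];  % hypothetical x(n)
    L = 5;                                          % number of clusters
    edges = linspace(0, 100, L + 1);                % class boundaries
    f = histc(x, edges);                            % counts per class
    f(L) = f(L) + f(L+1);                           % fold the top edge into
    f = f(1:L);                                     % the last cluster
    % bar(f) plots the histogram; a skewed shape suggests load imbalance
    % when the metric is system utilization.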
5.3.4 Markov Modeling

Building realistic stochastic models of a physical phenomenon involves a compromise between the degree of dependence among observed data values that the physical situation dictates and the degree of independence needed to facilitate probability calculations. The more independence that is built into a probability model, the more explicit the calculations that become possible; however, the realism of the model becomes more questionable. Markov processes frequently balance these two demands [88]. A Markov process has the property that the probabilistic structure of the future data value depends only on the present value, and not on the past values. Therefore, future states become conditionally independent of the past. Markov chains are Markov processes with a discrete index set and a countable (or finite) state space.

Program (or system) states can be defined, and the transition probabilities can be determined from the event domain information, to represent the program behavior as a Markov chain [1]. Various states of the system during program execution can be represented in matrix form, and the transition probabilities to and from these states can be calculated from the performance data transformed to the event domain. Different states might represent different modes of system operation, such as user, system, I/O, idle, busy, etc. The transition probabilities calculated through Markov modeling can help analyze the probability of dominant trends of resource usage by a program. These transition probabilities can be represented graphically and can help analyze whether a given transition is recurrent or transient [88]. Steady-state probabilities of various states can be computed, and the accuracy of this analysis can be determined from the information obtained from other analysis domains by applying other types of analysis methods (presented in the previous sections) that do not depend heavily on the stochastic model for the system and program behavior. Applications of discrete-time Markov chains to model various aspects of parallel computer system behavior were presented in Chapter 2. Developing useful Markov models from the event domain representation is an important future direction of this work.

5.4 Significance of the Frequency Domain

Only one reference to the use of the frequency domain for program performance analysis purposes is known to us [1], and its advantages for understanding the behavior of a program are not yet well-known. Nevertheless, the frequency domain contributes valuable information to better understand the temporal domain behavior of the performance signal. Thus, a brief discussion of analysis using the frequency domain is presented in this section.

Performance signals, whether in the time domain or the event domain, can show many variations, e.g., due to message-passing events occurring at a very frequent rate. In the time domain, there may be so many variations within a small interval of time that it is not possible even to distinguish them without adjusting the time scale. The event domain further compresses these variations. To observe the variations of the performance data, the frequency domain is a natural choice, particularly when the data has already been transformed to a signal and DSP techniques are to be applied to it. The transformation of a signal to the frequency domain is the well-known discrete Fourier transformation (DFT), as given in the previous chapter.

The performance signal is not a periodic signal, thus it cannot be specified by a single frequency. It can be thought of as a combination of all frequencies in the range 0 to \pi, which is the range that applies to real-valued, discrete-time signals [84]. We can consider the frequency \omega = 0 as the minimum frequency, which corresponds to no change in the signal. This is the case when the performance signal is flat, i.e., the performance metric remains constant throughout program execution. Conversely, \omega = \pi corresponds to the maximum (sharpest) change in the signal and is practically limited by the resolution of the instrumentation timer. Changes having a particular frequency \omega can occur at multiple instances during execution. This is reflected by the magnitude of the spectrum at that \omega. If the spectrum shows a high concentration of energy at some frequency, it means there are many changes occurring at that frequency (or rate). This approach of analyzing the frequency of variations has been applied by Abrams, Doraswamy, and Mathur [1].

The frequency domain analysis that we apply consists mainly of filtering in the frequency domain (as opposed to the time domain, where it may not be as effective) to smooth a performance signal by removing its high frequency content, in order to identify the trends in the performance data. The performance signal can be filtered through a digital low-pass filter [84] whose design is specified by the DFT of its impulse response h(n), given (ideally) by:

H(K) = \begin{cases} 1, & \text{for } 0 \le K \le (M-1)/\alpha \\ 0, & \text{for } (M-1)/\alpha < K \le M-1 \end{cases} \qquad (5.13)

It rejects the upper portion of the spectrum and retains only the lowest (M-1)/\alpha frequency bins, i.e., a 1/\alpha fraction of the spectrum.
The spectrum of the resulting filtered performance signal can be specified as:

Y(K) = X(K)\, H(K), \quad 0 \le K \le M-1 \qquad (5.14)

and the output of the filter is transformed back to the time domain by applying an inverse discrete Fourier transformation (IDFT) to Y(K).

Another, indirect use of the frequency domain is for resampling the performance signals at a lower rate (decimation). We need to use an appropriate low-pass filter before the performance signal is resampled at a lower rate, in order to avoid aliasing [84]. If the sampling rate is reduced by a factor of \alpha, then the spectrum of the signal has to be restricted to (M-1)/\alpha, as shown above by H(K). The removal of the high frequency content from the spectrum imposes smoothing on the signal, which is desirable for analyzing the trends in the performance signal.

5.5 An Illustrative Example

A simple example of the use of matrices to model performance data and transform the data to the analysis domains is presented here. This example will help to unify the ideas presented so far. For this purpose, a part of a linear system solver was instrumented. The program solves a 16 x 16 matrix on 4 processors of an nCUBE-2, which is a hypercube multicomputer. The program uses the Gaussian elimination (GE) technique on the matrix, with the rows of the matrix distributed to the processors. Each processor owns a block of 4 rows (known as the (BLOCK,*) data distribution). All the processors execute the same instructions on their local data, depending on the rows of the matrix that they own (the single-program, multiple-data (SPMD) programming paradigm).

The part of the program that was instrumented for this example is concerned with the "pivoting" operation. A broadcast is initiated by the processor containing the row being used for the current iteration, to inform all the processors of the location of the column containing the maximum (pivot) value. This operation is later used to implement an exchange of columns at each processor to complete the pivoting. The instrumented code is given in Figure 5-5 for reference. The call to tracenode was used to activate the tracing of the subsequent part of the code, up to the traceexit call.

Figure 5-5. Instrumented code for the example. (The listing shows the GE main loop: tracenode starts the tracing, the processor owning the current row computes the local maximum and initiates an nbroadcast of the pivot location, a column exchange completes the pivoting, and traceexit stops the tracing.)

The loop shown in the segment of the code repeats for all the rows, but tracing is stopped by traceexit after the first call to nbroadcast. Execution proceeds normally, and the program terminates. At the end of the program, a trace file is generated from the trace data collected at each node. The trace file contains the information about each message-passing event (such as the event-identifier, the time-stamp of the event, the processor sending or receiving the message, the message size, etc.) comprising the broadcast operation. The events are not globally sorted with respect to their time-stamps, due to the distributed nature of the data logging, and the first preprocessing step is to sort the events according to the time-stamps, keeping the causality among the events intact.
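This global sort is straightforward once the trace records are in matrix form. A minimal Matlab sketch follows; the matrix T holds hypothetical records in the column layout used here (event-identifier, seconds, microseconds, processor, ...).

    % Globally ordering trace records by their time-stamps.
    T = [ 4 0 204 0;      % hypothetical, unordered event records
          1 0  11 2;
          7 0 106 1;
          1 0  11 0 ];
    Tsorted = sortrows(T, [2 3]);   % sort by seconds, then microseconds
    % sortrows is stable, so records with equal time-stamps keep their
    % logged relative order, preserving causality among those events.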
The (PICL) trace file after this sorting operation is shown in Figure 5-6 for illustration purposes. The first field in every event record is an event-identifier that identifies the various types of events, such as the start of tracing at a processor, a message send, a message receive, the computation statistics of a processor at a certain time, the time when tracing was stopped at a particular processor, etc. The second and third fields are used for the time-stamps, where the second field gives the number of seconds since the start of tracing and the third field gives the number of microseconds. The remaining fields depend on the event type and the semantics used by the instrumentation system [45].

Figure 5-6. Format of a PICL trace file and the various event types for the example instrumented code. (Event type 1: trace_start; type 4: send; type 7: recv_blocking; type 8: recv_waking; type 13: close; type 19: trace_exit; types 20 and 21: block_begin and block_end.)

The PICL trace file can be visualized by using a conventional space-time diagram from ParaGraph. The space-time diagram in Figure 5-7 shows the local computation and communication patterns. The processors (i.e., spatial locations) are listed along the vertical axis, while time runs along the horizontal axis.

Figure 5-7. Space-time dynamics of broadcast.

The broadcast is initiated by processor 0, and a message is first passed to processor 1.
This is indicated by a diagonal line from processor 0 to processor 1. Until processor 1 receives the message, it remains idle, which is indicated by the discontinued horizontal line for processor 1. Then processor 0 sends the message to processor 2, and processor 1 sends the same message to processor 3. At the end of this step, the message has been broadcast to all the processors, and the instrumented portion of the code ends.

The trace file being used for this example can be modeled by the trace data matrix T, represented in Figure 5-8. The rows of T containing * in a column indicate the redundancy of this representation, needed to make the matrix notation possible. The first column lists the event-identifier of the event represented by a row. The second and third columns show the time-stamp, and the fourth column shows the processor number where the event was recorded. The rest of the columns are event-specific entries.

Figure 5-8. Trace data matrix for the example program. (The left side shows T with one row per event record; the right side shows the same matrix partitioned into blocks of concurrent events, e.g., E(11, p_i) for i = 0, 1, 2, 3 and E(56, p_i) for i = 1, 2, 3.)

For example, the third row from the bottom of T has the event-identifier 11. This event record is known as a compstat record, according to the semantics given by PICL [45]. Usually it is recorded at the end of a send or receive of a message by a processor. The second and third columns show that this record was logged 724 microseconds after the time tracing was started. The record was logged by processor 3, as shown by the fourth column. This event record follows the receipt of the message by processor 3 at the end of the broadcast. The fifth and sixth columns indicate the amount of cumulative time that processor 3 has been idle (i.e., sending or receiving messages) up to the current time (724). It shows that processor 3 has been idle for 536 microseconds out of the total 724 microseconds of instrumented execution time. The block matrix on the right shows the same matrix with blocks of concurrent events, as defined previously.

The matrix T has so many dimensions of data that it is not directly useful for analysis or visualization purposes. Therefore, we will not use this matrix as-is for further analysis. The analysis is based on the metric-based trace data matrix M derived from the trace data (not on matrix T). We have to select a metric for the purpose of preprocessing the trace data to get the metric-based representation as matrix M. We choose system utilization, which is the ratio of the number of processors that are busy to the total number of processors in the system. For this purpose, we browse the matrix T (actually the trace file) with the help of a program and look for the event-identifiers that indicate a change in the status of a processor, i.e., from busy to idle or from idle to busy. The state vector v(n) holds the current status of all the processors at each n as we browse down the rows of matrix T (i.e., for each time-stamp given by a row). The state vector is updated by the events causing a change in the status of any of the processors.
The system utilization value is calculated at this point, and a row is added to matrix M containing the time, the spatial location (processor number), and the current status of the processor (busy indicated by 1 and idle indicated by 0). This process continues until the end of the matrix T. The metric-based trace data matrix M for the above T matrix is shown in Figure 5-9.

Figure 5-9. Metric-based data matrix for the example program. (Each row of M lists a time-stamp n_i \in {0, 1, ..., 724}, a processor number p_j \in {0, 1, 2, 3}, and a status value m \in {0, 1}; the right side shows the partitioning into blocks of concurrent events such as M(11, p_j) for j = 0, 1, 2, 3 and M(106, p_j) for j = 1, 2, 3.)

This is one example of a metric-based matrix that can be extracted from the trace data. This matrix is useful for obtaining performance images that depict the global state of the system at an instant of time, within the time scale defined by n_i. If we were interested in looking at performance images that show the cumulative utilization of the system, the third column would have listed the cumulative busy time of the processors. The use of a state vector for this processing suggests that it can be accomplished in real time.

5.5.1 Transformation to the Spatial Domain

Now we show an example of drawing a performance image by transforming M to the spatial domain by selecting an observation time n_0. Suppose that the observation time is set at 106 microseconds after the tracing was started, i.e., n_0 = 106. While preprocessing the trace file, we know the total number of processors (P) that were used for the program. In this case, P = 4. Therefore, we keep track of the states of the four processors by using the state vector v(n). Processor states might be busy (m = 1) or idle (m = 0). This continues until the end of the block M(106, p_j) that includes all the concurrent events for time n = 106. As shown by matrix M, the four concurrent events at time n = 11 make all the processors busy (indicated by 1 in the third column). Then, at the end of the concurrent events at n = 106, processors 1, 2, and 3 become idle, whereas the (busy) state of processor 0 is unaltered. The state vector at this instant of time, v(106), contains this information. Here we stop any further browsing through the matrix M. The spatial domain transformation at this time is given by:

T_1(M(n,p))\big|_{n=106} = S(p)\big|_{n=106} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 2 & 0 \\ 3 & 0 \end{bmatrix}

As indicated by equation (4.1) in Chapter 4, the first column lists the processor numbers, while the second column lists the status of each processor (the performance metric) and is identical to the state vector at this time. The matrix S(p) undergoes the rendering operation to generate the performance image matrix G(n_0), which in this case is given by:

G(106) = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}

The first row indicates the metric values for processors 0 and 1. The second row is for processors 2 and 3. This matrix can be transformed into a digital image with image enhancement operations, such as contrast stretching and pixel replication, to facilitate its viewing. This digital image is graphically drawn in Figure 5-10 to illustrate the idea. The values of the performance metric are mapped to a gray scale (for this example; otherwise a color table is used), such that 1 is mapped to white and 0 is mapped to dark gray.

Figure 5-10. Performance image example.
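The walk through M just described can be written compactly. The following Matlab sketch reproduces the worked example: it maintains the state vector v over the rows of M (transcribed from Figure 5-9) up to the observation time n_0 = 106, then forms the spatial domain matrix S(p), the 2 x 2 performance image G(106), and the utilization value.

    % From the metric-based matrix M to a performance image at n0.
    M = [ 11 0 1; 11 1 1; 11 2 1; 11 3 1;
         106 1 0; 106 2 0; 106 3 0 ];   % rows of M up to the n = 106 block
    P  = 4;
    n0 = 106;
    v  = zeros(P, 1);                   % state vector v(n)
    for r = 1:size(M, 1)
        if M(r, 1) > n0, break; end     % stop after the n0 block
        v(M(r, 2) + 1) = M(r, 3);       % update that processor's status
    end
    S = [(0:P-1)' v];                   % spatial domain matrix S(p)
    G = reshape(v, 2, 2)';              % performance image G(106) = [1 0; 0 0]
    util = 100 * sum(v) / P;            % system utilization: 25%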
5.5.2 Transformation to the Event Domain

The event-domain transformation is used as an intermediate step for transforming to the temporal domain, due to its practical advantages. Therefore, an example of the transformation of the metric-based performance data matrix M to the event domain is shown here. According to the event domain transformation, we have:

T_2(M(n,p)) = M_e(n) = \begin{bmatrix} 0 & x(0) \\ 11 & x(11) \\ 106 & x(106) \\ \vdots & \vdots \\ 724 & x(724) \end{bmatrix}

Once again, the performance metric will be system utilization, and the performance signal given by the values in the second column of the above matrix represents changes in the system utilization values. If x(106) represents the fraction of the system being utilized (busy) at time n_i = 106, which is calculated with the help of the state vector v(n) at n = 106, then it can be obtained from M as:

x(106) = \frac{1}{P}\,[1\ 1\ 1\ 1]\, v(106) = \frac{1}{4} \sum_{k=0}^{3} v_k(106) = 25\%

It should be noted that x(0) is 100% with this metric, as all processors are assumed to be computing locally (i.e., busy) at the start. Also, the index of the vector x is not time in this case; it is the sample number of the performance signal. The performance signal x(n) thus obtained from M_e(n) is shown in Figure 5-11.

Figure 5-11. Calculation of the event domain performance signal for the example program. (M_e(n) is the 15 x 2 matrix of event times and utilization values, and x(n) is the corresponding 15 x 1 signal.)

This event domain performance signal is plotted (using Matlab) in Figure 5-12. As described above, it shows the initial 100% utilization, after which the system utilization varies according to the events during the execution of the instrumented part of the program.

Figure 5-12. Performance signal in the event domain, x(n). (The plot shows percent system utilization against the samples of the performance signal, n.)

5.5.3 Transformation to the Temporal Domain

After transforming the metric-based matrix M to the event domain, it can be transformed to the temporal domain. We obtain a temporal domain performance signal w(n) according to equation (4.8). It is given as:

T_3(M_e(n)) = w(n) = \begin{bmatrix} A(0) \\ A(11) \\ A(106) \\ \vdots \\ A(724) \end{bmatrix}_{725 \times 1}

Again, the block A(0) is needed for the purpose of starting the time-scale from time n = 0, when tracing began. Since the whole system is being utilized at the start of tracing (100% utilization), the values of this block can be given as:

A(0) = \begin{bmatrix} 100 \\ 100 \\ \vdots \\ 100 \end{bmatrix}_{11 \times 1}

The temporal domain performance signal is plotted (using Matlab) in Figure 5-13. It shows all the state holding times that were missing in the event domain performance signal.

Figure 5-13. Temporal domain performance signal w(n). (The plot shows percent system utilization against time n.)

5.5.4 Transformation to the Frequency Domain

The frequency domain graph shows the spectrum of the signal. The spectrum of a temporal domain performance signal w(n) can be obtained by applying the DFT to w(n). This is shown in Figure 5-14. The spectrum can also be obtained from the event domain signal, as shown in Figure 5-15. A visual comparison between the spectra shows that the temporal domain signal has more energy at lower frequencies. This is expected due to the state holding times included in the signal.

Figure 5-14. Frequency domain spectrum of the temporal domain performance signal.

Figure 5-15. Spectrum of the event domain performance signal.
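The relationship between the event and temporal domain signals, and between their spectra, can be sketched as follows. The event times t and values x below are illustrative stand-ins patterned on the example rather than the exact traced values; the loop plays the role of transformation T_3 by holding each value until the next event.

    % Rebuild a temporal signal from an event-domain one, then compare
    % the DFT spectra (cf. Figures 5-12 through 5-15).
    t = [0 11 106 204 325 399 553 577 724];   % hypothetical event times
    x = [100 100 25 0 25 25 25 50 0];         % hypothetical x(n) values
    w = zeros(1, t(end) + 1);                 % temporal signal w(n)
    for k = 1:length(t)-1
        w(t(k)+1 : t(k+1)) = x(k);            % hold x(k) until next event
    end
    w(end) = x(end);
    Wspec = 20*log10(abs(fft(w)) + eps);      % spectrum of w (Figure 5-14)
    Xspec = 20*log10(abs(fft(x)) + eps);      % spectrum of x (Figure 5-15)
    % The holding times concentrate the energy of w at low frequencies.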
This concludes the example illustrating the transformation techniques defined earlier. The case studies presented in Chapter 6 are based on the same techniques presented by this simple example.

5.6 A Discussion on Applying Multiple-Domain Analysis

In this chapter, multiple-domain analysis methods have been presented independently of any other methods that are used for program performance analysis. The purpose of this section is to emphasize the fact that the multiple-domain analysis approach can be used in conjunction with typical, conventional analysis methods and displays in a systematic manner. This will ensure the scalability, extensibility, and efficacy of the program performance analysis in most cases. The purpose of the multiple-domain analysis approach is not to prove that the conventional methods and typical displays do not work in all cases. These methods are effective in their own right. However, the use of multiple-domain methods can (1) supplement the user's knowledge obtained from conventional methods, and (2) provide analysis within the specific context of a domain. The advantages of the multiple-domain analysis approach for analyzing program performance and behavior include:

- Identification of any macroscopic program execution patterns that might be present in the temporal, event, or spatial domains.
- Identification and analysis of repetitive temporal patterns that might be linked with loops in a program.
- Analysis of macroscopic states of the system during the execution of a program in the spatial domain to represent control- and data-flow among the processors.
- Statistical characterization of a program by using the frequency distribution data of a metric.

The multiple-domain analysis methods are not free from limitations. Their main limitation is associated with the premise on which they were developed: they focus on one domain only, at the expense of analyzing the performance data in more than one domain at the same time. Conventional tools usually allow this feature with specialized displays (such as the space-time diagram), which are very effective as long as the number of processors and the volume of trace data remain small. Due to the advantages as well as the limitations of multiple-domain analysis, we suggest the following general guidelines for analyzing the performance of a particular program:

1. Conventional displays that represent both the spatial and temporal dynamics of the program simultaneously should be used. This will be possible when scalability is not a problem and will contribute towards the user's familiarity with all aspects of the program.

2. Temporal domain information should be used to understand the temporal behavior and patterns of the program. The user's knowledge about the program is needed at this step to link the patterns with the program.

3. Spatial domain information can be used for more precise analysis of the system state at a given instant of time. This will complement the user's understanding of the temporal behavior and patterns of the program.

4. Frequency distributions of the metrics should be used to statistically characterize and distinguish one program's performance from another.
5. Most of the statistical information obtained in each of the domains can be used to stochastically model the behavior of the program, which can then be used to predict the values of a performance metric. Such models can rarely be built automatically and often require user input. Therefore, statistical information should be used carefully to avoid wrong predictions about the behavior of the program and system.

These are merely guidelines. We continue to refine them into a more well-defined methodology that a user can apply in practical situations. These guidelines assume that the user has access to conventional as well as generic data analysis tools to perform both conventional and multiple-domain analysis. We have proposed a toolkit approach [119, 120] in which conventional and generic tools are available to the user with a common data transfer protocol. Based on our experiences with conventional and generic data, signal, and image processing tools, we also propose that data can be transferred effectively among most of the tools if it is treated as a matrix [119]. In conclusion, the multiple-domain analysis approach is advocated in order to:

1. increase the scalability of the analysis and presentation methods;
2. precisely analyze a particular aspect of the performance in a particular analysis domain;
3. characterize the higher-level abstractions of the behavior of parallel programs from lower-level trace data; and
4. estimate and visualize various performance metrics in appropriate domains using rigorous analysis methods commonly used in the statistical analysis, numerical analysis, and digital image and signal processing areas.

Chapter 6
Case Studies

In the previous chapters, we have rigorously defined the matrix representation of performance data and its transformations to the analysis domains. We have also identified the analysis methods in the analysis domains that can be applied for the purpose of more effective analysis of program performance. In this chapter, we will show the application of these methods to the performance analysis of some application programs executed on various parallel systems. In the first section, we explain the context of all the case study programs that will be discussed in the later sections; then we explain specific analysis methods in each subsequent section. We have not divided the case studies according to the analysis domains and the analysis methods of each domain. Instead, our approach will be to present the application of these methods to accomplish various performance analysis objectives.

6.1 Context of Case Studies

We have chosen two types of programs for the case studies, implemented on two types of parallel systems. The first program approximates the value of pi (3.14...) by numerically evaluating the integrand 4.0 / (1 + x^2) on the interval [0.0, 1.0]. To compute the integral numerically, this interval is divided equally among P processors. Each processor further subdivides its region into m subintervals and computes and locally sums the areas of its subintervals. After this local computation, all processors participate in a global sum-exchange operation of log_2 P exchange-add steps to finally get a precise value of pi (a serial sketch of this computation is given below). This is a simple program which will be used only for the purpose of introducing some case studies; more emphasis will be placed on a suite of programs of the second type. These programs, termed linear solvers, solve a system of linear equations by Gaussian elimination followed by backsubstitution.
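For reference, the numerical integration at the heart of the pi program can be sketched serially in a few lines of Matlab. The processor and subinterval counts are illustrative; on a real run, each of the P processors would sum its own m subintervals before the log_2 P sum-exchange steps.

    % Midpoint-rule approximation of pi from the integrand 4/(1 + x^2).
    m  = 1000;                        % subintervals per processor
    P  = 4;                           % number of processors (assumed)
    h  = 1 / (m * P);                 % width of each subinterval
    xk = ((1:m*P) - 0.5) * h;         % midpoints of all subintervals
    pi_est = h * sum(4 ./ (1 + xk.^2));
    % pi_est approaches 3.14159... as m*P grows.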
These programs were executed on an nCUBE-2 multicomputer [76] and were instrumented and visualized using the PICL and ParaGraph toolset [44, 50]. The nCUBE-2 is a distributed-memory, message-passing Multiple Instruction stream, Multiple Data stream (MIMD) computer. The Portable, Instrumented Communication Library (PICL) provides a tracing facility to record computation and communication events during program execution. It supports application-level monitoring by instrumenting the source code with calls to its communication library functions. The PICL routines are invoked during program execution by the events of interest, and time-stamped tracing is accomplished concurrently on all processors. ParaGraph complements this monitoring by replaying PICL-generated trace files and displaying a number of views.

Some of the case studies will involve another matrix solver benchmark, known as SLALOM (Scalable, Language-independent, Ames Laboratory One-minute Measurement). It is the first benchmark to compare the performance of various computers on the basis of the work that can be done within a fixed time (usually one minute). An in-depth discussion of the design of this benchmark is given in [48, 90]. Unlike many other benchmarks, it solves a real problem (optical radiosity on the interior of a box). The walls of the box are decomposed into patches (n). This number n is the figure of merit used to rank various architectures, instead of execution time, because the time is fixed and the problem size scales according to the capabilities of the machine. SLALOM determines the largest problem size (i.e., n) that can be executed on a machine in less than or equal to one minute. There are seven tasks within SLALOM, identified as reader, region, setup1, setup2, setup3, solver, and storer. We consider the solver task in this section. The solver performs the Gaussian elimination and back-substitution methods to solve a system of linear equations arising from the radiosity problem. In the case studies using SLALOM, we analyze the operation of the SLALOM program (specifically solver) on a MasPar MP-1 data-parallel computer using 16,384 processing elements (PEs). The MasPar MP-1 is a distributed-memory, massively parallel Single Instruction stream, Multiple Data stream (SIMD) computer with a toroidal-mesh interconnection topology.
The remainder of this section describes the program, system, and metric used as the means of presentation. 6.1.1 The Program and System In each case, the program under study is a variant of matrix solution by Gaussian elimination; and the performance metric is a form of processor utilization. These provide a familiar context only and do not favor the analysis methods. The parallel program consists of two main phases: (1) Gaussian elimination, where matrix data accesses occur in a pattern that successively moves “down” the matrix diagonal; and (2) back-substitution, where matrix data accesses occur in a pattern that successively moves “up” the matrix diagonal. Each phase consists of loops of repetitive operations. The programming model is SPMD (Single-Program Multiple-Data) for the nCUBE-2 and SIMD (Single-Instruction Multiple-Data) for the MasPar MP-l. A number of data distribution (or machine mapping) 114 strategies are possible. Using temrinology from the Fortran D specification [36], the results presented here are from three strategies: (1) DISTRIBUTE (CY CLIC, CYCLIC); (2) DISTRIBUTE (BLOCK,*); and (3) DISTRIBUTE (CY CLIC,*). These are shown in Figure 6-1. All are two-dimensional distributions of the matrix among processors, where N/P rows{ p1 1 row 92 Pa 94 (cvcuc, CYCLIC) (BLOCK,*) (cvcuc,*) Figure 6-1. Data distributions (here, P = 4 and N = 8). the latter two divide along the first dimension (rows) only. Assuming P processors and N rows in a decomposition, where N is divisible by P: the (CY CLIC, CYCLIC) distribution divides the decomposition in a round-robin fashion in the first and second dimensions (rows and columns); the (BLOCK,*) distribution divides the decomposition into contiguous blocks of size MP in the first dimension (rows), assigning one block of rows per processor; and the (CY CLIC,*) distribution divides the decomposition in a round- robin fashion in the first dimension (rows), assigning every P-th row to the same processor. There are a number of opportunities for comparative analysis. VVrth regard to the program, we can look at different phases, data distributions, algorithms, and problem (matrix) sizes. We can also use different system (machine) sizes. Each of these impacts the observed performance and will be used to demonstrate the analysis methods. 6.1.2 Performance Metric and Result Summary The basic performance metric that will be shown in the following sections is processor utilization. This metric is obviously an important indicator of how well a program is running and using system resources. It also can be shown in some form in each of the 115 analysis domains. The term “busy” is often used to describe processor utilization, e.g., as the “number of processors that are busy” or the “percent of time that a processor is busy”. In general, in a message-passing multicomputer, a processor can be computing, communicating, or idle (waiting to receive a message or done computing). If we only distinguish between “busy” and “idle”, what does “busy” mean? We adopted the definition used by PICL: a processor is idle if it is executing system routines that would not be necessary if executed on a serial computer; otherwise, it is busy [45]. By this definition, a processor that is communicating is also considered idle. So, busy means doing only “useful computational work”. 
6.1.2 Performance Metric and Result Summary

The basic performance metric shown in the following sections is processor utilization. This metric is an important indicator of how well a program is running and using system resources, and it can be shown in some form in each of the analysis domains. The term "busy" is often used to describe processor utilization, e.g., as the "number of processors that are busy" or the "percent of time that a processor is busy". In general, in a message-passing multicomputer, a processor can be computing, communicating, or idle (waiting to receive a message or done computing). If we only distinguish between "busy" and "idle", what does "busy" mean? We adopt the definition used by PICL: a processor is idle if it is executing system routines that would not be necessary if the program were executed on a serial computer; otherwise, it is busy [45]. By this definition, a processor that is communicating is also considered idle, so "busy" means doing only useful computational work. While this distinction is important for characterizing program behavior, it has no impact on the use of the analysis domains; we mention it merely as part of the context in which results are presented.

Table 6-1 summarizes the program execution times for the different combinations (times are for programs without instrumentation). Note that considerably more time is spent in the Gaussian elimination phase than in back-substitution in each case. Also note that the execution times for the (BLOCK,*) distributions are lower than those for the corresponding (CYCLIC,*) distributions. Finally, these are fairly small runs in terms of the machine and problem sizes used. This was necessary to collect performance data having sufficient detail while keeping the size of the trace files manageable. Alternative tracing strategies are needed for larger runs. However, the multiple-domain analysis approach is not impacted by the amount of performance data: it is scalable, given a scalable data collection facility.

TABLE 6-1. Summary of program execution times.

    Distribution   Machine     Problem    Time: Gaussian     Time: Back-         Total Execution
                   Size (P)    Size (N)   Elimination (sec)  substitution (sec)  Time (sec)
    (BLOCK,*)      4           16         0.017436           0.002731            0.020167
    (BLOCK,*)      16          64         0.210000           0.034843            0.244843
    (BLOCK,*)      64          128        1.181063           0.174400            1.355463
    (CYCLIC,*)     4           16         0.020850           0.004483            0.025333
    (CYCLIC,*)     16          64         0.270537           0.053075            0.323612
    (CYCLIC,*)     64          128        1.527301           0.259702            1.787003

Selected results that demonstrate the various aspects of our approach are given in subsequent sections.

6.2 Representing the Machine Mapping

An important consideration is how to present and analyze performance information in the context of the program without requiring completely specialized displays, as in program visualization (or algorithm animation) in the application domain. At a minimum, a tool should provide a dynamic display of the source code. A practical alternative to application-specific displays, however, is to represent the programming model [94, 95]. This is more general, yet gives the user useful information about the data and/or control structures of the program. Moreover, it likely places a lesser burden on the user because of the potential for compiler and/or run-time support of the model. For example, in the SPMD model, the relationship of performance data to program data and operations relies on the mapping of program data onto the machine (i.e., the data distribution or machine mapping). In fact, a coupling property can be identified that binds the performance display to the program. This property is similar to the relationship between processes and data illustrated in the data-parallelism view by LeBlanc et al. [65]. The use of this approach with the SPMD and data-parallel models is discussed in more detail in [95]; however, an example is given here to show how performance images in the spatial domain can be related to the application domain.

6.2.1 Method: Performance Images

A one- or two-dimensional performance image can represent the machine mapping for structured problems by depicting the spatial relationship among processors with respect to the data distribution. In general, if a so-called coupling property can be stated, any relationship among performance data can likewise be projected to make effective use of performance images. This coupling property can be used while rendering the spatial-domain matrix to an image, as discussed in Chapter 4.
6.2.2 Example: Linear Solver

Consider the case of the backsubstitution phase of the program on 16 processors of the nCUBE-2 using a (CYCLIC, CYCLIC) distribution. A coupling property of this mapping is that a row of processors "owns" a row of the matrix, and a column of processors "owns" a column of the matrix. A binary performance image is shown in Figure 6-2, in which each cell is encoded with the state of the processor at a particular time t0, where white denotes busy status (including communicating) and gray denotes idle status. The image is updated over time to reflect changes in state. This image, particularly when animated, clearly shows the sequential effect of stepping row by row up the matrix diagonal during the backsubstitution phase: on average, at most one row of processors in the grid is busy at any time. In one view, we can see the utilization of the system and the pattern of data movement in the program. This view helped to "decode", in terms of matrix data movement, the more conventional space-time diagram shown adjacent to it. It also helped to optimize the data distribution strategy for this phase. In the recent release of ParaGraph, the Processor Status and Network displays show information similar to this type of performance image.

Figure 6-2. Performance image used to represent the machine mapping.

6.3 Viewing the Macroscopic State of Systems

The preceding section demonstrated the use of a performance image in the context of a small-scale system. Performance images and signals, however, are especially useful for representing (and possibly even modeling) the macroscopic state of a large-scale system in a particular domain. A macroscopic display is the first step in a top-down approach to performance analysis. Once again, let us consider the spatial domain and the machine mapping.

6.3.1 Method: Performance Images

It was shown in Chapter 5 that performance images can be used to represent the state of the system at a particular observation time during the execution of the program. This was made possible by mapping the value of the performance metric (cumulative utilization, in the accompanying example) to a particular color, using the available color table index.

6.3.2 Example: SLALOM

Two cases of (symmetric) matrix solution on a 16,384-PE MasPar MP-1 (128 x 128 PE grid) using a (CYCLIC, CYCLIC) distribution are compared. In one case, there is an even distribution of matrix elements to PEs (1024 x 1024 matrix); in the other, an uneven distribution (1026 x 1026 matrix). The mapping for the uneven case is pictured in Figure 6-3. Performance images of final PE utilization (where each cell is encoded with the percent of time the PE was busy, or in the active set) are shown in Figure 6-4 (PE 0 is denoted by the upper left cell). Two results that relate to the program are clearly visible in the images. In each image, the effect of a symmetric solution algorithm is shown by the contrast in utilization between the lower (greater than 50%) and upper (less than 50%) triangles. In the second image, the effect of the uneven distribution is shown by the strip of PEs at the top having higher utilization.

Figure 6-3. Machine mapping for uneven distribution.

Figure 6-4. Performance images of PE utilization: even and uneven distributions.
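To make the rendering step of Section 6.3.1 concrete, the following minimal sketch (Python/NumPy rather than the AVS pipeline actually used; all names are illustrative) maps a matrix of per-PE utilizations to color-table indices, assuming a 256-entry color table running from 0% (blue) to 100% (red) as in the figures above.

    import numpy as np

    def utilization_to_image(util, table_size=256):
        """Map a 2-D matrix of utilizations in [0, 1] to color-table indices.

        0.0 maps to index 0 (e.g., blue) and 1.0 to the last index
        (e.g., red), mirroring the color scale of Figures 6-4 and 6-5.
        """
        util = np.clip(util, 0.0, 1.0)
        return np.rint(util * (table_size - 1)).astype(np.uint8)

    # Example: render a 128 x 128 PE grid of (here random) utilizations.
    rng = np.random.default_rng(0)
    image = utilization_to_image(rng.random((128, 128)))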
Notice that such views help to identify classes of behavior in a large system so that more detailed analyses (within a class, among classes, or of outliers) can be conducted.

6.4 Analysis by Small Multiples

Analysis by small multiples is a classical visualization technique in which an image is analyzed by incrementally advancing the time and examining the changes while keeping the perspective fixed [116, 117]. This method can isolate the changes occurring over time. An application of analysis by small multiples is demonstrated with performance images.

6.4.1 Method: Image Subtraction and Binary Thresholding

The visual detection of changes over time can be difficult if the incremental changes are small or obscured by other effects. Change detection via image subtraction is applied to successive performance images to enhance differences and similarities and to highlight the magnitude, direction, and distribution of change. The application of the image subtraction technique to performance images was rigorously defined in Chapter 5. Image subtraction can be coupled with binary thresholding to focus only on the type of change, i.e., positive or negative in terms of the cumulative utilization metric represented by colors. A positive change indicates an increase in the value of the performance metric at a particular spatial location during the interval of execution time depicted by two performance images, one taken at the start and the other at the end of the interval.

6.4.1.1 Example: SLALOM

Performance of the Gaussian elimination algorithm within solver is shown in Figure 6-5. A series of twelve snapshots taken over time during the algorithm depicts cumulative PE utilization. PE 0 corresponds to the upper left cell of the performance images, and PEs are arranged in row-wise order. A two-dimensional scattered decomposition method was used. The sharp contrast in utilization between the upper and lower triangles of the PE grid is due to the behavior of the symmetric matrix solution method (discussed in detail in [93, 95]). These images are rendered with each of the 16K PEs occupying one pixel. The performance images of Figure 6-5 are quite simple, yet it is not trivial to extract meaningful information from the "raw" images. To reap the benefits of this representation for performance visualization of large-scale parallel systems, we need to apply analysis techniques.

Figure 6-5. Performance images of GE in SLALOM (series of 12 snapshots over time; the color table runs from 0% cumulative PE utilization, blue, to 100%, red).

Figure 6-6 shows the result of the subtraction (3 - 2), which detects the change in utilization from snapshot #2 to snapshot #3. In the image statistics box, the range of color indices for the subtracted image is between -5 and 5, which indicates relatively small changes with respect to the full range. These indices are then mapped to the full color range in AVS so that the lower half of the indices represents negative change and the upper half, positive change. From the histogram of the subtracted image, one can observe that most PEs incur increases in utilization, and the statistics give a mean index of 2.9 (in the range -5 to 5). By inspecting the color image in conjunction with the color table, one can further see that relatively larger positive changes occur in the upper triangle. Therefore, by comparing successive performance images, the user can observe the spatial changes in a given program over time.

Figure 6-6. Change detection via image subtraction.
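The subtraction and thresholding operations just described can be sketched as follows (again an illustrative NumPy fragment, not the AVS implementation; the remapping convention assumed here mirrors the description of Figure 6-6). The same subtraction, with one image held fixed as a reference, also serves the template matching of Section 6.4.2.

    import numpy as np

    def subtract_images(a, b, table_size=256):
        """Difference of two color-index images, remapped to the full color range.

        Negative differences fall in the lower half of the color table and
        positive differences in the upper half, as in Figure 6-6.
        """
        d = a.astype(np.int16) - b.astype(np.int16)
        span = max(abs(int(d.min())), abs(int(d.max())), 1)   # e.g., 5 in the example
        remapped = (d / span + 1.0) / 2.0 * (table_size - 1)
        return d, remapped.astype(np.uint8)

    def threshold_increases(d):
        """Binary image: 1 where utilization increased, 0 elsewhere (Figure 6-7)."""
        return (d > 0).astype(np.uint8)

    # Usage: d, img = subtract_images(snapshot3, snapshot2)   # the (3 - 2) image
    #        print(d.mean())                                  # e.g., a mean index of 2.9
    #        mask = threshold_increases(d)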
6.4.1.2 Example: SLALOM

Consider matrix solution on the MasPar MP-1, for the uneven distribution case. Images were generated at a series of times throughout program execution. Two images can be subtracted to show the spatial utilization difference pattern from one time to the next, and the subtracted image can be further transformed into a simpler binary image via thresholding, by mapping increases in utilization to one value (shown as black) and non-increases to a second value (shown as gray). The resulting images are shown in Figure 6-7. Note how the load imbalance in the top rows is emphasized.

Figure 6-7. Analysis by small multiples using image subtraction and binary thresholding.

6.4.2 Method: Spatial Pattern Matching

Many application programs exhibit specific patterns in the spatial domain when performance images are used to characterize that domain. These patterns can depict the behavior of the program in the spatial domain and can be used for comparisons among various implementations or phases of the same program. Such a comparison can be carried out using the image-processing technique of template matching: one image is used as the template and is subtracted from the image under analysis to determine the degree of dissimilarity at each spatial location, provided the two images are of the same size. This method was explained in Chapter 5. Template matching can be used to determine changes in the performance metric (and hence the performance of a program) by comparing its different implementations: one performance image from one implementation is used as the reference (template), and performance images from other implementations of the same program are compared against it. This procedure identifies the spatial locations where the metric changed the most and can expose bottlenecks in the performance of a program. Spatial patterns will be discussed further after we analyze repetitive and non-repetitive patterns in the temporal domain; it will be shown that spatial patterns and their analysis techniques supplement the information obtained from the temporal domain.

6.4.2.1 Example: Comparison among Data Distributions

In this example, we compare the patterns from two implementations of the GE phase of the linear solver on 64 processors of the nCUBE-2 multicomputer. The first implementation uses the (BLOCK,*) data distribution, and its performance image showing cumulative busy time at the end of the program serves as the template. The second implementation uses the (CYCLIC,*) data distribution for the same GE phase on 64 processors. Figure 6-8 shows the comparison of the two implementations, obtained by subtracting the template image from the performance image of the (CYCLIC,*) implementation. The subtracted image is shown below the two images, with its histogram on the left and its statistics. The red color at the top left corner of the subtracted image indicates that the processors owning the top rows were utilized better in the (CYCLIC,*) case than in the (BLOCK,*) implementation. The difference in utilization at the other processors is not large. However, this analysis also provides precise information through the statistics and histogram.
The histogram shows more occurrences at negative values, indicating the dominance of "negative" changes in utilization; i.e., overall cumulative processor utilization was better for the (BLOCK,*) implementation (by a nearly negligible margin of 1.2%) than for the (CYCLIC,*) implementation.

Figure 6-8. Comparison of the GE phases of different data distributions via template matching (mean index -3.0/256, a 1.2% decrease).

6.4.2.2 Example: Comparison among Phases

The template matching technique can also be used to compare the performance of various phases of a program. Such comparisons can often help to precisely locate the part of the program that is degrading the performance of the whole program. The linear solver is divided into two phases: the first converts the matrix into upper-triangular form using GE, and the second uses BS to calculate the solution vector. The performance image representing cumulative processor utilization during the BS phase is used as the template image (for no particular reason; the GE phase could have been chosen), and the template is subtracted from the corresponding performance image of the GE phase. The difference image is shown in Figure 6-9, with its histogram and statistics. The difference image shows negative values (represented by blue) at the processors owning the top rows of the matrix; i.e., processor utilization was lower for these processors during the GE phase than during the BS phase. Similarly, the utilization of the processors that owned the bottom rows of the matrix was better in GE than in BS; therefore, the difference image shows positive values (represented by red) at those spatial locations. This comparison is consistent with our knowledge of the program: matrix data accesses move "down the diagonal" in the GE phase, so processors that owned bottom rows were more utilized, whereas accesses move "up the diagonal" during the BS phase, so processors that owned top rows were more utilized. The histogram and statistics in Figure 6-9 represent this information more precisely, showing comparatively better utilization (by 5.4%) in the GE phase.

Figure 6-9. Comparison of the GE and BS phases of the linear solver (mean index 13.7/256, a 5.4% increase).

6.5 Analysis of Temporal Patterns

Temporal patterns of parallel programs are either repetitive or unique. Repetitive patterns are typically due to loops in a program. Patterns can be unique due to the nature of the high-level message passing involved, or they may even be defined arbitrarily by a user who expects them in the program. The reliable identification of such expected (or suspected, in the case of debugging) patterns can help in understanding the behavior of the program and can be useful for optimizing its performance. The temporal pattern analysis techniques presented here are intended to let the user identify where an expected pattern may be found in the performance signal obtained from the program. These methods can be used in conjunction with typical program visualization displays if the user is familiar with the patterns in a particular display.
Visualization tools usually do not provide any support to the user for identifying known patterns; therefore, the user's a priori knowledge about the behavior of a program goes unutilized. Our main objective is to exploit this a priori information and, using appropriate techniques, to identify the temporal locations during the execution of a program where the expected patterns are likely to be present.

6.5.1 Examples of Patterns Resulting from Message-Passing

Before analyzing program patterns (repetitive or otherwise), we present some examples of spatial and temporal patterns resulting from high-level message-passing calls. These patterns are demonstrated using the space-time diagram of ParaGraph [50] and (event-domain) performance signals plotted using Matlab [71]. This allows us to combine the spatial and temporal dimensions inherent in these patterns and in the interactions of the processors during message-passing activity. Using performance signals also introduces them as a representation of patterns, since our pattern analysis techniques are based on this characterization. The objective is mainly to familiarize the reader with the patterns resulting from often-used message-passing calls, using conventional displays as well as performance signals. We introduce the patterns resulting from broadcast, global summation, and barrier synchronization calls.

6.5.1.1 Broadcast

Broadcast is a high-level abstraction of inter-processor communication. Several algorithms use broadcast to distribute information from a source processor to all other processors in the allocated ensemble. Our objective is to determine specific patterns that broadcast might generate in the temporal and spatial domains. Such a pattern can serve as a template for a given instrumentation system and a given number of processors. Every instrumentation system can handle broadcast in its own particular way, so a template that works for one instrumentation system might not work for another due to differences in the communication protocols used. The pattern might also depend on the type of interconnection network (hypercube, mesh, etc.), because different architectures can have different optimal protocols for implementing broadcast.

The broadcast shown in Figure 6-10 was implemented on 16 processors of the nCUBE-2 multicomputer using PICL's low-level primitives [45], with the message originating from processor 0. The diagonal lines in the space-time diagram show message-passing activity between the processors listed along the vertical axis, and the passage of time is shown along the horizontal axis.

Figure 6-10. Spatial and temporal dynamics of broadcast among 16 processors.

The broadcast proceeds in such a way that each processor passes the message to its four neighbors, since there are 16 processors (a 4-dimensional cube). The performance signal x(n) shows the system utilization (in percent) along the vertical axis and the number of discrete samples (occurrences of events of interest at which the metric changed) along the horizontal axis. It is clear that the system utilization is very low at the start of the broadcast (the peak at n = 0 is due to the synchronization call implemented by PICL to generate consistent time-stamps), as most of the processors are waiting for messages to be received.
The system utilization periodically increases as the message reaches the processors and they exit the message-receiving phase and become busy (according to PICL's definition). This is a specific pattern, both in time and space, and can be expected to occur in programs using PICL's broadcast call.

6.5.1.2 Global Summation

Global summation is one of the combining operations that PICL provides at a higher level, for the summation of distributed vectors. The summation operation results in a single vector at the "root" node of the call, which is the sum of a set of distributed vectors. The space-time pattern of a global summation operation involving 16 processors is shown in Figure 6-11. The root node in this case is processor 0, which ends up with the global sum.

Figure 6-11. Spatial and temporal dynamics of global summation.

As shown by the space-time diagram, the summation process proceeds in a logarithmic fashion. First, eight processors compute their local sums and pass them along to the other set of eight processors. In the next step, four processors communicate their results to the other four, and so on. The same activity in the temporal domain is represented by the performance signal x(n), with system utilization (in percent) shown along the vertical axis. After the initial peak due to synchronization, the system utilization is very low, as half of the processors are passing messages (idle according to PICL's definition). The system utilization then gradually decreases in a periodic manner as logarithmically fewer processors take part in the communication. This process repeats four times. Again, this is a specific pattern and can be found in any program using the global summation call.

6.5.1.3 Synchronization

Synchronization is a high-level communication process that can be used to synchronize the activity of processors in a distributed system. Such calls are important for instrumentation purposes. PICL uses this call after opening communication channels among the processors to synchronize their clocks; this is needed to ensure the consistency of the time-stamps of the distributed event traces. The call is implemented by having the processors exchange specific messages, at the end of which their clocks are synchronized. The specific pattern of this call is shown by the space-time diagram in Figure 6-12. After the processors are synchronized, the system utilization holds at its maximum value (100%) for a certain amount of time, as shown by the performance signal. This phenomenon can be noticed in all the space-time diagrams and performance signals used for the analysis of patterns in what follows.

Figure 6-12. Spatial and temporal dynamics of synchronization.

These examples of spatial and temporal patterns show that various high-level message-passing activities generate distinctive spatial and temporal patterns. Knowledge of their shape in a particular domain (spatial or temporal), together with their characterization (as performance images or signals), can prove valuable in identifying them in order to debug or analyze the performance of a program. In the following, we present a hierarchy of techniques to analyze these patterns. For the purpose of analysis, we classify both repetitive and unique patterns as macroscopic or microscopic, and treat them separately due to the different degrees of precision required to analyze each type of pattern.
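To make the event-domain characterization concrete, the sketch below computes a system-utilization signal x(n) from per-processor busy/idle transitions, i.e., one sample per event of interest. As before, this is an illustrative Python fragment, not the Matlab tooling used in this work, and it assumes (time, node, busy) tuples such as those produced by the trace-reading sketch in Section 6.1.

    def utilization_signal(events, num_procs):
        """Event-domain performance signal: the percent of processors busy
        after each state-transition event, in event (time-stamp) order."""
        busy = [False] * num_procs
        x = []
        for _, node, is_busy in events:       # events assumed sorted by time
            busy[node] = is_busy
            x.append(100.0 * sum(busy) / num_procs)
        return x                              # x[n] = system utilization at event n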
For the purpose of analysis, we have classified both repetitive and unique patterns as macroscopic and microscopic patterns, and have treated them separately due to different degrees of precision required to analyze each type of these patterns. 131 ”MUZCZ roommmnozo'u annoaumao i 512 i Event Domain Performance Signal 100 50r- x(n) Number of events -- n Figure 6-12. Spatial and temporal dynamics of synchronization. 6.5.2 Analysis of Macroscopic Patterns In some cases, if the number of samples in a performance signal is not excessively large due to a large system and/or problem size, then macroscopic patterns and trends can be directly observed from the performance signal. As stated above, these patterns are often hidden due to a large amount of variance present in the signal. If we could remove this high variance from the performance signal (which is a time-series), the macroscopic patterns and trends (repetitive or otherwise) will immediately become clear. This is possible by applying appropriate DSP techniques to filter out the high variance contents of the performance signal which will result in a “smoothed” version of the performance signal. The method that we have selected applies a digital moving averages filter to smooth the performance signal. We give a brief introduction to this method and refer to [18, 42, 84, 85] for precise details of the implementation. 132 6.5.2.1 Method: Moving Averages The use of moving averages for the removal of high variance from time-series performance data for the purpose of smoothing is a well-known technique. Researchers [17 , 73] have shown its use for the identification of stationary phases of a program and have used this information for the prediction of speed-up characteristics [17]. The only difference in our technique is that we have characterized the time-series data as a discrete- time signal and have designed a digital filter for this purpose. This digital filter is a finite impulse response (FIR) type of filter and is causal; therefore, it can be implemented in real-time to calculate the moving averages sequence on-line. Design of this filter was specified in Chapter 5. The order of the moving averages filter is data-dependent. If there is a very large number of fluctuations in the performance signal during a short time, generally a higher-order filter will be required. Disadvantages of using this filter include the loss of data at the starting point and the effect of the extreme values of the data. However, the filter adequately satisfies our objective of analyzing macroscopic patterns. We use other techniques for other purposes, such as thresholding for locating outliers in the signal (discussed in section 6.5.5.1). If there are any macroscopic repetitions of dominant patterns (for instance, changes in utilization due to repeated communication calls) they will show up in the smoothed signal. The following example demonstrates the use of moving averages to identify and analyze macroscopic repetitive patterns. 6.5.2.2 Example: Linear Solver Program A linear solver program is a commonly used application program that solves a system of linear equations Ax=b by using Gaussian elimination (GE) on the coefficient matrix A, followed by backsubstitution (BS) to determine the solution vector x. In order to show an example of the identification of macroscopic repetitive patterns, we have selected the BS phase of a solver of a linear system of dimension 64, executed on an nCUBE-Z [76] multicomputer using 16 processors. 
The coefficient matrix was distributed using a (BLOCK,*) data distribution [36], in which each processor owns a block of rows of the matrix, and the matrix had already been converted to an upper-triangular matrix using GE. The space-time diagram for this phase is shown in Figure 6-13.

Figure 6-13. Space-time dynamics of the backsubstitution algorithm (ParaGraph); the patterns move toward the processors that own higher rows of the matrix.

There are several interesting patterns related to the spatial and temporal activity of the system during the BS phase. Due to the type of data distribution used, each processor owns a block of four rows. The BS algorithm calculates one solution per row, starting from the bottom row of the upper-triangular matrix. At each step, the computation moves to the next higher row until all the solutions have been calculated from all the rows. The space-time diagram shows a series of four distinct patterns due to message-passing in which one processor sends messages to all other processors owning higher rows of the matrix, starting from the processor that owns the bottom four rows (i.e., processor #15). This message-passing is needed to communicate the element of the solution vector calculated at one row to all the processors owning higher rows. We make the following comments on the patterns of this space-time diagram:

- The patterns are of a repetitive nature, with macroscopic patterns repeating fifteen times and each macroscopic pattern consisting of four similar "sub-patterns", due to each processor owning four rows of the matrix.
- Similar patterns shifting gradually toward the processor that owns the top four rows of the matrix (i.e., processor #0) depict the data and control flow of the program.
- These patterns are difficult to understand unless a user is knowledgeable about the program and familiar with the patterns typically presented in a specific display of a performance visualization tool, such as a space-time diagram.
- Useful displays in conventional visualization tools (such as space-time diagrams) are stretched to the limit of their scalability even with a moderate system size (16 processors in this case) and the immense communication activity involved.

To simplify the process of analyzing the repetitive temporal patterns, we can eliminate the spatial domain and focus on only the temporal domain. Ultimately, the spatial patterns also contribute to an understanding of the behavior of the program; performance images have been used for this purpose [93, 94, 121, 122, 124]. This type of focusing to support analysis is an important aspect of our multiple-domain analysis approach. The performance signal from BS is shown with its smoothed version in Figure 6-14.

Figure 6-14. Analysis of macroscopic patterns in the BS phase signal using a moving-averages filter (top: performance signal from BS on 16 processors; bottom: moving-averages sequence of order 20).
The performance signal shows the percent of the system in use throughout the execution of the program (at the instants of transitions, since this performance signal is obtained from the event domain). Even without processing this performance signal x(n), one can visually appreciate the presence of repetitive patterns, although localized "noise" (high variance) obscures the macroscopic patterns. A moving-averages filter of order 20 (arbitrarily selected) was applied to this performance signal, and y(n) is the resulting smoothed signal. The smoothing process attempts to "impose" stationarity on the performance signal (a time-series) and eliminates its rapid variations and outliers to emphasize global, average behavior. The following should be noticed in the smoothed signal:

- It shows most of the fifteen "global" (macroscopic) temporal repetitions of the message-passing steps that were indicated in the space-time diagram.
- The macroscopic repetitions are not very clear toward the end, because the number of processors involved in the computation progressively decreases as the algorithm proceeds toward the top rows of the matrix. Changes in the system utilization (as it is defined) are therefore relatively insignificant and do not show up explicitly in the smoothed signal, due to the averaging effect. Similarly, the sub-patterns due to the four repetitions at each stage lose their individual identity through averaging.
- A user does not have to be familiar with the performance signal to see the repetitive patterns shown by an x-y plot of the smoothed signal.
- The method scales with system and/or problem size due to its rigorous nature.

6.5.3 Estimation of Repetitive Patterns

We begin the analysis of temporally repeating patterns of parallel programs by using performance signals to represent the temporal activity of the system and program. Repetitive patterns are expected in a performance signal due to the presence of loops in the program. Actions within a loop cause variations in certain performance metrics (depending on the nature of the actions). These repetitive patterns can be observed as similar segments of the signal repeating over time, in response to each iteration of a loop. This periodic behavior can be observed using any conventional time-series plot available in a performance visualization tool (e.g., a space-time diagram). As discussed in Chapter 5, however, the periodicity is not automatically obvious in all cases. Therefore, we use performance signals to analyze these repetitive patterns by applying statistical DSP techniques. We have divided the analysis of repetitive patterns into two parts. The first part concerns the representation, identification, and analysis of macroscopic patterns in the whole signal representing the execution of one or more phases of the instrumented program. The second part concerns the precise estimation of the repetitive behavior of the program. The combination of these two steps can be applied hierarchically in most program performance analysis situations. Repetitive patterns are not always easily identified by the removal of variance using averaging; one disadvantage of moving averages, discussed in the previous section, is the loss of precise localized details of the performance signal. If the analysis of patterns is to be hierarchical [29], a user should be able to locate the macroscopic, average repetitive patterns using moving averages and then selectively analyze parts of the program activity more meticulously.
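For reference, the causal FIR moving-averages smoothing used in the example above can be sketched compactly. The filter actually used was designed in Chapter 5; the NumPy version below is an illustrative stand-in, with the order-20 case of Figure 6-14 shown as usage.

    import numpy as np

    def moving_average(x, order=20):
        """Causal FIR moving-averages filter: y(n) averages the current
        sample and the previous (order - 1) samples. The first (order - 1)
        outputs are based on fewer samples (the start-up loss noted above)."""
        x = np.asarray(x, dtype=float)
        kernel = np.ones(order) / order
        return np.convolve(x, kernel)[: len(x)]   # keep only the causal part

    # Usage: y = moving_average(x, order=20)   # smoothed signal as in Figure 6-14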
If we look at a small segment of the performance signal (a method termed windowing [84] in signal processing), there is still no guarantee that the repetitive patterns will not be hidden by large variance. Therefore, we need to further process this window of the performance signal to locate repetitive patterns (periodicities). A classical method used in statistical signal processing for this purpose is the estimate of the autocorrelation of the signal [12, 59, 62, 81, 82, 84, 126]. We explain this method and then provide two examples to establish its validity for the analysis of repetitive temporal patterns of a program.

6.5.3.1 Method: Autocorrelation

A widely used method to determine the repetitive (periodic) behavior of a stochastic signal (time-series) is to use an estimate of the autocorrelation of that signal [84]. The exact (statistical) autocorrelation is not trivially determined for an arbitrary stochastic system; therefore, temporal observations are used to estimate the autocorrelation. The unbiased estimate of the autocorrelation sequence of the performance signal was defined in Chapter 5 by (5.9) and will be used in the subsequent examples to identify the possible iterations of loops in the instrumented programs.

To validate the method, we instrumented the loops in the two programs discussed in the following examples with special markers (called trace-mark events [45]) so that we could temporally locate the start of an iteration at each processor of the distributed system (nCUBE-2). Whenever a processor enters the loop, the instrumentation system records a marker with the time-stamp and the number of the iteration (loop index) being performed by that processor, indicating the start of that iteration at that processor. We define and use an indicator sequence {i(n)} to temporally represent the trace-mark events. To obtain the indicator sequence, we look for the trace-mark records in the trace file; whenever one is found for a processor at a particular time-stamp n, we extract the loop index represented by that record.

Definition (Indicator Sequence). The indicator sequence is defined for time-stamps n of a traced program as:

    i(n) = { k,  for values of n having trace-mark events
           { 0,  otherwise                                        (6.1)

where k is an element of {1, 2, ..., K}, the loop index obtained from the trace-mark record, and K is the maximum possible loop index executed by any of the processors. This definition admits the possibility that more than one trace-mark record is generated for a particular time-stamp and a particular iteration (i.e., duplicate markers on different processors); in such a case, only one indicator is stored. Due to the distributed nature of the program (and especially if there are data-dependent iterations of the loop), it is possible that some processors finish an iteration before others. Given the definition of the indicator sequence, we thus get several points where the amplitude of i(n) is the same for several values of n. We call this phenomenon the "blurring effect" (note the visual appearance of its plots in the following examples, e.g., Figure 6-17); it indicates that different processors started the same iteration of the loop at different times. This lack of synchronization, due to the distributed nature of the system and the SPMD paradigm used, should be accounted for when analyzing the estimate of the autocorrelation of a performance signal to determine the temporal spacing between two consecutive iterations of a loop.
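A minimal sketch of the unbiased autocorrelation estimate follows. It assumes the standard unbiased estimator r(m) = (1/(N-m)) * sum over n of x(n)x(n+m), which is the usual form of the definition cited as (5.9) in Chapter 5; peaks in r(m) at nonzero lags m suggest candidate periods, i.e., the temporal spacing of loop iterations.

    import numpy as np

    def autocorr_unbiased(x, max_lag=None):
        """Unbiased estimate of the autocorrelation sequence of x:
        r[m] = (1 / (N - m)) * sum_{n=0}^{N-m-1} x[n] * x[n+m].
        (Subtract the mean from x first to study fluctuations about the
        average, if that matches the definition of (5.9) being used.)"""
        x = np.asarray(x, dtype=float)
        N = len(x)
        if max_lag is None:
            max_lag = N - 1
        r = np.empty(max_lag + 1)
        for m in range(max_lag + 1):
            r[m] = np.dot(x[: N - m], x[m:]) / (N - m)
        return r

    # Usage: r = autocorr_unbiased(window)   # window = a segment of x(n)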
This lack of synchronization due to the distributed nature of the system and SPMD paradigm used should be accounted for while analyzing the estimate of autocorrelation of a performance signal to determine the temporal spacing between two consecutive iterations of a loop. 138 Indicator sequences were extracted from the trace data of the programs discussed in the following examples to validate the estimated results. 6.5.3.2 Example: Pi Program This example consists of a simple non-nested loop and is presented here to merely serve as a means for illustrating the technique. A second example will be given from a real application program consisting of more complex nested structure of loops. Therefore, the techniques developed here will serve for understanding those cases. The value of pi (3.14 ...... ) can be approximated by evaluating the integrand 4.0/(1+x2) numerically, between the limits of x = 0 to x = 1. This problem was solved on sixteen nCUBE-2 processors, with each processor working on a small part of the limit. After a local summation is complete, all processors perform an exchange-add process, so that all end up with the final value of pi. The exchange-add steps proceed in logarithmic fashion, therefore, there are four exchange-add steps for a system consisting of sixteen processors. The program executes a loop in order to implement these exchange-add steps. The instrumented code (in C) is given in Figure 6-15. tracestaIlO; /" To set up the system for tracing and open communication channels ’/ traoenode(500000,0,1); /' To start tracing at all the nodes '/ i I 1; for (j-O; J