This is to certify that the dissertation entitled EVALUATING PERFORMANCE INFORMATION FOR MAPPING ALGORITHMS TO ADVANCED ARCHITECTURES presented by Nayda G. Santiago Santiago has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering at Michigan State University.

EVALUATING PERFORMANCE INFORMATION FOR MAPPING ALGORITHMS TO ADVANCED ARCHITECTURES

By

Nayda G. Santiago Santiago

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical and Computer Engineering

2003

ABSTRACT

EVALUATING PERFORMANCE INFORMATION FOR MAPPING ALGORITHMS TO ADVANCED ARCHITECTURES

By

Nayda G. Santiago Santiago

The development of efficient code for scientific and engineering applications on advanced computing systems is not a trivial task. To accomplish this task, a code developer must be concerned not only with algorithmic correctness and robustness, but also with performance and implementation details. These additional factors impose a burden on the typical scientific computing expert, preventing the user from effectively leveraging the computational resources available to the application. Two major factors make this task particularly difficult. First, the complex interactions between the target platform and the application software tend to hide information about the existing relations between different entities in the system. Second, the high dimensionality of the performance data conceals interesting patterns in the observations which could lead to insights into the system behavior. While a multiplicity of tools have been developed to address these problems, many obstacles still exist when characterizing the relations among high-level factors and low-level performance information. These problems not only make the task of efficient coding difficult, but also prevent the development of automated performance analysis tools to assist application programmers in tuning their code.

This dissertation proposes a new methodology for obtaining information about the relations that emerge when compute-intensive applications are mapped onto advanced architectures. The proposed methodology incorporates knowledge and techniques from multiple areas, including statistics, operational research, pattern recognition, data mining, and performance evaluation, to enable the extraction of performance information during the mapping process. The methodology is composed of four steps: problem analysis, design of experiments, data collection, and data analysis. In the first two steps, analyses of the application itself are completed to determine the appropriate design of experiments for establishing relations between changes in high-level abstractions and performance outcomes. Feature subset selection is proposed for identifying important system metrics. An evaluation of different statistical analysis alternatives was carried out to characterize the types of data obtained in performance studies.
Several interesting results emerged from the application of this methodology to a computational electromagnetics case study. First, a correlation analysis embedded in the proposed methodology revealed that software instrumentation metrics exhibit collinearity. This implies redundant information content in the data, limiting the set of statistical methods applicable to its analysis. Intrinsic dimensionality estimation and unsupervised feature subset selection identified the metrics containing the most performance information. On average, only 18% of the metrics were found to be important. Other results include the identification of equivalency among multiple compiler options, reducing the actual set of options necessary at compile time. A categorization of these options was also obtained according to their effect on application execution time. In summary, the application of the proposed methodology reveals that a detailed problem study, preceding a systematic design of experiments, yields useful data on which appropriate statistical tools can provide unbiased information about the application-system interactions. Moreover, the information obtained from this methodology can be converted into appropriate suggestions, observations, and guidelines for the scientific computing expert to tune applications to a particular computing system.

Copyright © by Nayda G. Santiago Santiago 2003

To my family: Diana Alexandra, Victor Manuel, and Manuel. To my parents Héctor and Icsida, and to my sisters Yaira, Damaris, and Betzaida.

ACKNOWLEDGMENTS

I have been at Michigan State University for many years, enough to get adjusted and learn to appreciate and enjoy living in Michigan. There have been many people who have made this transition process much more enjoyable.

First of all, I would like to express my gratitude to Diane T. Rover. I am completing this degree because of her and her constant encouragement and support. She has been my strongest supporter all these years. She was always finding ways to motivate me and alternatives to solve the problems along the way, and she has been both advisor and friend. I have learned so much from her. I still cannot figure out how she has so much energy and how she always manages to have time for everything.

I would like to thank my committee members John R. Deller, Jr., Michael Frazier, Richard Enbody, and Domingo Rodriguez for their time and effort in reviewing this document and for their insights in the development of this research work. I would especially like to thank Michael Frazier. I wish I had his ability to convey information to students; I would be well off if I were even half as good a professor as he is. Also, former committee member Robert Nowak provided a lot of guidance while he was a professor at Michigan State University. Shawn Hunt represented Domingo Rodriguez in the dissertation defense and provided useful comments on the dissertation.

Domingo Rodriguez deserves special thanks. He has been a friend and mentor for many years and an advisor for the last part of my dissertation. He took me on as his graduate student and provided resources, energy, and motivation for my research. He has also been my support and friend when things were not going right. Thanks Domingo, from the bottom of my heart.

Leo Kempel's assistance has been very important in the completion of my dissertation. He has provided all the resources and code for my experiments. He was always willing to help whenever we needed something or when we just needed an explanation.
The people who worked at the Scalable Computing Systems Lab were always my friends and partners, and they deserve my appreciation and gratitude. These are Ken Wright, Sandeep Rao, Srinivas Kanamata, Timo Vogt, Sharad Kumar, and Vijay Kesavan. Vijay has been more than a partner; he has been my sounding board and my soul mate. I am profoundly grateful to you for always being there for me. I thank Jeff Meese for providing all the time and effort to keep the system working and for installing the software for my experiments. Kennie J. Cruz, Pablo J. Rebollo, Iomar Vargas, and Ivan David have provided the technical assistance to keep the system working at the University of Puerto Rico. They have worked extra hours to assist me in anything they could do for me.

I would like to thank the ECE Department staff for all their dedication. In particular I thank Marylin Shriver, the former graduate secretary, who has always been friendly and helpful to me, and Vanessa Mitchner for assisting me many times with paperwork. I would like to thank Barbara O'Kelly and Percy Pierre, from the Sloan Engineering Program at MSU, for all their assistance all these years. This work was supported by the following grants: NSF BIA-9700732, NSF ACI-9624149, and NSF BIA-9977071. I would like to thank Susan Kingston and Don Gunning from Intel for their assistance with KAP/Pro, and also Dr. William Kent of Mission Research Corporation for his support.

My friends have been my moral supporters all along. I thank Ziad Youssfi, Maria de los Angeles Torres, Andrés Diaz, Brenda Ortiz, Oscar Hernandez, Ron Wright, Freddy Pérez, Amarilis Cuaresma, Hilaura Nava, Daniel Burbano, Judy Rosado, and Gihan Mandour. Ziad Youssfi is an unconditional friend and a wonderful human being. Hilaura has given me the strength I needed when I was in trouble. Gihan deserves special thanks. She has laughed and cried with me all along and is my soul mate. She is one of the special friends who has been there for me, always, no matter what. Thanks!

I would like to thank the Congregation of the Sisters of Charity of Cardinal Sancha (Hermanas de la Caridad del Cardenal Sancha, HCCS) at Santo Domingo, Dominican Republic, for their constant prayers. Their prayers kept my faith strong along the way.

I want to thank my family. My sisters, Bechi, Damaris, and Yari (Sor Yaira), have always prayed for me, given me encouragement, and assisted me with all they could do. My mom, Icsida Santiago, is my role model and one of the strongest women I have ever known, not only in character but also in faith in God. My dad, Héctor Santiago, is one of the nicest people in this world. My husband, Manuel A. Jiménez, has been there with me all along and supported me 100%. Finally, my children Victor Manuel and Diana Alexandra are my motivation and joy. This dissertation is dedicated to you all, since you are my strength and motivation in life.

I want to thank God. Without him, nothing is possible.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 A Methodology for Evaluating Performance Information
  1.4 Contributions
  1.5 Dissertation Overview
2 Related Work
  2.1 Introduction
  2.2 Relating Performance Information and High-Level Abstractions
  2.3 Using Statistical Analysis on Performance Data
    2.3.1 Statistical Analysis of Algorithms and Heuristics
    2.3.2 Scalability Analysis using Factorial Designs
    2.3.3 Statistical Analysis of Memory Hierarchy
  2.4 Multivariate Methods for Performance Data Analysis
  2.5 Automatic Performance Evaluation
  2.6 Summary
3 Proposed Methodology
  3.1 Introduction
  3.2 Preliminary Problem Analysis
  3.3 Experiment Specification
  3.4 Data Collection
  3.5 Data Analysis
  3.6 Summary
4 Preliminary Problem Analysis
  4.1 Introduction
  4.2 Problem and System Definition
    4.2.1 Finite Element Method in Electromagnetics
    4.2.2 Observable Computing System
  4.3 Current Situation Assessment
  4.4 Evaluation of Alternatives
  4.5 Summary
5 Specifications for the Experiment
  5.1 Introduction
  5.2 Performance Characterization Experiments
  5.3 Design of Experiment
  5.4 Detailed Description of the Experiment
    5.4.1 Experiment 1: Parallel implementation of Prism
    5.4.2 Experiment 2: Serial implementation of Prism
    5.4.3 Experiment 3: Inefficient memory access pattern in Prism, validation experiment
    5.4.4 Experiment 4: Matrix-vector multiplication, validation experiment
  5.5 Summary
6 Data Collection
  6.1 Introduction
  6.2 Tools
    6.2.1 Software Instrumentation
    6.2.2 Operating System Metrics
    6.2.3 Output Format
  6.3 Summary
7 Data Analysis
  7.1 Introduction
  7.2 Statistical Models for Performance Analysis
  7.3 Measuring Relationships in Multidimensional Data
    7.3.1 Formatting Data for Statistical Methods
    7.3.2 Preprocessing
    7.3.3 Correlation Analysis
    7.3.4 Multidimensional Metric Subset Selection
    7.3.5 ANOVA
  7.4 Summary
8 Results
  8.1 Experiment 1: Parallel Implementation of Prism
    8.1.1 Correlation Analysis
    8.1.2 ANOVA
    8.1.3 Dimensionality
    8.1.4 Metric Selection
    8.1.5 ANOVA
    8.1.6 Another method for subset selection
  8.2 Experiment 2: Serial Implementation of Prism
    8.2.1 Correlation Analysis
    8.2.2 ANOVA
    8.2.3 Dimensionality
    8.2.4 Metric Selection
  8.3 Experiment 3: Inefficient Memory Access Pattern Algorithm
    8.3.1 Correlation Analysis
    8.3.2 ANOVA
    8.3.3 Dimensionality
    8.3.4 Metric Selection
  8.4 Experiment 4: Matrix-Vector Multiplication Tests
    8.4.1 Correlation Analysis
    8.4.2 ANOVA
    8.4.3 Dimensionality
    8.4.4 Metric Selection
  8.5 Analysis of Results
  8.6 Scientific Programmer Actions
  8.7 Summary
9 Conclusion
  9.1 Research Summary
  9.2 Contributions
    9.2.1 A Methodology for Obtaining Relevant Performance Information
    9.2.2 The Use of Design of Experiments for Performance Analysis Experimentation
    9.2.3 The Usage of Data Reduction and Statistical Analysis
  9.3 Validation
  9.4 Conclusions
  9.5 Future Work
A Foundations of Computational Science and Engineering
  A.1 Mathematical Preliminaries
    A.1.1 Other Terms
  A.2 Application
    A.2.1 Finite Elements Analysis
    A.2.2 Iterative Solvers
    A.2.3 Matrix-Vector Multiplication
  A.3 Advanced Architectures
  A.4 Languages and Environments
    A.4.1 Shared Memory
    A.4.2 Message Passing
    A.4.3 Problem Solving Environments
  A.5 Performance Measurement
    A.5.1 Tools
    A.5.2 Statistical Terms
  A.6 Summary
B Glossary
C Matrix-Vector Multiplication Algorithms
  C.1 Algorithm A
  C.2 Algorithm B
  C.3 Algorithm C
  C.4 Algorithm D
  C.5 Algorithm E
  C.6 Algorithm F
  C.7 Algorithm G
D Experiment 1
  D.1 Order of Execution of Experimental Runs for Experiment 1
  D.2 ANOVA on the metrics obtained in Experiment 1
E Experiment 2
  E.1 Order of Execution of Experimental Runs for Experiment 2
  E.2 ANOVA on the metrics obtained in Experiment 2
F Experiment 3
  F.1 Order of Execution of Experimental Runs for Experiment 3
  F.2 ANOVA on the metrics obtained in Experiment 3
G Experiment 4
  G.1 Order of Execution of Experimental Runs for Experiment 4
  G.2 ANOVA on the metrics obtained in Experiment 4
H Additional Fortran files
  H.1 Program to test new routines
I Matlab Files
  I.1 Program to compute order of experimental runs
  I.2 Routine to compute entropy cost function
  I.3 Routine to show scree test and the Kaiser-Guttman criteria
  I.4 Program to validate intrinsic dimensionality estimators
J Perl Script files
  J.1 Script A: Generating Summary of Metrics
  J.2 Script B: Create Crontab file
  J.3 Script C: Convert data to minitab 13 format
  J.4 Script D: Convert data to SAS format
BIBLIOGRAPHY

LIST OF TABLES

2.1 OpenMP Metrics
5.1 Compiler Options in Experiment 1
5.2 Compiler Options in Experiment 2
5.3 Compiler Options in Experiment 3
6.1 Metrics obtained from the SAR command
6.2 Metrics obtained from the IOSTAT command
6.3 Metrics obtained from the VMSTAT command
8.1 Metrics with largest correlation with execution time in experiment 1
8.2 Effect of factors and interactions on the most correlated metrics with execution time for experiment 1
8.3 Number of metrics to keep variability of the current data according to three different criteria for experiment 1
8.4 Metrics with highest information content in experiment 1
8.5 ANOVA on the metrics shown in table 8.4. Main effects
8.6 Metrics with highest information content selected by SVD for experiment 1
8.7 ANOVA on the metrics shown in table 8.6. Main effects
8.8 Metrics with largest correlation with execution time for experiment 2
8.9 Effect of factors and interactions on the most correlated metrics with execution time in experiment 2
8.10 Number of metrics to keep variability of data according to three different criteria in experiment 2
8.11 Metrics with highest information content in experiment 2
8.12 ANOVA on the metrics shown in table 8.11. Main effects
8.13 Most important metrics for experiment 2 according to SVD
8.14 ANOVA on the metrics shown in table 8.13. Main effects
8.15 Metrics with largest correlation with execution time for experiment 3
8.16 Effect of factors and interactions on the most correlated metrics with execution time for experiment 3
8.17 Estimate of the intrinsic dimension of this data set
8.18 Metrics with highest information content in experiment 3
8.19 ANOVA on the metrics shown in table 8.18. Main effects
8.20 Most important metrics for experiment 3 according to SVD
8.21 ANOVA on the metrics shown in table 8.20. Main effects
8.22 Metrics with largest correlation with execution time
8.23 Effect of factors and interactions on the most correlated metrics with execution time for experiment 3
8.24 Estimate of the intrinsic dimension in experiment 4
8.25 Metrics with highest information content for experiment 4
8.26 ANOVA on the metrics shown in table 8.18. Main effects
8.27 Most important metrics for experiment 4 according to SVD
8.28 ANOVA on the metrics shown in table 8.27. Main effects
8.29 Percentage of metrics kept for the analysis
A.1 Order of experiments for a fully randomized experiment
D.1 Order of execution of experiments
D.2 ANOVA
E.1 Order of execution of experiments
E.2 ANOVA
F.1 Order of execution of experiments
F.2 ANOVA
G.1 Order of execution of experiments
G.2 ANOVA - main factors effect in experiment 4
G.3 ANOVA - two term interaction effect in experiment 4
G.4 ANOVA - three and four term interaction effect in experiment 4

LIST OF FIGURES

1.1 Typical analysis flow for tuning an application
1.2 Integrative performance analysis
1.3 Proposed approach for application tuning
1.4 Proposed methodology
3.1 Proposed methodology to extract information in an OCS
3.2 Model of an experiment
3.3 Feature Subset Selection Scheme
3.4 The combination of feature selection and feature extraction for performance data analysis
4.1 Preliminary Problem Analysis is the first step in the proposed methodology
4.2 Some representative finite elements
5.1 Design of Experiment step in the methodology
5.2 Compiler operating on selected software codes
5.3 Linker and loader operating on compiled codes
5.4 Example of one block in our split-split plot design
6.1 Data Collection step integrated with the methodology
6.2 Stages in the program mapping process [1]
6.3 Collinearity problem: Those metrics obtained by the operating system may come from the same groups of variables
7.1 Data Analysis is the last step in the proposed methodology
7.2 Performance Data Analysis Architecture
7.3 Graphical View of a Discrete-Time Continuous Value Stochastic Process
7.4 Example of a matrix format used for the performance data
7.5 Two principal components of the validation data - no normalization
7.6 Two principal components of the validation data - Min-Max normalization
7.7 Two principal components of the validation data - Euclidean normalization
7.8 Visual display of the correlation matrix of the data obtained from the validation experiment matrix-vector multiplication
7.9 Feature subset selection
7.10 Classification scheme of feature selection measures [2]
8.1 Eigenvalues of correlation matrix in experiment 1
8.2 Eigenvalues of correlation matrix for synthetic data
9.1 Typical analysis flow for tuning an application
9.2 Proposed approach for application tuning. The dashed line shows the part of this tuning methodology addressed by this research
9.3 Proposed methodology to extract information in an observable computing system (OCS)
9.4 Summary of statistical analysis techniques used for extracting information about performance outcomes
A.1 Venn diagram of mathematical signals
A.2 Experiment illustrating execution time of two simple comparative studies
A.3 Execution time when Machine B is used in the study

CHAPTER 1

Introduction

1.1 Motivation

Finding a suitable, high-performance, computer-based solution to a real-world problem is a complex process. The number of different possibilities for programming style, algorithms, parameters, operating system environment variables, compiler and flags, and architecture, among others, creates a set of entangled interactions. A clear understanding of these interactions and of how they relate will assist the programmer in the decision-making process.

The process of solving a real-world problem is composed of two major steps: conceptualization and instantiation. Conceptualization is the process of developing a new idea to solve a problem. Instantiation is the action of describing the idea as a series of steps to solve the problem. There are different levels of instantiation [3], from the highest level of abstraction to the most detailed solution of the problem, where all parameters have been defined. An algorithm is a well-defined procedure to solve a problem in a finite number of steps. In this context, an instantiation can be expressed as a collection of algorithms seeking the solution to a problem. An implementation is defined in this work as an instantiation where all parameters and algorithms have been determined.

For a person solving a real-world problem, many different criteria might be considered to measure success in a given implementation. Some might consider robustness, usability, or speed as criteria for measuring how well suited the implementation is. Robustness refers here to the capability of software to properly react to unusual requirements [4]. Usability is related to the characteristics of software that make it easy to learn, efficient to use, easy to remember, error tolerant, and pleasant to use [5]. The most common measure used is speed: the faster the algorithm, the better the performance.

Different factors affect the computer performance of an implementation. For instance, speed is determined by a series of factors such as programming style, language, compiler options, and architecture, and these are selected by the application programmer as part of the implementation process. Application programmers are usually experts in one area. For example, application developers in signal processing are proficient in applying mathematical concepts to solve their problems. Their level of expertise is usually concentrated in one of the levels of instantiation, typically the highest level of abstraction. This leads to the selection of alternatives without a complete understanding of the relationship between each of the factors and the obtained performance.

Mapping refers in this work to the relation between a language of a high-level abstraction and a language of a concrete architecture [6]. It is still unknown, even for experts in the area of performance, what the relationships are among the different parts of the mapping process.
This is due to the vast number of platforms, compilers, compiler options, algorithms, and programming styles associated with a particular implementation. With the advent of advanced computer architectures with parallel units or parallel organization, we add to this list different programming paradigms for parallel processing.

This dissertation addresses the problem of obtaining information about relationships between various factors and the computer performance of an implementation. It introduces a statistics-based methodology to bridge the gap between high-level abstractions and low-level implementation information. A case study in the area of computational electromagnetics illustrates the formulation of real-world problems of large-scale modelling of physical systems. This methodology uses an empirical analysis and a statistical approach to understand how different computer performance metrics at different levels are affected by the selection of parameters in the implementation process.

1.2 Problem Statement

The performance obtained when intensive applications are mapped to a computing platform is highly dependent on how well adapted the application is to the platform. However, the tuning process used in most of today's applications still leaves room for improvement. The information required to establish existing relations among high-level factors and low-level performance data is not easily obtained due to the complexity of the system. This contributes to the difficulty experienced by scientific programmers in obtaining acceptable performance on advanced systems. This is the main problem addressed by this dissertation. A number of issues contribute to this problem.

First, the mapping process is not one to one. Source code lines get optimized by the compiler and linked to advanced libraries in a way in which executable code does not correspond directly to source code. Also, the order of execution on the actual system is rearranged during run time, making it difficult to associate performance costs with specific code or segments of instructions. Moreover, communication patterns among processes might be affected by unpredicted asynchronous situations in the system.

Second, performance analysis incorporates the application programmer's insight into the tuning process, which prevents automatic performance evaluation. This is illustrated in Figure 1.1. The application programmer needs to understand instrumentation, learn the appropriate tools, and interpret the data and its relation to the code in order to optimize the code for a particular system. This method is complex and prone to wrong interpretations [7]. Also, important performance information might be overlooked, hidden by the large amounts of performance data collected by instrumentation systems. Moreover, as architectures increase in complexity and larger problems are solved, the performance analyst will require greater experience or expertise, which only a select group of people might have. Finally, current performance analysis tools are not necessarily portable, and scientific application programmers do not find them intuitive or appealing [8].
Figure 1.1. Typical analysis flow for tuning an application.

1.3 A Methodology for Evaluating Performance Information

The proposed solution is based on the integration of theories and methodologies related to different aspects of the performance analysis problem posed in Section 1.2. We have borrowed ideas from other disciplines to find a solution to the problem. The use of an integrative approach for performance analysis is proposed to combine information at different levels and present it to the scientific programmer in a meaningful form. Figure 1.2 shows a description of this environment. Performance data analysis is integral to this approach. The traditional formulation for performance tuning, shown in Figure 1.1, should be modified to satisfy the scientific programmer's needs. We suggest the tuning methodology presented in Figure 1.3.

Figure 1.2. Integrative performance analysis.

This dissertation proposes a methodology which integrates four main components. These are systematically applied to an observable computing system to extract relevant information to assist scientific programmers in tuning applications to advanced architectures. The four steps are problem analysis, design of experiments, data collection, and data analysis, as illustrated in Figure 1.4. A preliminary problem analysis is used to visualize what is affecting performance and to gather preliminary information. Screening experiments are used to establish which factors most affect performance and to select a subset of factors for experimentation.

Design of experiments is used to collect appropriate information from the smallest number of experimental runs. There are a large number of experimentation strategies from which we can select the most appropriate one based on the characteristics of the system and software.

The third step is data collection. This is determined by the particular system, language, and instrumentation tools used. Data are collected at runtime and analyzed post mortem.

Figure 1.3. Proposed approach for application tuning.

Figure 1.4. Proposed methodology.

Data analysis begins by extracting the data to an appropriate matrix format. The performance data matrix columns represent metrics and each row represents one experimental run. Dimension normalization is applied to this matrix. The correlation matrix is computed to determine which metrics are linearly related to execution time. Then we proceed by extracting relevant metrics to analyze. Multivariate statistical methods are used to extract this information.
Intrinsic dimensionality estimators are used to estimate how many metrics explain the variability of the data. Feature subset selection methods are used to extract the most important metrics for the analysis. ANOVA is used to test the hypothesis that no factors are affecting performance. When this hypothesis is rejected, post hoc comparisons and analysis of means are used to determine which factors are affecting relevant metrics.

Thesis Statement: Design of experiments, instrumentation, dimensionality estimation, feature subset selection, and ANOVA can be systematically combined to obtain information relevant to performance analysis when mapping algorithms to advanced architectures. The use of these techniques will assist in locating, in an unbiased manner, sources of performance improvement.

1.4 Contributions

The contributions of this work are as follows.

Our first contribution is a systematic methodology to obtain information on the existing relations when mapping compute-intensive applications to advanced architectures. This methodology is composed of four steps: problem analysis, design of experiments, data collection, and data analysis.

Second, we have identified the need for screening experiments to limit the number of factors when experimentation is used. The use of a large number of factors in the experimentation phase can be infeasible in terms of time and resources for real applications and advanced computing systems.

Third, we have defined a performance characterization experiment (PCE) as the procedure of selecting a software code in a given computer language, applying a parallelizing compiler with an ordered set of directives, running the code on a target machine, and retrieving a well-defined set of performance parameters.

We also establish that design of experiments (DOE) is necessary for establishing causal relations between high-level factors and low-level performance information [9]. If we do not use DOE, only correlational relations can be established. When design of experiments is used, the performance analyst does not require extensive knowledge about the code in order to obtain information on the relations. With the traditional tuning methodology, the performance analyst must incorporate previous knowledge and experience into the process of tuning an application.

The measurements obtained from performance instrumentation vary largely in scale. We have identified the need for preprocessing before some of the statistical methods can be applied. We examined three different types of preprocessing schemes: log normalization, min-max normalization, and dimension normalization. Dimension normalization proved to be the most appropriate one for our type of data. In addition, correlation analysis identified those metrics most linearly related to execution time and revealed collinearity in measurements obtained through software instrumentation.

Multidimensional data analysis methods were identified as appropriate tools for extracting relevant information content in performance data. Sequential forward search was used to identify those metrics most important for the performance evaluation process. Entropy was used as a measure of information content in the response data set. Evaluation of three intrinsic dimensionality estimators - the scree test, Kaiser-Guttman, and the cumulative percentage of total variance - revealed that even though all produce similar results, the scree test is not appropriate for automated performance evaluation, since it requires the visual evaluation of a graph.
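To make the distinction between the three estimators concrete, the sketch below shows how the two automatable criteria reduce to simple thresholds on the eigenvalue spectrum of the metric correlation matrix, while the scree test only produces a plot. This is an illustrative MATLAB sketch, not the routine of Appendix I.3: the matrix X (experimental runs in rows, metrics in columns), the function name, and the 90% variance threshold are assumptions made for this example only.

    % Illustrative sketch: three intrinsic dimensionality criteria applied to a
    % runs-by-metrics performance data matrix X.  X and the 90% threshold are
    % assumptions made for this example.
    function k = dimensionality_criteria(X)
        R = corrcoef(X);                      % correlation matrix of the metrics
        lambda = sort(eig(R), 'descend');     % eigenvalues, largest first

        % Kaiser-Guttman criterion: keep components with eigenvalue > 1.
        k_kaiser = sum(lambda > 1);

        % Cumulative percentage of total variance: smallest k explaining >= 90%.
        pct = cumsum(lambda) / sum(lambda);
        k_variance = find(pct >= 0.90, 1);

        % Scree test: only a plot is produced; a person must locate the "elbow",
        % which is why this criterion does not suit automated analysis.
        plot(lambda, '-o');
        xlabel('Component'); ylabel('Eigenvalue'); title('Scree plot');

        fprintf('Kaiser-Guttman: %d   Cumulative variance (90%%): %d\n', ...
                k_kaiser, k_variance);
        k = k_variance;
    end

Because the Kaiser-Guttman and cumulative-variance rules are plain numeric thresholds, they can run unattended; the scree plot still needs a human to locate the elbow, which is the limitation noted above.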
Finally, analysis of variance and post hoc comparisons were used to describe which factors, if any, are affecting individual performance metrics, and which ones are statistically similar or different. From these results we learned that compiler directives might be grouped into similar categories where their effects are statistically indistinguishable.

1.5 Dissertation Overview

This dissertation is organized into nine chapters. Chapter 1 presents the motivation and objective of this research work. Chapter 2 gives a review of the current status of research. Chapter 3 presents an overview of the proposed methodology used to relate high-level abstractions to low-level performance information. Chapters 4 to 7 expand on the methodology, giving details on the purpose of each of its steps. Finally, Chapters 8 and 9 present results and conclusions.

CHAPTER 2

Related Work

2.1 Introduction

A diversity of methods has been proposed to reach the goal of automated performance analysis. Some work has been done on relating performance analysis to high-level abstractions [8, 10], on the use of statistical methods for performance data analysis [3, 11, 12], and on the use of multidimensional data analysis for studying performance data [13, 14, 15]. Moreover, the APART group is working towards the advancement of automated performance tools [16]. However, we know of no other research working on the integration of all these aspects into a coherent and general methodology for extracting information on the existing relations between performance information obtained at the lowest level and the highest level of abstraction.

In the following sections we present different approaches related to the topics of this dissertation. Section 2.2 presents different approaches for relating low-level performance information to high-level abstractions. Section 2.3 discusses how statistical methods have played an important role in obtaining unbiased information about the performance of a system. In Section 2.4 we present multivariate analysis methods used for performance data analysis. Finally, in Section 2.5, we present the collective work of a group of researchers working in the area of automatic performance evaluation tools: APART.

2.2 Relating Performance Information and High-Level Abstractions

Early work on performance analysis tools proposed the types of information to be collected at different instrumentation levels. Irvin and Miller [10, 17, 18] proposed a framework called the NV model (noun-verb model) for the identification of fundamental information to be collected by performance tools in order to correlate high-level abstractions with low-level performance information. In their work, a noun is an element from which a measurement is taken and a verb is an action taken by or on the noun. A level of abstraction is then defined in terms of the collection of nouns and verbs associated with a specific point in a mapping process. Irvin and Miller defined four different levels of abstraction: source code, runtime library, operating system, and hardware. The relationship between nouns and verbs at one level of abstraction and those at another is known as a mapping in the NV model. Those mappings are classified as static or dynamic. Static mappings occur prior to runtime while dynamic mappings occur during runtime. The NV model was implemented in Paradyn [10] with CM-Fortran.
This work is similar to our research in the goal of correlating performance information across levels of abstraction. Moreover, we have adopted their definition of levels of abstraction in our work. However, the NV framework described by Irvin and Miller is meant to be used by tool developers to relate information across levels, while our methodology aims to aid scientific programmers in finding relations across levels of abstraction with existing tools, regardless of their use of the NV model.

Another approach to relating performance information to high-level constructs was used by Mellor-Crummey et al. in [8]. They have correlated low-level information to source code by creating a tool called HPCView. In their work, the authors identified the main reasons for the lack of user support for performance evaluation. In general, they claim that performance tools do not improve the productivity of codes for three main reasons: usability, scope of metrics, and appropriate assignment of data to source. First, the lack of usability of existing tools comes from the absence of both language and architecture portability and from the need for user intervention for instrumentation. Second, the scope of performance metrics needs to be expanded by presenting collective information, and this information should be presented with respect to relevant parts of the code. Finally, assigning performance data to source code implies the correct assignment, after compiler optimizations, of performance costs to source information.

HPCView is a toolkit designed to correlate performance information with source code. It has been implemented for the following platforms: Alphas running Tru64, IA-32 machines running Linux, IA-64 machines running Linux, SGI systems running IRIX64, and Sun SPARC machines running SunOS. HPCView takes profiling data collected by platform-dependent profilers and combines it with an estimate of the program structure obtained from a tool called bloop, included in this toolkit. This information is then used to produce a hyperlinked database viewable from any web browser. Basically, HPCView requires a configuration file containing the paths to source code files, a set of performance metrics obtained from the system, and a set of parameters to configure the display. It produces HTML and JavaScript files which can be read by any web browser to produce an interactive display that can be used by the programmer to identify metric-source code correlations. Moreover, derived metrics can be computed by HPCView by means of MathML expressions suggested by the programmer/analyst.

There are two main disadvantages of HPCView. First, the tool currently relies on system-dependent profiling tools, which may not provide accurate performance information. Second, the accuracy of bloop depends on the compiler used, since mapping information is collected from the associations of the symbol table generated by the compiler.

As we have stated previously, our goal is to correlate performance information with high-level constructs, which was achieved by the HPCView toolkit. Therefore we can state that the two research efforts are complementary. Our methodology is system and tool independent while theirs is available for certain platforms only. It would be interesting to use HPCView to verify results obtained by our methodology, something not done previously since the tool had not been ported to Sun machines at the time the experiments for this study were carried out. Another basic
Another basic 11 difference is the use of user’s intuition in selecting which metrics are going to be displayed in HPCView. Our methodology points out to some metrics of interest for the user to pay attention to them. 2.3 Using Statistical Analysis on Performance Data Statistical analysis has been used in the past for analyzing certain aspects of performance analysis such as execution time, memory performance, and scalability. We will describe these in the following sections. 2.3.1 Statistical Analysis of Algorithms and Heuristics The most common approach to compare algorithms in literature is to compare times pub- lished in literature with the best time obtained from a new algorithm. However, the actual running time of a coded algorithm is affected by the machine, compiler, language, program- ming style, and workload, among different factors. To fairly compare two algorithms, Coffin and Saltzman [3] suggest statistical analysis of algorithms. This will show the relationships between the problem and the algorithm. According to Coffin and Saltzman, there are basically two different approaches to study algorithms: theoretical analysis and empirical analysis. In theoretical analysis, an analysis previous to the implementation is performed based on the parameters of the problem. In empirical analysis, the actual time is evaluated by implementing it in computer code. In our case, we will use empirical analysis. There are very important results from Coffin and Saltzman’s studies. One of their most relevant conclusions is that statistical evaluations can provide surprising results or conclu- sions different from superficial evaluation of results. A general procedure is suggested for comparing algorithms and making recommendations of which one to use. There are basi- cally three steps in the general procedure. First, the data collection is done. Here a careful experiment design must be done. Some possible design approaches are: completely random- ized, randomized block, factorial, and fractional factorial. The second step is exploratory 12 data analysis. This analysis is done graphically to identify possible patterns or trends in the data. Finally, the last step is formal statistical analysis. Here some basic methods such as hypothesis testing, parameter estimation, and confidence interval calculations are performed. There are some statistical considerations for adequately comparing algorithms. The experimental design is important for the analysis and reproducibility of results. The model and analysis done on the data will depend on whether exploratory analysis or confirmatory analysis is done. In exploratory data analysis, data is observed graphically to visualize trends. In confirmatory data analysis, a model is preconceived and a qualitative analysis is done to confirm or reject the model. Another important consideration is to identify the experimental unit in order to analyze the data in terms of the unit. An experimental unit is the unit to which a treatment is applied. Another consideration is the sample size. Not enough data will have low power and a confidence interval too wide. Too many observations will reject any hypothesis and will lead to every factor being important. In their analysis Coffin and Saltzman concluded, after analyzing several examples, that running times are often nonnormal and they exhibit heteroskedasticity (nonconstant vari- ance). Therefore the analysis performed should be robust to nonnormality or nonconstant variance. 
Analysis of variance methods are robust in this sense.

2.3.2 Scalability Analysis using Factorial Designs

The work of Alabdulkareem et al. is close to our research [11]. In this work, the scalability of large codes is studied from the perspective of experimental design. Similar to our work, they use experimental design and ANOVA to study parallel codes from what they call a "black-box" perspective. The main difference between their study and ours is that they concentrate only on scalability issues while we want an overview of the state of the system in general. In their study, fractional factorial designs were used to control the number of experiments, with large numbers of factors, each with two levels. In contrast, we have limited the number of experiments by designing screening experiments and using the insight of the programmer to select a few important factors for experimentation. Therefore, in our case, we have used full factorial designs with fewer factors and more than two levels per factor. In their work, measurements of execution time are used to estimate the scalability of the system.

Some of their conclusions are applicable to our work. First, knowledge of the code is necessary for the selection of factors, since the results obtained in the study depend on that selection; however, extensive knowledge is not required. Also, a method for limiting the number of experiments is required due to the time-consuming task of experimenting with intrinsically long-running codes. Their results demonstrate that the Cray J90 exhibits better scalability than the IBM SP2 for the application code they are using. This application is the weather prediction code ARPS (Advanced Regional Prediction System) from the Center for Analysis and Prediction of Storms (CAPS) of the University of Oklahoma. Two main routines were identified by their method as having a large effect on the scalability of the ARPS code.

2.3.3 Statistical Analysis of Memory Hierarchy

Sun et al. have proposed the use of multivariate regression, factorial design, contrast and post hoc comparisons, and ANOVA for the analysis of hierarchical memories [12, 19, 20]. In this research, a four-level methodology was developed for the evaluation of memory hierarchies on single-processor performance. The dependent variables used to assess memory performance are cpi (cycles per instruction) and cache hits; the methodology therefore assumes that the system under study has a means of providing cycle counts, instruction counts, and cache hit ratios.

The methodology is composed of four steps: main effect study, code/machine classification, scalability comparison, and memory hierarchy study [20]. The main effect study examines the effect of code and machine on the variable cpi. ANOVA is used on a two-level full factorial design to determine if there is a significant effect of machine or code on the cycles per instruction on the system. The second step is code/machine classification. If there is a significant effect of the code or machine, a post hoc comparison can be made, where significant differences among means are studied to classify codes or machines into similar statistical groups. The least significant difference (LSD) post hoc method was used in this study. Third, a scalability comparison is done using regression analysis. Here problem size and machine are studied versus cpi.
Finally, the last level of analysis uses cache hits to locate which memory components are causing the variations found in the previous three levels of analysis. This last step depends on the kind of measurements available on a particular system while the other three levels are independent of the system. This study was done on two SGI systems, an Origin 2000 and a Power Challenge, both with the same processor but with different memory hierarchies. Results obtained by Sun et al. show that the Origin 2000 has better scalability for the types of codes used in the study.

Like this study, our research uses ANOVA and design of experiments to study performance. However, we do not concentrate our work on only one metric and we do not propose a multilevel analysis. While Sun et al. use a full factorial fully randomized design, we are using a full factorial split-split plot design. Our work also concentrates on overall metrics on a multiprocessor system, in contrast to their work on single-processor analysis.

2.4 Multivariate Methods for Performance Data Analysis

There are several works in the area of reduction of multidimensional performance evaluation data. Early work by Nickolayev et al. [13] demonstrates that the use of statistical data clustering techniques on performance trace data is useful for the reduction of large volumes of data while keeping important system behavior. They use dynamic clustering to select a subset of traces for representing trace behavior on the system. Both clustering and entropy-based feature subset selection are classified as unsupervised feature classification methods. They also use normalization as one data preprocessing technique on the data. However, two important differences exist: the dimension along which the reduction is done and its associated cost function. In Nickolayev's work, a subset of traces is selected as representative, reducing the trace space while leaving the dimension of performance metrics intact. In our work, on the other hand, we reduce along the metric space, reducing the number of metrics showing important information. Another difference is the cost function used. While we use entropy as the cost function, their work bases selection on Euclidean distance, which is more appropriate for the goal of trace reduction.

In their work [14], Vetter and Reed used statistical projection pursuit, a multidimensional projection technique, to identify "interesting" performance metrics from a monitoring system. This work aims to reduce the number of metrics and the dimensionality of the data. They do this reduction dynamically by periodically using projection pursuit to identify which metrics are important. Projection pursuit is a dimension reduction technique where multivariate data sets of high dimension are projected to two or three dimensions according to the "best" projection. The "best" projection angle is selected according to a projection index, which determines the outcome of the method. This index is a cost function determined by the objective of projection pursuit and is usually based on the amount of structure found in the projection. Vetter and Reed dynamically projected tracing data, reducing the number of metrics to three interesting metrics at a given sampling time. This was used on data extracted by the Pablo performance tool.

There are some similarities between this work and ours. First, both studies concentrate on the automatic selection of performance metrics by selecting interesting metrics based on a cost function.
Moreover, the projection index used by Vetter and Reed is based on an entropy estimate, as is ours. Both studies perform data preprocessing: they use data smoothing, centering, normalization to a range, and sphering on the data, while we use Euclidean normalization. On the other hand, they use a linear combination of measurements for the projection and selection of metrics, in contrast to our use of subset selection techniques. Moreover, we use dimensionality estimation to determine the number of metrics to select before starting the selection process, in contrast to the three metrics obtained by projection pursuit. The selection of metrics in our work is post mortem while in their case it is dynamic. One surprising similarity that we found is that even though they are using MPI and a distributed memory system for their case study, and we are using OpenMP and a shared memory system, both methods selected the metric bwrite as an important metric. In all our examples this metric was selected by our subset selection method. In most examples presented in their work [14], it was selected by projection pursuit. The reason might be that both cost functions are based on an entropy estimate.

A more recent piece of work is presented in [15]. In this study, Ahn and Vetter used several multivariate statistical techniques on hardware performance metrics to characterize high-performance computing systems. They specifically evaluated the use of principal component analysis (PCA), clustering, and factor analysis to extract performance information. Factor analysis and clustering are combined to gain insight into the behavior of metrics, selecting important metrics to observe, and classifying or categorizing metrics together. They apply these methods to three different applications on two different IBM SP systems, and the parallel code was developed both with MPI and OpenMP. Their results show that, for homogeneous systems, metrics coming from processors with similar tasks are categorized together. Master and worker threads show different behavior and were classified into different clusters by their method. Also, different memory behavior caused the classification of processors into different groups.

In the same way, we have identified multidimensional statistical analysis techniques as fundamental for the automated detection of patterns in the information provided by the large volumes of data obtained by performance tools. Like Ahn and Vetter, we have used a correlation matrix to establish linear relations among metrics. We have also suggested the use of a knowledge-based system for recommending optimizations to the programmer. Our work is also applicable to homogeneous systems. Like Ahn and Vetter, we keep metrics which account for most of the variation of the data. However, there are some differences in our work we should mention. First, our multidimensional analysis combines feature subset selection, dimensionality estimation with an entropy cost function, and ANOVA for extracting information from the data set, whereas Ahn and Vetter have combined clustering, factor analysis, and principal component analysis in their study. Principal component analysis was used for visualization purposes. Second, we have studied software performance metrics while they examine hardware-associated metrics. Third, they do not estimate the dimensionality of the data set, nor do they use any cost function associated with entropy. On the other hand, we have not used derived metrics as in their work, which could complement our study.
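For concreteness, the entropy-guided subset selection referred to throughout this comparison can be sketched as a sequential forward search over metrics. The MATLAB sketch below is illustrative only: it follows one common formulation of an entropy cost for unsupervised feature selection (pairwise similarities between runs passed through an exponential kernel, in the style of Dash and Liu), and the function names, the choice of Euclidean distance, and the scaling of alpha are assumptions made for this example; it need not match the exact routine of Appendix I.2.

    % Illustrative sketch of entropy-based sequential forward selection of
    % metrics.  X is assumed to be a dimension-normalized runs-by-metrics
    % matrix and k the number of metrics to keep.
    function selected = forward_select(X, k)
        nmetrics = size(X, 2);
        selected = [];
        remaining = 1:nmetrics;
        for step = 1:k
            best_cost = Inf;
            best_m = remaining(1);
            for m = remaining
                cost = subset_entropy(X(:, [selected m]));
                if cost < best_cost
                    best_cost = cost;
                    best_m = m;
                end
            end
            selected = [selected best_m];
            remaining = remaining(remaining ~= best_m);
        end
    end

    function E = subset_entropy(Z)
        % Pairwise similarities S = exp(-alpha*D) over the runs, using only the
        % candidate subset of metrics Z; E measures the disorder of that structure.
        n = size(Z, 1);
        D = zeros(n);
        for i = 1:n
            for j = 1:n
                D(i, j) = norm(Z(i, :) - Z(j, :));
            end
        end
        mask = ~eye(n);
        alpha = -log(0.5) / mean(D(mask));   % an average distance maps to S = 0.5
        S = exp(-alpha * D(mask));
        S = min(max(S, eps), 1 - eps);       % keep the logarithms finite
        E = -sum(S .* log(S) + (1 - S) .* log(1 - S));
    end

In use, k would be taken from one of the intrinsic dimensionality estimates, so that the search stops once the estimated number of informative metrics has been selected.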
2.5 Automatic Performance Evaluation

APART stands for Automatic Performance Analysis: Resources and Tools. It started in 1999 as a group of researchers, institutions, and companies investigating the area of automatic performance analysis. They have been working towards the formalization of the language and methods used to present performance information and have also identified the requirements for automatic performance analysis tools, based on their vast experience in the area.

In order to automate the analysis of performance data, the APART group worked on three different aspects of the problem. APART first identified the requirements for automatic performance analysis [7]. Some important conclusions were reached by this first study. Analysis should take into consideration the application, the programmer, and the performance monitoring support. Programming style and architecture should also be taken into account in the analysis. They also determined that there are two styles of performance evaluation: hardware utilization and the identification of the relation between performance data and source code. Finally, the complex interactions existing in these types of systems may cause slight variations to produce large performance differences.

The second study completed by the group identified the need for an infrastructure for measurement, modeling, and analysis of performance [21]. They developed the APART Specification Language (ASL). ASL describes performance properties using an object-oriented model of performance properties [16, 22, 23] and a corresponding syntax. In this work, performance properties were identified according to the programming model used. Table 2.1 contains some of the APART performance properties defined for OpenMP. The first group of metrics is related to memory utilization, from which we can identify cache usage as important. Synchronization and parallel organization metrics are also listed. Since we are using KAP/Pro, we can measure only the number of threads and the synchronization time.

Table 2.1. OpenMP Metrics

Name                               Level        Tools for measurement   Category
Instruction cache misses           Low Level    x                       Memory
Data cache misses                  Low Level    x                       Memory
Instruction cache hits             Low Level    x                       Memory
Data cache hits                    Low Level    x                       Memory
Cache hit rate                     Low Level    x                       Memory
Disk access                        Low Level    x                       Memory
Buffer size                        Low Level    x                       Memory
Number of loads and stores         Low Level    x                       Memory
Time of context switches           Low Level    x                       Memory
Remote reference (page) count      Low Level    x                       Memory
Number of threads                  High Level   xosview, top            Memory
Synchronization time               Low Level    KAP/Pro                 Synchronization
Synchronization counts             High Level   x                       Synchronization
Number of iterations per thread    High Level   x                       Parallel Organization
Execution time of parallel loops   Low Level    x                       Parallel Organization
Loop organization overhead         Low Level    x                       Parallel Organization
Loop execution time                Low Level    x                       Parallel Organization
Loop overhead                      Low Level    x                       Parallel Organization

We are using some of these defined metrics as the response variables in our study. The third study concentrated on implementation-related issues. A survey of existing tools was conducted and a categorization scheme was proposed. Integration of tools and experimentation were identified as key issues in automated performance analysis.

2.6 Summary

Researchers have worked on different aspects pertaining to our work. Early work in performance analysis established a model to be used by tool developers to relate high-level abstractions to low-level performance information.
This is called the NV model [17, 18, 10]. Some tool developers have addressed this problem with a different approach. HPCView relates source code to performance data by estimating the program structure and combining compiler information with profiling information [8]. Even though these two works address the same problem as ours, we complement their work by providing a methodology applicable to platforms where their tools are not available. ANOVA and design of experiments have been used for the analysis of execution time of algorithms in operational research [3]. They have also been used for the study of scalability of large codes [11] and the analysis of memory hierarchies [12, 19, 20]. A third aspect used in our methodology is the application of multivariate methods for performance data analysis. Nickolayev et al. used data clustering for the reduction of trace data along the trace space dimension [13]. Vetter and Reed used statistical projection pursuit for identifying important metrics on a system [14]. Ahn and Vetter evaluated principal component analysis, clustering, and factor analysis to extract performance information from data collected with hardware counters on a distributed system [15]. Finally, we presented the work of the APART group, whose goal is to investigate different aspects of automated performance analysis and to move the research in this area forward in a coordinated effort among diverse research groups, institutions, and companies [7, 16, 22, 23]. Automatic performance analysis is a quite active research area, kept alive by the APART group.

CHAPTER 3

Proposed Methodology

3.1 Introduction

This work addresses the problem of loss of information when mapping scientific applications to observable computing systems. Information mapping is crucial for automated performance evaluation. Our research proposes a well-defined methodology to extract relevant information from a set of observable measures which try to describe the performance of an observable computing system (OCS). This methodology uses a combination of carefully designed experimentation, multidimensional data analysis, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA is characterized by no preliminary knowledge about the possible relations of the variables under study and by the use of statistics and graphical summaries to understand the information the data is conveying. In CDA, formal statistical methods are used to confirm or reject a hypothesis about the population under study. Experimentation is used to collect unbiased data to confirm or reject the hypotheses and establish causal relationships among controlled factors and resulting observations. Multivariate analysis is used to extract meaningful relations among large sets of data. The proposed methodology applies to multi-input multi-output (MIMO) systems with a set of observable outputs [24] and is arranged into four basic steps, as illustrated in Figure 3.1. First, a preliminary problem analysis is performed. Here we can visualize in general what is affecting performance and gather preliminary information. The second step is to specify the experiment design to collect enough unbiased information to be analyzed for establishing relationships. The third step is to collect the data. The final step is data analysis. A description of each step follows.

Figure 3.1. Proposed methodology to extract information in an OCS: preliminary problem analysis, design of experiments, data collection, and data analysis.
3.2 Preliminary Problem Analysis

A performance problem-solving process starts with the analysis of the problem specification. Here, the components of the observable computing system are identified. These include hardware and software components and measurement tools. In addition, information about the programmer's goal, the performance problem, and the application itself is collected. This delimits the problem scope.

Once the system, application, and performance goal are clear, the next step is to profile the code to identify potential functions to optimize. Analysis continues with the identification of possible factors affecting performance. These include environment factors, algorithms for the functions to be optimized, and hardware-specific factors. Next, a subset of factors is selected for the experiment, considering controllability, feasibility, practicability, and constraints.

3.3 Experiment Specification

The second step in the methodology is the experiment specification. The theory of design of experiments allows us to take an objective approach to the experimentation process [25]. Experimental relationships allow for the identification of causality among variables [9]. A well-known model of the experimentation process is shown in Figure 3.2 [25].

Figure 3.2. Model of an experiment: controllable factors (algorithms, problem size, code) and uncontrollable factors (workload, operating system processes) act on a system that transforms inputs (input data) into outputs (execution time, output data).

Studying all possible factors and all levels of these factors is an intractable problem. A level refers here to one of the possible values of a factor considered in an experiment. To obtain the total number of experimental runs, it is necessary to count all possible assignments of factor levels when all factors are varied at the same time. The next step is to select the random order in which the experimental runs will be executed. Randomization is required to avoid the influence of uncontrollable factors on the outcome. We must also have at least two replicates of the experiment [25]. The effect of each factor is obtained through experimentation by the use of a factorial design. In this type of design, all combinations of all levels of all factors are tested, usually in completely random order [26, 27].

For practical reasons, in certain cases a completely random set of runs might not be easily implemented. A completely randomized run order would imply that from run to run any factor may change. For most computer applications, this is impractical. For example, in our study, changing the problem size from experimental run to experimental run results in excessive time and limits our ability to automatically control experimentation. So a split-split-plot design was used. Split-split-plot is a special case of a split-plot design. A split-plot design is a general case of a factorial design in which randomization is restricted. In this design, one factor is selected for a treatment. A treatment is a set of levels of controllable factors administered to an experimental run. The order in which the treatments will be applied to this factor is selected at random. Once this is fixed, a second factor is selected and, given the order of experimental runs selected for the first factor, randomization is done on the second factor. This can be repeated successively. When a third factor follows the same restrictions, the design is called a split-split-plot design [25].
A partial randomization of experiments causes a higher experimentation error, so a split-split-plot design is suggested only when a completely randomized design is not possible for practical reasons.

3.4 Data Collection

The data collection step is the only one determined particularly by the computer system, language, and tools used. This is due to the large variation of metrics available on different computer systems and at different levels. One group working towards standardization of performance metrics is the APART (Automatic Performance Analysis: Resources and Tools) group [7, 21]. Their work moves towards the formalization of the language and methods used to present performance information and the identification of the requirements for automatic performance analysis tools. APART workpackage 2 presents a set of metrics defined using ASL for determining performance properties for shared memory, message passing, and High Performance Fortran [21].

During this step we identify which metrics are measurable for the paradigms and systems being used. Specifically, we identify the instrumentation tools that are available and the metrics that are measurable at the operating system, application, and hardware levels. From these, for a given paradigm, we select the APART-recommended set of metrics. Important metrics suggested by the application programmer should also be selected. Once a set of performance metrics is selected, instrumentation is activated to collect the data. Code is compiled and linked as needed, and performance data are collected during execution.

3.5 Data Analysis

After data collection, analysis begins. The metrics obtained from the OCS are a sample drawn from a stochastic process [28]. We assume this process is mean-square ergodic in the mean, that is, the corresponding time average converges to the ensemble average in the mean-square sense. These assumptions are necessary to make use of the statistical methods explained below.

Performance metric data is first formatted to support the statistical techniques to be used. For one experiment, a matrix format is used. Each element of the matrix is either an average or an absolute metric value. An average value is computed as the sum of all metric sample values divided by the number of samples, where the samples of the metric values are taken during execution. Each column of the performance data matrix contains the measurements of one performance metric over the set of experimental runs, and each row contains the information about one experimental run. Several statistical techniques may be applied to this matrix, as described below.

The data should be preprocessed. There are several normalization methods which can be applied to high-dimensional performance metrics: absolute, log normalization, min-max, and vector normalization. Absolute refers to no normalization. Log normalization is the use of a logarithmic transform. Min-max transforms the data to the (0,1) range. Finally, vector normalization normalizes each column by dividing it by its Euclidean norm.

We have used the correlation coefficient to find linear relations among variables. The correlation coefficient is a measure of the linear association between two variables. The correlation matrix is a two-dimensional array of correlations where all correlation coefficients are organized systematically. The value of element (i, j) in the correlation matrix is the correlation coefficient between metric i and metric j.
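As a concrete illustration of the preprocessing and correlation steps just described, the sketch below applies the four normalization alternatives to a runs-by-metrics matrix and computes its correlation matrix. The use of NumPy and the matrix sizes are assumptions made only for illustration.

```python
import numpy as np

def normalize(X, method="vector"):
    """Normalize each column (metric) of a runs-by-metrics matrix X."""
    if method == "absolute":   # no normalization
        return X
    if method == "log":        # logarithmic transform (assumes positive values)
        return np.log(X)
    if method == "minmax":     # map each metric to the (0, 1) range
        return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    if method == "vector":     # divide each column by its Euclidean norm
        return X / np.linalg.norm(X, axis=0)
    raise ValueError(f"unknown normalization: {method}")

# Illustrative data: one row per experimental run, one column per metric.
X = np.random.rand(234, 22)
R = np.corrcoef(normalize(X), rowvar=False)   # R[i, j]: correlation of metrics i and j
```

The resulting matrix R is the starting point for the collinearity check and the feature subset selection discussed in the remainder of this chapter.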
In performance data analysis, large volumes of data with complex relationships contain the information on the behavior of the system. We use unsupervised feature subset selection methods for the automatic selection of important features describing the system. Figure 3.3 illustrates this process.

Figure 3.3. Feature subset selection scheme: software and configuration parameters are applied to the observable computing system, and the resulting measurements are passed to unsupervised feature subset selection, which returns the relevant metrics.

Two important issues of unsupervised automatic feature selection are the order identification, or dimensionality, of the data set [29] and the subset generation method [30]. Intrinsic dimensionality estimation methods have been used in the past to estimate the number of components to retain and the number of features to keep [31, 32]. This is illustrated in Figure 3.4. We have tested several methods for intrinsic dimensionality estimation. Once the dimension of the data set is estimated, a subset search method should be selected. We have used sequential forward search with an entropy cost function to select the most important metrics in the data set.

Analysis of variance (ANOVA) is a statistical procedure for the analysis of the response of an experiment. It is used to estimate the contribution of each factor to the variations in the outcome. We are using ANOVA to determine whether any of the factors influences the result obtained for each performance metric. Post hoc methods then identify which differences in the data are significant.

3.6 Summary

This chapter presented an overview of the methodology developed to understand the relationship between high-level abstractions and low-level performance information in an observable computing system. This methodology is composed of four basic steps: preliminary problem analysis, specification of the experiment design, data collection, and data analysis. An overview of each step was given. Specific details of the methodology are given in Chapters 4, 5, 6, and 7.

Figure 3.4. The combination of feature selection and feature extraction for performance data analysis. Feature selection (sequential forward search, sequential backward search, oscillating methods) chooses a subset from a large set of features, while feature extraction (principal component analysis, factor analysis) combines features to obtain a reduced set of new features; both require deciding how many features, components, or factors to select, which is addressed by intrinsic dimension estimation (scree test, Kaiser-Guttman criterion, percentage of total variation).

CHAPTER 4

Preliminary Problem Analysis

4.1 Introduction

Most problem-solving techniques applicable to the area of computer performance evaluation start with a problem analysis. Problem solving can be broadly viewed from two different perspectives: a behaviorist approach and an information-processing-based approach [33, 34]. The behaviorist view is based on stimulus and response without considering the process used to solve the problem. The information processing view is concerned with the process leading to the problem solution [33]. This second perspective is more appropriate for the performance evaluation problem. In the information processing approach, a pattern or form of solution is suggested to reach the desired goal. General problem-solving patterns in the literature establish different steps to reach a solution [34].
Despite their differences, all of them concur on three basic steps: problem and system definition, current situation assessment, and evaluation of alternatives. These steps make up the preliminary problem analysis, illustrated in Figure 4.1, and are described in the following sections.

Figure 4.1. Preliminary problem analysis is the first step in the proposed methodology; it comprises problem and system definition, current situation assessment, and evaluation of alternatives.

4.2 Problem and System Definition

The target problem selected for this research consisted of porting a computational electromagnetics application to a symmetric multiprocessor system with four processors. Specifically, it implemented the finite element method for conformal antenna analysis. The goal was to parallelize the serial code, taking advantage of the system and reducing the execution time. The code in this application could be considered legacy code, since it was developed in Fortran 77 over a period of time [35]. The programmer's expertise is in electromagnetics and numerical methods. However, tuning the application to the target system required detailed knowledge of the computer system and tools.

4.2.1 Finite Element Method in Electromagnetics

The analysis and design of antennas require the characterization of the associated electromagnetic fields. Maxwell's equations form the basis of electromagnetic theory and apply to general fields. In their general form, these equations are not easily solved by direct analytical methods. However, they are simplified and tailored to specific conditions by making appropriate assumptions. To completely determine the set of equations, boundary conditions are imposed in addition to complementary condition equations. Numerical methods are then applied to find a feasible solution. Those methods impose a heavy computational load on the target system. Two numerical methods used in computational electromagnetics (CEM) are integral-equation methods, also known as methods of moments, and finite element-frequency domain (FE-FD) methods [36].

The code in the target application implements a finite-element boundary-integral (FE-BI) method for conformal antenna analysis. A conformal antenna is an antenna which adapts or "conforms" to the surface on which it is mounted [37]. These antennas are attractive for use on vehicles due to their low weight and flexibility [38]. Even though detailed information about the application itself is not required, a general understanding of the FE-BI method is needed for optimizing and comprehending the code.

Integral-equation methods, also known as the method of moments (MoM), start with an integral equation, generally involving a Green's function, in the time domain. They assume that the integrated function can be approximated by a linear combination of a set of basis or expansion functions [36, 39, 40]. This method converts the integral equation to the matrix form

    [Z]{I} = {F},                                                        (4.1)

where Z is the impedance matrix, {I} is the currents data vector, and {F} is the excitation data vector [36, 40]. Important characteristics of the method of moments are that the generated matrices are dense, the method is computationally intensive, and it has large memory requirements [36, 41]. When considering the solution of Maxwell's equations in differential form, the finite element method (FEM) applies.
This method is based on the decomposition of the equation domain into nonoverlapping subregions called finite elements [39]. This decomposition is called meshing [42]. In each subregion, a simple function approximates the solution of the equation, which might be complex over the larger region [39, 43]. If the elements are small enough, this approximation is close to the solution. In this case, small enough implies smaller than a given fraction of the wavelength per side [42]. For two dimensions, the elements are polygons. Simple geometric shapes are used as elements in three dimensions, such as those shown in Figure 4.2 [44].

Figure 4.2. Some representative finite elements, such as the right-angled brick and the tetrahedron.

Instead of solving the original formulation, which may include higher-order derivatives in the differential equation, the finite element method uses a weak formulation, which reduces the differentiability requirements [44]. This allows the use of piecewise functions as approximation functions [43, 45].

When Maxwell's equations are integrated over the finite elements and boundary conditions are imposed, a system of linear equations is constructed that can be expressed in the form

    [A]{Hz} = {I},

where A represents a square, sparse matrix, {Hz} denotes the magnetic field vector, and {I} is the excitation column vector. The advantage of the FEM is that A results in a sparse matrix, so it has lower memory requirements. A disadvantage is that it is difficult to evaluate the boundary conditions when the domain is infinite. Appropriate boundary conditions for terminating the mesh are required. Volakis et al. summarize the steps to generate and solve a FEM system as follows [36]:

- The domain of the problem is determined.
- The finite elements are chosen.
- The mesh is generated.
- The method for terminating the mesh is selected.
- The matrix is generated by using the wave function.
- Boundary conditions are applied to construct the linear system.
- The solver is selected and the system is solved.
- The parameters of interest are computed. These might include capacitances, impedances, scattering matrices, etc.

The finite-element boundary-integral (FE-BI) method combines a finite element method with integral equations to represent the fields outside the surface and to terminate the mesh. Exact boundary conditions are used to terminate the mesh [44]. The resulting system is partly sparse and partly dense. The advantage of this method is that, for a certain class of problems, it can be solved efficiently.

The code used in this particular application implements FE-BI for the analysis of conformal antennas. It was developed by Leo C. Kempel of the Electromagnetics Laboratory at Michigan State University [46]. The FE-BI equations solved by the method are obtained from the weak form of the wave equation explained by Kempel in [46]; schematically,

    ∫_V [∇×W_i · μ_r⁻¹ · ∇×W_j] dV − k₀² ∫_V [W_i · ε_r · W_j] dV
        + (resistive transition term over S_R) + (boundary integral term over S) = f_i^int + f_i^ext,    (4.3)

where the complete form of the two surface terms is given in [46]. The first term on the left side of the equation is related to the magnetic field, the second to the electric field, the third to the resistive transition conditions, and the last one is the boundary integral term. The terms f_i^int and f_i^ext are functions of the internal and external excitations. The code computes the input impedance of the antenna.

As explained in Section A.2.2, the solution of a large system of linear equations can be found using either direct or iterative solvers.
Direct solvers determine the solution in a finite number of steps, while iterative solvers begin with an initial guess of the solution and iteratively improve it until a good enough solution is obtained. As we previously explained, there are stationary and nonstationary methods. Stationary methods include Jacobi and Gauss-Seidel. Nonstationary methods include the conjugate gradient (CG), Generalized Minimal Residual (GMRES), BiConjugate Gradient (BiCG), Conjugate Gradient Squared (CGS), and Biconjugate Gradient Stabilized (Bi-CGSTAB) methods. The iterative methods implemented in the application code are BiCG, CGS, and Bi-CGSTAB. The convergence rate of the method can be improved through preconditioning, as previously explained. A diagonal preconditioner, also known as a Jacobi preconditioner, was used in the code [42].

In general, the application consists of a biconjugate iterative solver for a system of linear equations of the form

    [A]{E} + | G  0 | {E} = {F},                                         (4.4)
             | 0  0 |

where A is a sparse matrix and G is a symmetric dense matrix. Only the lower triangular part of G is saved in memory, using the compressed sparse row format [42]. Dense matrices are very large, so memory becomes a variable of concern when solving the problem. One of the input parameters is the error threshold; it determines the number of iterations.

In the implemented code we worked with relatively small problems of variable sizes, where the smallest number of unknowns was 6033. A preliminary study using a profiler pointed to a dense matrix-vector multiplication subroutine as the code bottleneck, taking most of the execution time and being by far the most time-consuming task. Changing the problem size gave a similar profile, pointing to the same routine as the bottleneck. We ran the application on a four-processor Sparc Enterprise SMP machine. OpenMP directives were used to take advantage of the SMP architecture of the machine.

4.2.2 Observable Computing System

An observable computing system (OCS) is any given computing system with a set of observable measures. In our context, these observables represent physical quantities measurable from the particular system and are traditionally called performance metrics. The OCS used in this case study was composed of an SMP computing system, a set of software tools, and a set of performance measurement tools with their corresponding metrics.

Architecture

We ran our experiments on a quad-processor Sun Enterprise 450 Server. This machine is a shared-memory, symmetric multiprocessor (SMP) system. Each of the processors is an UltraSparc II running at 400 MHz with 2 MB of local high-speed external cache memory. The UltraSparc II is a superscalar, superpipelined, 64-bit RISC microprocessor [47] with a nine-stage pipeline and nine concurrent execution units: four integer execution units, three floating-point execution units, and two graphics execution units [48]. The UltraSparc II has a specialized instruction set called VIS (Visual Instruction Set), designed to accelerate multimedia, image processing, and networking applications. The processor contains a 16 KB non-blocking data cache and a 16 KB instruction cache with 2-bit branch prediction. The processors connect to main memory and I/O via the Ultra Port Architecture (UPA) data bus [49]. This particular machine has 640 MB of main memory with 4-way memory interleaving. The connection from each processor to main memory is through a crossbar switch configured to obtain uniform access to memory. There are two levels of cache.
The first level contains both an instruction cache (I-cache) and a data cache (D-cache). The I-cache is associative with 32-byte cache lines, while the D-cache is direct-mapped with two 16-byte sub-blocks per line [50]. The second level is the external cache.

Software and Analysis Tools

In addition to the operating system, three main components were used for software development and measurement: the KAP/Pro toolset, a profiler, and operating system measurement calls. The operating environment was Solaris 7 (the SunOS 5.7 operating system), which supports the UltraSparc architecture and multithreading [51]. The KAP/Pro toolset was used for software development, measurement, and performance analysis. KAP/Pro has three components: Guide, Assure, and GuideView. Guide is the compiler and linker component of the toolset. It is actually a precompiler on top of a Fortran compiler; in our case, we used the Forte Fortran HPC 6 compiler. Guide supports OpenMP directives for Fortran 77 and includes a statistical library which allows instrumentation of multithreaded code. Assure is the debugger/thread analyzer, and GuideView is the visualization tool for performance analysis. Performance of code instrumented with Guide is visualized through GuideView. The profiler used was gprof. A profiler is used to determine what portion of the time is spent in each routine. This gives us a rough idea of where we should start optimizing the code. The operating system measurement calls used were sar, iostat, and vmstat.

4.3 Current Situation Assessment

A general assessment of the performance of the code was obtained through the gprof profiler. It was important to determine which routines were the most time consuming and how large the differences between them were. This information is useful because sometimes improving only one routine will cause a large difference in performance. Profiling the EMAGs code pointed to the routine BiMATVECCav taking up most of the execution time. BiMATVECCav was performing a matrix-vector multiplication operation on a dense matrix. The routine required double-precision complex number operations. It was noticed that the dense matrix was originally saved in a vector in column-major order. Changing the data structure to save this matrix in row-major order improved the execution time by 40%. After the data structure was changed, the same routine was still taking a significant amount of time. Other functions and subroutines executed by the program were consuming only a small percentage of the time. This included algorithms related to sparse matrix-vector multiplication.

It was important to determine a list of possible factors affecting the performance of the application. From these, a subset was selected for experimentation. Some of the factors considered were:

- Compiler options: The compiler options selected by the programmer affect performance. In some compilers, the order in which the compiler options are given affects the outcome. Since in our case the order affected the size of the executable, we assumed that the order of compiler flags was also a factor to consider. Combinations of flags not allowed by the compiler were discarded as possible options.

- User workload: When the system is being heavily used, performance is completely different than when the system is dedicated to only one task.

- Sampling time: How often the system is sampled for measurement affects performance.
If the system is sampled too often, a heavy load is imposed on the system. On the other hand, if the system is sampled at large intervals, important transient information might be missed.

- Number of processors: The number of processors working on a problem might be changed to determine system speedup.

- Problem size: This number indicated how large the linear system of equations to be solved by the code was. It was controlled by the antenna specifications.

- Algorithms: Different algorithms for the dense matrix-vector multiplication, the sparse matrix-vector multiplication, or any other important kernel of the code might change performance.

- Iterative solvers: Different types of iterative solvers might be programmed to solve the particular application under study.

- Hardware: The system itself might be changed. For instance, memory might be increased, the operating system changed, etc.

From these possible factors, a subset was selected for experimentation. Those whose effect was not as noticeable as others were left out of the subset.

4.4 Evaluation of Alternatives

The parsimony or Pareto principle establishes that a few factors will have the most effect on the outcome while the others contribute very little. The process of pre-selecting those factors which will be considered for experimentation is called screening [52, 53]. In this phase of experimentation we used the one-factor-at-a-time approach described by Montgomery [25] as the screening method. We selected a baseline, or set of levels, for each factor, then varied one factor at a time and obtained a general assessment of which factors were having a larger effect on the response.

Some criteria for selecting the factors to use in experimentation were controllability, practicability, and constraints. Controllability refers to whether or not we can control the factors themselves. Practicability indicates the usefulness of varying the factor in the experimentation process. Constraints, such as time, were also imposed on the experiment design.

As explained in Chapter A, factors are classified as design, held-constant, and allowed-to-vary factors. From the list of factors described above, the type of iterative solver, the number of processors, and the hardware were held constant. This was due to practical reasons. Workload was allowed to vary but under certain constraints: no user was allowed on the machine when experiments were conducted. This limited the workload to processes run by the operating system. Finally, compiler options, sampling time, algorithms, and problem size were selected for screening. Once the screening was complete, sampling time was discarded, since the three levels selected were basically not affecting the outcome of the experiment. The only measurement taken for the screening process was execution time. In summary, the factors were considered in the following way:

- Held-constant factors: iterative solver, number of processors, hardware.
- Allowed-to-vary factors: user workload (with limited use of resources), sampling time.
- Design factors: compiler options, problem size, algorithms.

The code was parallelized using OpenMP calls on the BiMATVECCav routine. This routine was by far taking most of the execution time. Two main reasons were behind selecting OpenMP over MPI for parallelization. First, the system was a shared-memory machine; therefore, OpenMP would map directly to the system.
Second, the application programmers were not familiar with parallel processing, and changes to the code using OpenMP would be easier to understand than MPI changes.

4.5 Summary

This chapter has presented an overview of the Preliminary Problem Analysis step of the proposed methodology. Problem-solving techniques suggest three basic phases of preliminary problem analysis: problem and system definition, current situation assessment, and evaluation of alternatives. These three phases were explained in the context of our application. The application code used in this work deals with the use of finite element (FE) methods for the analysis of conformal antennas. It specifically implements a finite-element boundary-integral method, which combines FE with integral equations (IE) to represent the fields outside the surface and to terminate the mesh. This leads to the solution of large systems of linear equations using iterative methods. The iterative solvers implemented in this code are BiCG, CGS, and Bi-CGSTAB. A profile of the code pointed to a dense matrix-vector multiplication as the code bottleneck. We ran the application on a Sun Enterprise 450 SMP system with four processors, parallelizing it with OpenMP. The KAP/Pro toolset was used for software development, measurement, and analysis. A list of possible factors affecting performance was compiled and evaluated. Screening experiments were held to study the factors and limit this list to three design factors: problem size, dense matrix-vector multiplication algorithm, and compiler options. The next chapter presents a description of the Experiment Specification step of the methodology.

CHAPTER 5

Specifications for the Experiment

5.1 Introduction

Empirical studies were identified as a key step for automatic performance evaluation by the APART group in workpackage 3 of phase one of their study [54]. The goal of experimentation is to understand the interactions in the system. Causal associations can be made in properly designed experimental studies [9]. Design of experiments (DOE) is an effective tool for understanding relations in processes and is typically used in industrial statistics. DOE provides the minimum set of experiments necessary to obtain the information we need, in the most effective order. Experiment specification is the second step in the integrated methodology, and it is illustrated in Figure 5.1. We will explain how DOE is used to obtain relevant information on the system-software interactions in our work.

Figure 5.1. The design of experiments step in the methodology, which involves replication, randomization, and blocking.

5.2 Performance Characterization Experiments

Before proceeding further, we need to define some terminology.

Definition 5.1 Compiling

Let C_S be the set of all selected software codes to be mapped to a target machine and their variants obtained from a given canonical formulation. Let C_C denote the set of compiled codes. Let P_C^IDC denote the parallel compiler that operates on the selected software codes, and d_k a compiler option for the selected compiler. Then for a software code C_i ∈ C_S and its compiled version Ĉ_i ∈ C_C there is a relation

    P_C^IDC(d_1, …, d_m){C_i} = Ĉ_i.                                     (5.1)

This is called compiling. Figure 5.2 shows the process of converting the selected software codes to a set of compiled codes.

Figure 5.2. The parallel compiler P_C^IDC operating on the set of selected software codes to produce the set of compiled codes.
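Definition 5.1 can be read operationally: the parallel compiler, given an ordered list of options d_1, …, d_m, maps a selected source code to a compiled code. The following is a hypothetical sketch of that relation as a driver function; the compiler command name and invocation syntax are placeholders rather than the project's actual build scripts, and the linking and measurement steps are formalized in the definitions that follow.

```python
import subprocess

def compile_code(source, options, output="code.o"):
    """P_C^IDC(d1, ..., dm){C_i}: apply an ordered list of compiler options to one
    selected source code and produce its compiled version. 'guidef77' stands in
    for the parallel compiler; '-c' and '-o' are assumed driver conventions."""
    cmd = ["guidef77", "-c", *options, "-o", output, source]
    subprocess.run(cmd, check=True)
    return output

# Because the target compiler evaluates options left to right, the ordered option
# set matters, so these two (hypothetical) calls are distinct treatments:
# compile_code("prism.f", ["-fast", "-unroll=2", "-Wgstats"])
# compile_code("prism.f", ["-unroll=2", "-fast", "-Wgstats"])
```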
The label IDC identifies a specific compiler. For example, P_C^(guidef77) identifies the KAP/Pro compiler for Fortran, called guidef77.

Definition 5.2 Linking

Let L_L denote the linker and loader used on the code and let C_E be the set of executable codes. The relation between C_C and C_E is

    L_L(l_1, l_2, …, l_r){a_k} = {b_k},

where L_L denotes the linker and loader used; l_1, l_2, …, l_r are the different libraries used in the linking process; and a_k ∈ C_C, b_k ∈ C_E. Figure 5.3 shows the process of linking and loading the compiled code. It produces an executable file.

Figure 5.3. The linker and loader operating on the set of compiled codes to produce the set of executable codes.

Definition 5.3 Metrics

Let Γ denote the set of metrics used to measure performance and p_OSm the operating system under which measurements are taken. Then for a_k ∈ C_E and γ_k ∈ Γ,

    p_OSm(p_m, p_n, …, p_r, p_s){a_k} = γ_k,

where p_m, p_n, …, p_r, p_s denote parameters used to obtain the metrics from the system.

Definition 5.4 Performance Characterization Experiment

A performance characterization experiment is a composition of the form p_OSm ∘ L_L ∘ P_C^IDC acting on a selected code. In other words, a performance characterization experiment (PCE) is defined as the procedure of selecting a software code in a given computer language, applying a parallelizing compiler with an ordered set of directives, running the code on a target machine, and retrieving a well-defined set of performance parameters.

5.3 Design of Experiment

High performance computing systems were designed to be used at their maximum potential; however, most of the time performance is not even close to full potential. Even though performance analysis tools were designed to aid researchers in achieving maximum performance, they are not widely used. Reasons include the lack of guidance on possible problems in the code and the fact that users are expected to understand the data and views presented by the tools and associate them with their codes [8]. A solution to this problem could be obtained through the use of automatic performance tools that guide the user in the analysis and the solution of the performance problem. HPCView addresses this problem by correlating data to source code [8]. However, these utilities are not necessarily available for all platforms. Given the large variety of systems available, we need a general methodology to correlate performance data from advanced platforms to variations in source code or changes in any other important factor considered relevant.

This can be done through the use of design of experiments. A carefully designed experiment can establish relations among factors and outcomes on an observable computing system. In the screening process explained in Section 4.4 we did preliminary studies to detect the most influential factors on the outcome. A simple design was used for screening purposes. Once the set of factors is selected, an experiment should be designed. As explained in Section A.5.2, there are three basic criteria to consider in an experiment: replication, randomization, and blocking. We require at least two replicates of an experiment [25]. A full factorial design was used. A simple experiment can be used for screening purposes, but for the complex interactions observed in high performance computing systems it is not efficient. The randomization scheme is of importance in deciding on a specific design of experiment.
In a completely randomized design, the order in which experimental runs are arranged is randomly allocated. When, in a factorial experiment, we are unable to completely randomize the order of the runs, a split-plot or, as in our case, a split-split-plot design might be used. A partial randomization of experiments causes a higher experimentation error, so a split-split-plot design is suggested only when a completely randomized design is not feasible. Figure 5.4 shows a graphical description of a block of our split-split-plot design. A block refers to a replicate or repetition of the basic experiment. In this figure, a block in the design is divided into whole plots, where the problem size (1, 2, and 3) is selected at random. The subplot factor is the matrix multiplication algorithm (A and B). The sub-subplots then contain the compiler options (a through m), which were tested in random order.

Figure 5.4. Example of one block in our split-split-plot design: problem sizes form the whole plots, algorithms A and B form the subplots, and the compiler-option sets a through m are randomized within each subplot.

5.4 Detailed Description of the Experiment

Four different experiments were performed: two characterization experiments using the application code, and two validation experiments, one with the application code and another with a matrix-vector multiplication algorithm. The goal of all experiments was to identify existing relations between the different compiler options, problem sizes, and codes and the performance metrics obtained when running the experiment. The effect of uncontrollable factors caused by external workload was minimized by using a system with no external workload. We selected a subset of compiler options, algorithms, and metrics to test our methodology.

The name of our application code is Prism. It implements a finite-element boundary-integral (FE-BI) method for conformal antenna analysis, as explained in Section 4.2.1. The code was developed by the electromagnetics research group (EMRG) at Michigan State University.
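The block structure of Figure 5.4 can be generated mechanically. The sketch below is not the random generator listed in Appendix I; it is an independent illustration, under the stated factor levels, of how one block of the split-split-plot run order can be produced and how the total number of runs follows from the design.

```python
import random

def split_split_plot_block(sizes, algorithms, options, seed=None):
    """One block (replicate): randomize the whole plots (problem size) first,
    then the algorithms within each size, then the compiler-option sets
    within each size/algorithm cell."""
    rng = random.Random(seed)
    runs = []
    for size in rng.sample(sizes, len(sizes)):                 # whole plots
        for algo in rng.sample(algorithms, len(algorithms)):   # subplots
            for opt in rng.sample(options, len(options)):      # sub-subplots
                runs.append((size, algo, opt))
    return runs

sizes      = [6033, 6337, 13857]      # problem sizes used in Experiments 1-3
algorithms = ["A", "B"]               # matrix-vector multiplication variants
options    = list(range(1, 14))       # the thirteen compiler-option sets (a-m in Figure 5.4)

block = split_split_plot_block(sizes, algorithms, options, seed=0)
print(len(block), "runs per block;", 3 * len(block), "runs over three replicates")  # 78; 234
```

The printed totals, 78 runs per block and 234 runs over three replicates, match the run counts reported for Experiments 1 and 2 in the next section.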
However, when studying compiler option -xcrossfile in more in detail, we found that it requires optimization level four or more, so it is ignored in the following cases: 0 Compiler option -xcrossfi1e -Vgstats, is equivalent to -Ilgstats o Compiler Option -unroll=2 -xcrossfile ~Wgstats is equivalent to -unroll=2 -Wgstats o Compiler option -xcrossfile -unroll=2 -Wgstats is equivalent to -unroll=2 -Wgstats 45 Therefore the experiment consist of two algorithms, three problem sizes, and sixteen compiler options. This experiment then consists of 234 experimental runs. This comes from formula A.10. The sequence of experiments were labelled E1 to E234. Therefore, we ended having the following factor levels in the experiment: 1. Algorithm 0 Algorithm A is the original matrix multiplication algorithm shown in Appendix C. 0 Algorithm B is the modified matrix multiplication algorithm shown in Appendix C. 2. Problem Size 0 The original problem size was N = 6033. o The problem was modified by changing the antenna specifications which affected most the dense matrix and we obtained a size of N = 6337. o The problem was modified by changing the antenna specifications which affected most the sparse and we obtained a size of N = 13857. 3. Compiler Options We had three replications of the experiment. Experiments 1 to 78 were replication one, experiments 79 to 156 were replication two, and experiments 157 to 234 were replication three. The order in which the algorithms, compiler options, and size were selected was randomized in the following way. First, we selected a random problem size. For that size we randomly selected the order of the algorithms. Third, for each option, at random we pick the 13 compiler options to perform the experiments. This is called a split-split-plot design. The metrics shown in Tables 6.1 to 6.3 were selected for assessing the experimental results. One problem we had to confront when designing the experiment was that statistical 46 Table 5.1. Compiler Options in Experiment 1. [ 1 ] No flags -Wgstats [ 2 [ -fast -Wgstats [ 3 -unroll=2 -Wgstats [ 4 -fast -unroll=2 -Wgstats L5 -fast -xcrossfile -'Wgstats [ 6 -unroll=2 -fast -Wgstats [ 7 -xcrossfile -fast -Wgstats [ 8 ] -fast -unroll:2 -xcrossfile —Wgstats ] [ 9 ] —unroll=2 -xcrossfile -fast -Wgstats ] [ 10 ] -xcrossfile -fast -unroll:2 -Wgstats ] 11 -fast -xcrossfile -unroll=2 -Wgstats ] 12 -unroll=2 -fast -xcrossfile -Wgstats [ [ 13 [ -xcrossfile -unroll=2 -fast -Wgstats ] tools such as Minitab do not allow 13 levels of one factor. These types of statistical tools are designed for industrial experiments, so that number of factors in one variable is very uncommon. Therefore, we programmed our own design of experiment random generator to obtain the order in which the experimental runs were allocated. The code is shown in Appendix I. The final order in which the experiments were executed is shown in Appendix 1.1. We would also like to mention that since we used a split-split plot design, the AN OVA calculations were rather different than the fully-randomized calculations, typically found in statistical softwares. We used the software SAS to specify the split-split plot model and the appropriate error term calculations. 5.4.2 Experiment 2: Serial implementation of Prism A serial version of Prism was studied. We had three repetitions of the experiment for which a split-split plot design was used. The experiment consisted of two algorithms, three problem sizes, and thirteen compiler options. This experiment had 234 experimental runs. 
Therefore we used the following factor levels in the experiment:

1. Algorithm
   - Algorithm D is the matrix multiplication algorithm shown in Appendix C.
   - Algorithm E is the matrix multiplication algorithm shown in Appendix C.

2. Problem size
   - The original problem size was N = 6033.
   - The problem was modified by changing the antenna specifications in a way that mostly affected the dense matrix, giving a size of N = 6337.
   - The problem was modified by changing the antenna specifications in a way that mostly affected the sparse matrix, giving a size of N = 13857.

3. Compiler options: the thirteen option sets listed in Table 5.2.

Table 5.2. Compiler Options in Experiment 2.

 1   No flags -Wgstats
 2   -fast -Wgstats
 3   -unroll=2 -Wgstats
 4   -fast -unroll=2 -Wgstats
 5   -fast -xcrossfile -Wgstats
 6   -unroll=2 -fast -Wgstats
 7   -xcrossfile -fast -Wgstats
 8   -fast -unroll=2 -xcrossfile -Wgstats
 9   -unroll=2 -xcrossfile -fast -Wgstats
10   -xcrossfile -fast -unroll=2 -Wgstats
11   -fast -xcrossfile -unroll=2 -Wgstats
12   -unroll=2 -fast -xcrossfile -Wgstats
13   -xcrossfile -unroll=2 -fast -Wgstats

The order in which the algorithms, compiler options, and sizes were selected was randomized in the same way as in Experiment 1. The same metrics were selected for assessing the experimental results. The final order in which the experiments were executed is shown in Appendix E.

5.4.3 Experiment 3: Inefficient memory access pattern in Prism, validation experiment

A parallel version of Prism was studied, this time adding a third algorithm with an inefficient memory access pattern. We had three repetitions of the experiment, and similarly a split-split-plot design was used. The experiment consisted of three algorithms, three problem sizes, and thirteen compiler option sets, for a total of 351 experimental runs. Therefore we used the following factor levels in the experiment:

1. Algorithm
   - Algorithm A is the matrix multiplication algorithm shown in Appendix C.
   - Algorithm B is the matrix multiplication algorithm shown in Appendix C.
   - Algorithm C is the matrix multiplication algorithm shown in Appendix C.

2. Problem size
   - The original problem size was N = 6033.
   - The problem was modified by changing the antenna specifications in a way that mostly affected the dense matrix, giving a size of N = 6337.
   - The problem was modified by changing the antenna specifications in a way that mostly affected the sparse matrix, giving a size of N = 13857.

3. Compiler options: the thirteen option sets listed in Table 5.3.

The order in which the algorithms, compiler options, and sizes were selected was randomized in the same way as in Experiment 1. The same metrics were selected for assessing the experimental results. The final order in which the experiments were executed is shown in Appendix F.
The experiment consisted of three dense matrix-vector multiplication algorithms, two problem sizes, four compiler options, and two data structures with two repetitions. This experiment had 96 experimental runs. Therefore we used the following factor levels in the experiment: 1. Problem Size: 0 Size 1 refers to a 100 multiplications of a matrix of 500 x 500 elements. 0 Size 2 refers to a 100 multiplications of a matrix of 1000 x 1000 elements. 2. Dense Matrix-Vector Multiplication Algorithm: 0 Algorithm A described in Section 5.4.1 o Golub’s algorithm described in [55] and shown as Algorithm F in appendix C. 50 0 Algorithm G described in Appendix C. Inverse reading. 3. Compiler Options: Four levels 0 Compiler Option 1: No flags -Wgstats O Compiler Option 2: -fast -Wgstats O Compiler Option 3: -O5 -Wgstats 0 Compiler Option 4: -fast -05 -Wgstats 4. Data Structure: Two levels 0 The matrix is accessed row by row 0 The matrix is accessed column by column The order in which the algorithms, compiler options, and sizes were selected was random in the same way they were selected for Experiment 1. The same metrics were selected for assessing the experimental results. 5.5 Summary A performance characterization experiment is a procedure for selecting a software code in a given computer language, applying a parallelization compiler with an ordered set of direc- tives, running the code on a target machine, and retrieving a well defined set of performance parameters. Design of experiments is key for determining whether factors are significantly affecting the outcome on a particular system. A screening experiment can be performed with a simple experiment to determine the most important factors affecting performance. Then a full factorial experiment can be used for studying the effect of these factors. A split-split plot design was used in our case for feasibility constraints. Four different experiments were described in detail for our case study. The first three experiments were done using the full application code while the last one was done with a 51 matrix-vector multiplication algorithm. Experiment one studies the parallel implementa- tion of Prism. Experiment two considers its serial implementation. In experiment three, an algorithm with a bad memory access pattern was introduced to the parallel implementation for studying the behavior of the system. Finally, the last experiment is a validation experi- ment in which algorithms for matrix—vector multiplication were studied for different problem sizes, data structures, and compiler options. This last experiment is fully randomized while the first three use a split-split plot design. 52 CHAPTER 6 Data Collection 6. 1 Introduction Data collection is the process of acquiring information about the actual behavior of a com- puter system and its associated software through measurements. Instrumentation is the group of modules to collect and manage from a program while it runs on a parallel or dis- tributed system [56]. Data collection is the only step in the proposed methodology tied to architecture specific details. As previously mentioned, the proposed methodology consists of preliminary problem analysis, experiment design, data collection, and data analysis as depicted on Figure 6.1. Preliminary Problem _ Design of _ Data _ Data Analysis Experiments 7 Collection Analysis Tool File manipulation FD Instrumentation ~> _ Setup Perl scripts Figure 6.1. Data Collection step integrated with the methodology. 
Instrumentation techniques are applied to different components in the system: programs, the operating system, and the processor [1]. Program instrumentation collects information about the application code and its interactions with the system. When measurements are collected from the operating system, it is called operating system instrumentation. The operating system keeps track of the behavior of the memory, file system, cache, and processor status. Additional information about the processor may be obtained directly from hardware counters. In any observation, there is perturbation of the observed system; this is called intrusion. When hardware counters are used, the intrusion on the behavior of the system is smaller than with software instrumentation. Hardware counters can provide information on memory behavior, floating-point executions, instructions executed, and branching, among others. For grid computing, there are also network instrumentation techniques, where the goal is to monitor network traffic and find possible problems with the communication. In our work, we concentrated our efforts on program and operating system instrumentation techniques.

Software instrumentation can be inserted at different stages in the software mapping process [1]. Figure 6.2 illustrates this mapping process.

Figure 6.2. Stages in the program mapping process: source code, compiler, object code, libraries and linker, executable, and running code [1].

Some of the stages at which instrumentation can be inserted are source code, compiler, object code, library, executable, and running code [57]. When instrumentation occurs at the source code level, either the programmer inserts instrumentation calls manually into the source code or a preprocessor inserts them automatically. These calls collect event information from the system. Instrumentation can also be introduced through the use of libraries at compile time. Wrappers are used to insert instrumentation calls into the code. Wrapper here refers to an interface that converts information from a software source to an application. MPE (Multi-Processing Environment), which is distributed with MPI, is an example of one such library [58]. Another technique is binary rewriting, which edits the executable file and rewrites it with added instrumentation code; Atom is an example of such a tool [59]. The last technique is called dynamic instrumentation. In dynamic instrumentation, the running program is modified to generate performance data. This technique has been successfully used by Dyninst [60].

Computer performance metrics can be classified into two different types: traces and profiles [57]. Traces are a collection of events with associated information about the state of the system when each event occurred. Profiles are counts or summaries of events that occurred in a specific period of time during the execution of a program. There are different trace formats for the data: MPICL, SDDF, VTE, the Vampir trace format, ALOG, SLOG, Epilog, and the Paraver trace format are some of the typical formats used in current performance tools. Each one contains a set of records to generate information about the event. Whichever format we select for the tracing data, it has to be manipulated to extract information suitable for the statistical analysis, as described in Section 7.3.1. We use Perl scripts to manipulate the trace files and extract the data.
2 Tools We have used software instrumentation at the library level and operating system instru- mentation to collect information about the behavior of our application 6.2.1 Software Instrumentation Software instrumentation at the library level was done through the use of the KAP/ Pro toolset. The KAP / Pro is composed of three independent tools: the Guide Compiler, Assure, and GuideView. Guide is a precompiler for OpenMP. It accepts Fortran 77 code with OpenMP directives and produces Fortran code with thread programming [61]. Assure is 55 a thread debugger/ analyzer. GuideView is a tool for performance analysis which shows a visual description of the data collected by the Guide instrumentation library. Guide for Fortran, version 3.9, was used as preprocessor for Forte Fortran/HPC 6 to implement OpenMp. Guide contains the -Wgstats library to collect profiling information about the execution of a program. The compiler directive -guide_stat:s is used during the compiling/ linking phase to collect performance information at run time. Some statistics collected by the guide-stats library are: number of CPUs, start time and stop time, number of serial regions, number of parallel regions, number of barrier regions, CPU time, CPU utilization, elapsed time, imbalance time per thread, parallel time per thread, and total serial time. 6.2.2 Operating System Metrics The concept of metrics is central to this work. By a metric we mean the variations of the observable quantities of a target computing system stored in the form of variables. The variables used for the metrics presented in this work are either of the interval or ratio types, as defined in Section A.5.2. For large scale complex systems, the dimension of the set of variables under consideration tend to be very high and the volumes of data tend to be extremely large [15]. The dimension of variables, and the size of the data associated with this type of Observations of computing systems, necessitate the use of new methodologies and techniques of analysis in order to extract information relevant to an application programmer. As previously discussed in Chapter 3, our methodology centers on the integration of instrumentation-based data collection, systematic experimental design, and the selection of appropriate statistical analysis techniques. The instrumentation-based data collection on the target computing system is effected by the operating system of the machine (in our case Solaris). It is important to point out the collinearity problem that arises when tools such as an operating system is used for data collection in a computing system. Collinearity is studied in the context of regression and refers to the independent variables or regressor variables. Two variables are exactly collinear if there is a linear equation describing their relationship [62]. Approximate collinearity occurs if the linear equation approximately 56 gives the relationship among variables. Some metrics collected by the operating system log information from the same groups of variables causing large correlations between different metrics [50]. This is illustrated in Figure 6.3. Observables Observables Ideal Real Orr—“r“. «V7 A 0’1 N'O O: A A r H. o: \ OCS Measurement OCS Measurement Machine Set Machine Set Ideal condition: correspondence between Observable quantities and variables. Figure 6.3. Collinearity problem: Those metrics obtained by the operating system may come from the same groups of variables. 
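Before the collinearity issue illustrated in Figure 6.3 is formalized in the next section, the following sketch shows one practical way to flag it: compute the sample correlation matrix of the collected metric columns and inspect its smallest eigenvalue (the 0.05 threshold anticipates the criterion quoted below). This is an illustration only, not part of the instrumentation toolset; the function name and toy data are ours.

import numpy as np

def near_collinear_metrics(data, eig_threshold=0.05):
    """data: runs x metrics array of measurements.
    Returns (smallest eigenvalue, True if near-collinearity is indicated)."""
    corr = np.corrcoef(data, rowvar=False)      # sample correlation of metric columns
    eigvals = np.linalg.eigvalsh(corr)          # symmetric matrix -> real eigenvalues
    smallest = float(eigvals.min())
    return smallest, smallest < eig_threshold

# Toy example: metric 2 is (almost) a multiple of metric 0.
rng = np.random.default_rng(0)
m = rng.random((30, 3))
m[:, 2] = 2.0 * m[:, 0] + 1e-6 * rng.random(30)
print(near_collinear_metrics(m))   # smallest eigenvalue close to zero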
We are studying collinearity on the dependent variables or the response variables. Ac- cording to Sundberg in [63], the degree of multicollinearity can be estimated by principal component analysis of the sample correlation matrix of the data. If the smallest eigenvalue is less than 0.05 then there is collinearity. From linear algebra it is known that highly correlated variables cause a large condition number of the observation matrix, making the use of a large number of statistical methods unfeasible. This problem can be alleviated by the use of subset selection methods [2]. At the operating system level performance information was collected using the sar, iostat, and vmstat commands. We used a sampling period of 20 seconds for each one of the commands. According to [50], as long as the sampling period is greater than 5 seconds, the system is not affected by the collection of data. The sar command (system activity reporter) is used by system administrators to collect baseline system activity information when the system is having normal workload and then used to determine the reasons why a system is having a different performance. Among 57 the data sar will provide is buffer and paging activity, CPU usage, and system swapping activity. Some of the metrics of interest from this command are shown in table 6.1. Table 6.1. Metrics obtained from the SAR command Label Name Description Category m1 bread/s Reads per second of data to sys- Buffer Activity tern buffers from disk m2 lread/s Accesses of systems buffers to Buffer Activity read m3 %rcache Cache hit ratios for read as per- Buffer Activity centage m4 bwrit/s Writes per second of data from Buffer Activity system buffers to disk m5 lwrit/s Accesses of system buffers to Buffer Activity write n16 %wcache Cache hit ratios for write as per- Buffer Activity centage m7 pgout/s Page-out requests per second Paging Activity m8 ppgout/s Pages paged-out per second Paging Activity m9 pgfree/s Pages per second placed on the Paging Activity free list by the page stealing dae- 111011 m10 pgscan/s Pages per second scanned by the Paging Activity page stealing daemon. mll atch/s Page faults per second that are Paging Activity (2) satisfied by reclaiming a page cur- rently in memory (attaches per second) continued on next page 58 Table 6.1 (cont’d). Label Name Description Category m12 pgin/s Page-in requests per second Paging Activity (2) m13 ppgin/s Pages paged-in per second Paging Activity (2) m14 pflt/s Page faults from protection errors Paging Activity (2) per second (illegal access to page) m15 vflt/s Address translation page faults Paging Activity (2) per second (valid page not in memory) m16 %usr Portion of time running in user CPU utilization mode m17 %sys Portion of time running in system CPU utilization mode m18 %wio Portion Of time running idle with CPU utilization some process waiting for block I/O m19 %idle Portion of time running idle CPU utilization m20 pswch/s Process switches System swapping activity Iostat reports input / output statistics from the system. Those metrics we observed using the iostat command are shown in table 6.2. Iostat provides statistics about CPU utilization and disk utilization (per physical disk). Table 6.2. Metrics obtained from the IOSTAT command Label Name Description Category m21 diskl/rps Read per second per disk I/O m22 diskl/wps Write per second per disk I/O continued on next page 59 Table 6.2 (cont’d). 
Label Name Description Category m23 diskl/util Percentage of disk utilization per I/O disk m24 disk2/rps Read per second per disk 1/ O m25 disk2/wps Write per second per disk I/O m26 disk2/util Percentage of disk utilization per 1 / O disk m27 cpu / us Report the percentage of time the CPU utilization system has spent in user mode. m28 cpu/sy Report the percentage of time the CPU utilization system has spent in system mode m29 cpu / wt Report the percentage of time the CPU utilization system has spent waiting for 1/ O m30 cpu/ id Report the percentage of time the CPU utilization system has spent idling. Vmstat stands for virtual memory statistics. This command reports aggregate informa- tion about virtual memory statistics in the system. Those metrics we observed from the vmstat command are shown in table 6.3 Table 6.3. Metrics obtained from the VMSTAT command Label Name Description Category ] m31 memory/swap Usage of virtual and real memory. Virtual Memory Statistic Amount of swap space currently available (Kbytes) continued on next page 60 Table 6.3 (cont’d). Label Name Description Category m32 memory/free Usage of virtual and real memory. Virtual Memory Statistic Free size of the free list (Kbytes) m33 page/ re Page reclaims per second. Paging activity m34 page/mf Minor faults per second. Paging activity m35 page/ pi Kilobytes paged in per second. Paging activity m36 page/p0 Kilobytes paged out per second. Paging activity m37 page/ fr Kilobytes freed per second. Paging activity m38 page/sr Pages scanned by clock algorithm Paging activity per second. m39 disk/SO Disk operations per second. Disk m40 disk/sl Disk operations per second. Disk m4] faults / in Trap/ Interrupt rates per second. Memory faults Non-clock device interrupts. m42 faults/3y Trap/ Interrupt rates per second. Memory faults. System calls. m43 faults/cs Trap/ Interrupt rates per second. Memory faults. CPU context switches. m44 cpu / us Percentage usage of CPU time. Av- CPU utilization. erage across all processors. User time. m45 cpu/sy Percentage usage of CPU time. Av- CPU utilization. erage across all processors. System time. m46 cpu/ id Percentage usage of CPU time. Av- CPU utilization. erage across all processors. Idle time. 61 6.2.3 Output Format When we collect profiling data from Guide, we get is a scalar containing aggregate informa- tion about the runs executed. We have programmed some perl scripts to convert the data from the format established by Guide to a format appropriate for the statistical analysis tools. Also, every 20 seconds we collected information from the Operating system in form of a vector containing all measurements. At the end of one experimental run, a matrix containing all measurements from the time the application started running to the end of the run was generated. A vector containing the average of all metrics was generated per experimental run. A matrix containing all experimental runs from one experiment will be created. All these manipulations were done using perl scripts. Some of these scripts are shown in appendix J. 6.3 Summary The data collection step of the methodology is the only one tied to the specific architecture and software available on the particular platform where the software runs. Data can be collected in the form of a profile as a scalar, or in the form of a trace as a vector. Instru- mentation can be inserted at programs, operating system, and processor level. 
For software instrumentation data can be collected inserting instrumentation at the source code, com- piler, object code, library, executable, and running code. We have inserted instrumentation at the library level. We have also used operating system instrumentation. A description of the metrics used for data collection was given in this chapter. 62 CHAPTER 7 Data Analysis 7.1 Introduction The last step in the integrated methodology for obtaining information across levels is data analysis. This is depicted in Figure 7.1. Preliminary Problem _ Design of _ Data _ Data Analysis Experiments Collection Analysis —— II II Statistical Analysis Feature Subset Selection II I (Dimensionality 3 Correlation Matrix Subset Selection Figure 7.1. Data Analysis is the last step in the proposed methodology. Performance data collected during experimentation will not yield useful information un- til it is carefully analyzed. Statistical methods are the basis for data analysis. Specifically, 63 three statistical techniques were used in our case study: correlation analysis, multidimen- sional data subset selection, and analysis of variance. It should be mentioned, however, that additional methods may be used for data analysis to extract information. For exam- ple, multiple regression analysis might be used to model the outcome from the system. This process can be seen as parallel to the steps for a knowledge discovery process. The goal of knowledge discovery is to efl'ectively transform raw data into information [2]. Knowledge discovery is composed of several steps [64], which include: data preprocessing, data reduction, modeling and hypothesis selection, and data mining. In this chapter we describe the statistical data analysis used for our study. 7.2 Statistical Models for Performance Analysis Statistical analysis provide a powerful tool for evaluating and interpreting data objectively. Statistical methods have been used for performance data analysis and interpretation for a long time [26, 27]. However, the proposed integrated use of design of experiments, correla- tion analysis, and feature selection has not been used for automatic performance evaluation. Statistical methods are the basis for analysis of multivariate data. Three distinct statis- tical techniques are combined in our study to assess the status of the system performance: correlation, multivariate analysis, and analysis of variance (ANOVA). The use of design of experiments techniques in our experimentation process prompts the use of AN OVA or any general linear model on the data. We have used correlation analysis and ANOVA to establish relations in the data. Design of experiments techniques have been used in the past in the area of computer performance for analyzing the behavior of algorithms and heuristics in terms of execution time [3], scalability [11], and for performance evaluation of memory hierarchies [12, 19]. We introduce its use for the empirical study of performance instrumentation data. We were particularly interested in correlating high-level programming decisions to low- level metric information. 64 7 .3 Measuring Relationships in Multidimensional Data A block diagram of the components of a performance data analysis system is shown in Figure 7.2. It follows the structure of a general pattern recognition system [65]. Raw Normalized Feature Important Workload Data Measurement Vectors Factors Action I Data Analysis . . I _ . Decrsron Computer 4 Instrumen __> Pre—processmg __) Feature 9 and 9 Mak' ‘ System tatton Selection . 
mg Interpretation I Statistical Data Analysis [ Figure 7.2. Performance Data Analysis Architecture. Statistical methods used for data analysis begin with the raw data and conclude with the information obtained about the factors affecting the system performance. After data collection, we have a set of measurements drawn from an observable comput- ing system (OCS) which are a sample from a stochastic process. It is important to emphasize the stochastic nature of the performance data since variations in the outcome from the mea- surements might be significant (due to real factors affecting the system), or insignificant (due to the random nature of the process itself, and therefore, not important). The time variable might be discrete or continuous. Similarly, the value might be countable or un- countable, giving rise to four types of stochastic processes: discrete time-discrete value, dis- crete time-continuous value, continuous time-discrete value, and continuous time-continuous value processes. Figure 7.3 shows a graphical depiction of a discrete-time continuous-value random process. 'Ii‘acing data measurements taken in our system come from a discrete-time continuous-value random process. We assume that this process is mean-square ergodic in the mean. The ergodicity as- sumption is needed to compute the averages of the metrics. We have selected a very small problem size for our experiments due to time constraints. 65 Time I Figure 7.3. Graphical View of a Discrete-Time Continuous Value Stochastic Process However, typical running times for this code is in the order of days, and depends on the physical characteristics of the antenna to be analyzed. 7 .3.1 Formatting Data for Statistical Methods Data obtained from an instrumentation system come in different formats. If data are received in the form of summaries or aggregate information, like those obtained from KAP / Pro, each experimental run will provide a set of average measurements of the variable of interest or an absolute value. Some examples of these types of measurements, taken by KAP / Pro, are percentage of imbalance time per thread, percentage of barrier time per thread, CPU time, and idle time. Another type of measurement is obtained as a trace file. Here, measurements are either taken at specific time intervals or a times triggered by an event of interest in the system. Operating system metrics on Unix or Linux systems taken with the ear, iostat, vmstat, or mpstat commands are examples of measurements taken at specific time intervals. In these commands, one of the arguments is the sampling time at which all values will be reported, allowing for specifying the recording times. Other tools, such as Vampirtrace or MPE, will trace MP1 activities at the instant of time when they actually occur, and not at regular time intervals. 66 This leads to three different types of variables to manipulate: absolute, average, and regularly sampled measurements. Absolute measurements are variables which record a total count of an event or a real number representing total time. Average measurements are values computed by the instrumentation system where the system itself records a total number and performs an average calculation on the overall measurement and presents this number. Finally, regularly sampled measurements are recorded at regularly spaced sampling times. In order to support data analysis using statistical methods, the data obtained from experiments should be formatted as a matrix. 
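To make the reduction of a regularly sampled measurement to a single per-run value concrete, the sketch below computes the temporal average that is formalized as Eq. (7.1) in the next paragraphs, under the mean-ergodicity assumption stated there. The sample values are invented for the example.

def temporal_average(samples):
    """Estimate the statistical mean of a regularly sampled metric by its
    temporal average, assuming the process is mean-ergodic."""
    n = len(samples)
    return sum(samples) / n if n else 0.0

# Hypothetical 20-second samples of a paging metric for one experimental run.
page_faults_per_s = [12.0, 9.5, 11.2, 10.8, 9.9]
print(temporal_average(page_faults_per_s))   # one entry of the performance data matrix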
Each experiment consists of a series of experimental runs, each supplying a set of measurements. These measurements may come in the form of a random number, in the case of an average or absolute value, or as a sample of a discrete-time continuous-value random process, in the case of trace data. This type of trace data can be regarded as a time series and can be modelled as a piecewise independent stochastic process [66]. When data are presented in the form of a time series, the temporal average of the realization is computed to estimate the statistical average, assuming the process is mean-ergodic. The temporal average is computed by

\[ \hat{\mu} = \frac{1}{N}\sum_{i=0}^{N-1} x[i] \qquad (7.1) \]

where N is the number of points in the time series [67].

For one experiment, one matrix is formed from all measurements. Each element is either a temporal average or an absolute metric value. Page faults per second is an example of a temporal-average metric; we denote an average metric as $m^{avg}$. An absolute value, denoted $m^{abs}$, is a metric whose value is obtained as a total at the end of the execution time; it can be either discrete or continuous. Total CPU time is an example of an absolute metric.

One experiment consists of K experimental runs executed in a pre-specified order. This order is randomized during the experiment specification phase of the methodology to minimize the number of experiments required to obtain useful information. The randomization scheme determines the precision of the results obtained. Let P denote the number of performance metrics measured during an experimental run, let k denote the experimental run, where $0 \le k \le K-1$, and let p denote the metric identification number, where $0 \le p \le P-1$. Let M denote the following performance data matrix:

\[ M = \begin{bmatrix} m^{a}[0,0] & m^{a}[0,1] & \cdots & m^{a}[0,P-1] \\ m^{a}[1,0] & m^{a}[1,1] & \cdots & m^{a}[1,P-1] \\ \vdots & \vdots & & \vdots \\ m^{a}[K-1,0] & m^{a}[K-1,1] & \cdots & m^{a}[K-1,P-1] \end{bmatrix} \qquad (7.2) \]

where $M \in l^{2}(\mathbb{Z}_K \times \mathbb{Z}_P)$. In a more compact form, $M = \{\, m^{a}[k,p] : k \in \mathbb{Z}_K,\; p \in \mathbb{Z}_P;\; a = \mathrm{abs}\ \text{or}\ a = \mathrm{avg} \,\}$. We can also use the notation $m^{a}[k,p] = m^{a}_{k}[p] = m^{a}_{p}[k]$, where $m^{a}[k,p]$ denotes the average or absolute metric value for experimental run k and metric p, and a is either avg or abs. Note that each column of the performance data matrix M consists of measurements of one performance metric over the set of experimental runs, and each row is a P-dimensional vector containing all measurements for one given experimental run. The notation $m^{a}_{k}[p]$ refers to one row of the matrix and $m^{a}_{p}[k]$ to one column of the matrix.

For example, if we perform one experiment consisting of four experimental runs, measuring ExTime, pgfaults/sec, cachehits/sec, cachemisses/sec, and idleTime, then matrix M will be a 4 x 5 matrix with the format shown in Figure 7.4, in which row k holds the five metric values for run k and each column holds one metric across the four runs.

Figure 7.4. Example of the matrix format used for the performance data.

When the matrix is formatted in this fashion, it is ready for statistical techniques such as correlation, ANOVA, and feature subset selection. Before applying any of these methods, preprocessing is required on this type of data.
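A minimal sketch of assembling the performance data matrix just described: each experimental run contributes one row, and each column holds one metric. The metric names follow the five-metric example above; the numeric values are placeholders, not measured data.

import numpy as np

metrics = ["ExTime", "pgfaults/sec", "cachehits/sec", "cachemisses/sec", "idleTime"]

# One row per experimental run (K = 4 runs), one column per metric (P = 5).
# Entries stand in for temporal averages or absolute values.
runs = [
    [101.2, 10.9, 5.2e5, 1.3e4, 3.1],
    [ 98.7, 11.4, 5.0e5, 1.2e4, 2.8],
    [142.5,  8.2, 4.1e5, 2.9e4, 7.6],
    [139.9,  8.0, 4.2e5, 3.0e4, 7.2],
]

M = np.array(runs)              # K x P performance data matrix
print(M.shape)                  # (4, 5)
print(dict(zip(metrics, M[0]))) # row 0: all metrics for experimental run 0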
7.3.2 Preprocessing

Most statistical techniques are biased by the magnitude or order of the data to be processed. For example, principal component analysis is not scale invariant: when values are measured on largely different scales, principal component analysis is biased towards the largest values [68]. In pattern recognition these techniques are applied mostly to images, where pixel values lie in the range 0 to 255, so problems with differences in magnitude are not typically found. However, in areas such as data mining [69, 70] and sensor data [71], one measurement may differ from another by several orders of magnitude. This is similar to our case. For example, the variable pswch/s (process switches per second) is on the order of thousands, while disk/util (disk utilization) lies in the (0,1) range. This problem is mitigated by scaling the inputs to the statistical methods.

Normalization

Data normalization makes all measurements comparable in the subsequent methods, ensuring that the statistical techniques are not sensitive to large differences in the scales of the data. Global normalization methods are applicable to performance data. Different normalization methods are available, among which we considered the following (a short code sketch of the min-max and dimension normalizations appears after Figure 7.5):

- Absolute: The data are not preprocessed and are left as they are.

- Log normalization: The log of each element of the performance data matrix M is computed; we denote the normalized matrix by N. This technique was used by Nickolayev, Roth, and Reed in [13] to extend the dynamic range of the data.

\[ n^{a}[k,p] = \log\bigl(m^{a}[k,p]\bigr) \qquad (7.3) \]

- Min-max normalization: A linear transformation mapping each metric measurement to the (0,1) range [70, 72].

\[ n^{a}[k,p] = \frac{m^{a}[k,p] - \min_{k} m^{a}_{p}[k]}{\max_{k} m^{a}_{p}[k] - \min_{k} m^{a}_{p}[k]} \qquad (7.4) \]

- Dimension normalization: Each metric vector is divided by its Euclidean norm so that it is forced to lie on a hypersphere of unit radius [65].

\[ n^{a}[k,p] = \frac{m^{a}[k,p]}{\bigl\| m^{a}_{p} \bigr\|} = \frac{m^{a}[k,p]}{\sqrt{\sum_{k=0}^{K-1} \bigl( m^{a}[k,p] \bigr)^{2}}} \qquad (7.5) \]

To determine how good a normalization scheme is, we can project the multivariate data along the two principal components obtained from principal component analysis. We used the validation experiment with matrix-vector multiplication for this purpose. In this experiment we can identify two classes based on execution time: acceptable and poor performance. Good execution times are on the order of 10 seconds, while poor execution times are on the order of 100 seconds. Recall that a good feature should separate the classes as far as possible; when the data are projected along the two main components, a good normalization scheme should therefore separate the classes. The normalization schemes were applied to the validation data to test which one gives the best separation.

Results show that, with no normalization, the metrics with the largest ranges of values bias the results, as expected. Figure 7.5 shows the projection along the two principal components of the data. The two classes are mixed in the plot and cannot be easily distinguished.

Log normalization cannot be applied to our data set, which contains several values of zero; this type of normalization is applicable only when measurements are always greater than one. In the case of [13], it is applied to performance data filtered with a low-pass filter and averaged over a sliding window.

Figure 7.5. Two principal components of the validation data - no normalization (axes: first and second principal components).
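The min-max and dimension (Euclidean) normalizations compared in this section can be stated compactly in code. The sketch below applies Eqs. (7.4) and (7.5) column-wise to a performance data matrix; the small epsilon guard against constant or all-zero columns is our own addition, and the numbers are illustrative only.

import numpy as np

def min_max_normalize(M, eps=1e-12):
    """Map each metric column of M to the (0, 1) range, as in Eq. (7.4)."""
    col_min = M.min(axis=0)
    col_max = M.max(axis=0)
    return (M - col_min) / (col_max - col_min + eps)

def euclidean_normalize(M, eps=1e-12):
    """Divide each metric column by its Euclidean norm, as in Eq. (7.5),
    so every column lies on the unit hypersphere."""
    norms = np.linalg.norm(M, axis=0)
    return M / (norms + eps)

M = np.array([[101.2, 10.9, 3.1],
              [ 98.7, 11.4, 2.8],
              [142.5,  8.2, 7.6]])
N_minmax = min_max_normalize(M)
N_euclid = euclidean_normalize(M)
print(np.linalg.norm(N_euclid, axis=0))   # each column norm is ~1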
Min-max normalization was then applied to the data. Figure 7.6 shows the two main components of this data set for the validation experiment. The discriminatory effect of this normalization is visible in the figure: all values of class 1 are located at the left of the graph and all values of class 2 at the right. However, there are two small clusters of class 1, one at the top of the graph and one at the bottom; a desirable discriminatory separation should gather all elements of one class together.

Finally, Euclidean normalization was applied to the data. Figure 7.7 shows the projection along the two main components. It distinguishes the two classes and places all members of class 1 together (except for one outlier) and all members of class 2 together. This is a better normalization for the validation data.

Of the different normalization schemes, the one with the best discrimination power was Euclidean normalization, and it is appropriate for the types of data encountered in performance analysis. We have therefore used Euclidean normalization on all our data sets. The normalized matrix N is then analyzed using the other statistical methods.

Figure 7.6. Two principal components of the validation data - min-max normalization (axes: first and second principal components).

Figure 7.7. Two principal components of the validation data - Euclidean normalization (axes: first and second principal components).

7.3.3 Correlation Analysis

The degree of association between two variables, if any exists, is obtained through correlation, which measures the linear association among variables. No causal relationship can be inferred from correlated variables. This computation is not appropriate for nominal or ordinal variables. Some assumptions are required to hold for correlation analysis: the variables are random, the relationship is linear in nature, and the variables follow a normal distribution [9]. We define a random variable as a function whose domain is the sample space of an experiment and whose range is a subset of the real line [9, 67].

The product-moment correlation coefficient, or simply correlation coefficient, measures the linear correlation between pairs of variables and is denoted by ρ (rho). The estimate of the correlation coefficient between two variables $x = m^{a}[k, p_i]$ and $y = m^{a}[k, p_j]$ (two columns of the data matrix) is denoted by r and is computed as

\[ r = \frac{\sum_{i=1}^{K} (x_i - \bar{x})(y_i - \bar{y})}{(K-1)\, S_x S_y} \qquad (7.6) \]

where $\bar{x}$ is the mean of x, $\bar{x} = \frac{1}{K}\sum_{i=1}^{K} x_i$, $\bar{y}$ is the mean of y, $\bar{y} = \frac{1}{K}\sum_{i=1}^{K} y_i$, and $S_x$ and $S_y$ are the sample estimates of the standard deviations of x and y, respectively, that is,

\[ S_x = \sqrt{\frac{\sum_{i=1}^{K} (x_i - \bar{x})^{2}}{K-1}}, \qquad S_y = \sqrt{\frac{\sum_{i=1}^{K} (y_i - \bar{y})^{2}}{K-1}} \qquad (7.7) \]

and K is the number of observations, or in this case, the number of experimental runs [9]. The value of r ranges from -1 to 1, where -1 indicates perfect negative correlation, +1 indicates perfect positive correlation, and 0 indicates no correlation. Correlation indicates to what extent two variables vary simultaneously.
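As a small worked companion to Eqs. (7.6) and (7.7), the sketch below computes r between two metric columns and cross-checks it against numpy's built-in estimate. The data values are illustrative only.

import numpy as np

def pearson_r(x, y):
    """Sample correlation coefficient of Eq. (7.6), with the (K-1) denominators of Eq. (7.7)."""
    K = len(x)
    xbar, ybar = x.mean(), y.mean()
    sx = np.sqrt(((x - xbar) ** 2).sum() / (K - 1))
    sy = np.sqrt(((y - ybar) ** 2).sum() / (K - 1))
    return ((x - xbar) * (y - ybar)).sum() / ((K - 1) * sx * sy)

exec_time = np.array([101.2, 98.7, 142.5, 139.9])
idle_time = np.array([  3.1,   2.8,   7.6,   7.2])
print(pearson_r(exec_time, idle_time))          # strong positive correlation in this toy data
print(np.corrcoef(exec_time, idle_time)[0, 1])  # cross-check against numpy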
When more than two variables are involved, a correlation matrix is used. Let N denote the matrix containing the normalized observation data; rows of N contain observations and columns of N contain variables. Let

\[ S = X'X - \frac{1}{K}\,(X'\mathbf{1})(\mathbf{1}'X), \]

where X denotes the data matrix, $\mathbf{1}'$ denotes a unit row vector, $\mathbf{1}$ denotes a unit column vector, and $'$ denotes the transpose operation. Then

\[ R = \frac{1}{K-1}\, D_{S}^{-1}\, S\, D_{S}^{-1}, \]

where $D_{S}^{-1}$ is a diagonal matrix whose entries along the diagonal are the reciprocals of the standard deviations of the variables, that is, $D_{S}^{-1} = \mathrm{diag}\bigl( S_{x_1}^{-1},\, S_{x_2}^{-1},\, \ldots,\, S_{x_P}^{-1} \bigr)$ [68]. R is called the correlation matrix. Element (i, j) of R is the correlation coefficient between variables i and j; therefore R is symmetric and its diagonal elements are equal to 1.

If we use dimension normalization, the correlation matrix can be obtained either from matrix M or from matrix N, and the result is the same matrix R. This is not true for log or min-max normalization. If $x_1 = w_1 x$ and $y_1 = w_2 y$, then $\bar{x}_1 = w_1 \bar{x}$ and $\bar{y}_1 = w_2 \bar{y}$; also $S_{x_1} = w_1 S_x$ and $S_{y_1} = w_2 S_y$. Therefore, the computation of each r is the same for every pair of variables in the matrix.

Why is it interesting to obtain a correlation matrix from our data set? High positive or negative correlation clearly indicates that two variables are linearly related. A variable that is highly correlated with execution time is a natural target variable to observe, since there might be a causal relation between this variable and execution time itself. Also, if any coefficient off the diagonal of R has the value +1 or -1, the two variables involved are the same variable under different names, or one is a multiple of the other, and one of them should be eliminated from the analysis. Note that when the correlation coefficient of two variables is one, either the two columns are the same measurement or one is a linear combination of measurements; the data matrix is then singular and not invertible, and the solution of any linear system of equations relating these variables cannot be computed. Many statistical methods do not apply in this case.

We computed the matrix R from one of our data sets. A visual display of a correlation matrix obtained in one of our experiments is shown in Figure 7.8; it comes from the validation experiment using matrix-vector multiplication. In this case, the metrics most correlated with execution time were those related to paging activity (page faults and page reclaims) and the percentage of time the CPU was idle. In these cases the correlation was negative, indicating that as the metric value increased, the execution time decreased.

Figure 7.8. Visual display of the correlation matrix of the data obtained from the validation experiment with matrix-vector multiplication (both axes: performance metric index).

7.3.4 Multidimensional Metric Subset Selection

Recalling the definition of N from the previous section, the columns of N are normalized vectors containing the measurements obtained for each specific metric, and each element of such a vector corresponds to an observation. That is,

\[ N = \bigl[\; n^{a}[0] \;\; n^{a}[1] \;\; \cdots \;\; n^{a}[K-1] \;\bigr] \qquad (7.8) \]

where $n^{a}[k]$ is a column vector containing the set of normalized measurements for performance metric k (note that in this section K denotes the number of metric columns). For parallel code running on high performance systems, the number of metrics, K, is
The large number of measurements has been a problem faced in other areas such as artificial intelligence, pattern recognition, and data mining [73, 30, 31, 74]. The solution to this problem is not trivial and many approaches have been proposed. We have selected the method we believe is most appropriate for the problem of high-performance computer metrics data and automatic performance evaluation: feature subset selection. A feature is a basic primitive defining a problem [2]. A collection of features describes an application. Each column of matrix M is a feature and contains measurements for one performance metric. Therefore, we identify each metric as a feature. The amount of information contained in a full set of features may be occluded by the large amount of metrics. This problem can be solved either by two methods used in pattern recognition and data mining: feature subset selection and feature subset extraction. In the first method, a subset of the original set of features is selected. In the second, a combination of features is used to generate new features, which in turn contains a smaller number of features than the original one. Let F denote a set of features containing K features. The problem of feature selection would be defined as finding the optimal set of p features such that p < K (see Figure 7.9). What is considered to be optimal? It means to optimize a cost function J (.) selected by the objective of the selection. In feature extraction, the goal is to use a transformation W such that the new set WF of transformed features contains q features such that q < K and it is the best transformation in terms of optimizing a cost function J () In our case, feature selection is more suitable than feature extraction since feature selection is done in the measurement space, therefore, the physical meaning of a specific feature is not lost in the selection process. In feature extraction, a transformation is applied to the metrics involved in the calculation and therefore new metrics are created, not necessarily having a physical meaning. 76 All features Important features Selection _> Figure 7.9. Feature subset selection Feature selection has also been defined as the process of selecting relevant features for a particular task. A feature is relevant if when it is removed from the set, the set will deteriorate. This is a function of the measure selected as objective function or cost function [2]. In Artificial Intelligence, improving the learning process is the goal so relevant features are the ones that are required for learning [75]. From [76] the relevance of a feature is defined in terms of the classification problem. The feature selection process has several benefits. First, it reduces data redundancy. As explained before, when there is a problem of collinearity of two metrics, the matrix of data is not invertible, and some statistical methods do not apply. Second, it can aid in finding natural groups in the data [77]. Third, it allows a better understanding of the data. When feature selection is used for classification, it minimizes the curse of dimensionality. The curse of dimensionality refers to the fact that as the number of features increases, the number of observations required for classification grows exponentially, creating the need of enormous amounts of observations for proper classification. Feature selection can alleviate this problem. There are three basic questions we need to answer in order to apply feature selection to any set of data. 
First, we need to identify the search process for finding the optimal features. Second, what is the criteria for determining the best set, that is, what is the cost function to use for evaluation. Finally, what strategy is going to be used to add or delete 77 features to the current subset. These questions have been addressed in this study for the particular case of performance metrics. Notice that our goal is not to improve classification since we do not have classes in our data. Our main goal is to simplify our results so that it improves the comprehensibility of the data. Three basic search methods for feature selection [2, 78] are exhaustive search, heuristic search, and nondeterministic search. In exhaustive search, all possible solutions are exam- ined and compared. This method is time consuming and sometimes unfeasible, given the large amounts of data to be processed. Heuristic search or weak methods refer to a guided search where not all possibilities are exhausted. It may lose some optimal solutions but in general it finds good solutions. These searches are faster than exhaustive search. Fi- nally nondeterministic search refers to finding possible solutions at random and evaluating them. Given the amount of data in typical performance evaluation problems, exhaustive search becomes unfeasible due to time constraints. Both heuristic search and random search are applicable to our problem. Heuristic search was used in our case study since it is the most common method used in pattern recognition. Future analysis may include evaluating random search methods. The search direction in which a subset of features is generated depends on the final number of features desired in the subset (p) compared to the total number of features (K). If we do not have knowledge of p, any search direction is possible. There are three different possibilities for search direction: sequential forward search, sequential backward search, and bidirectional search. In sequential forward search, features are added, one at a time, to the final subset based in the desired cost function. This is an appropriate method if p << K making the search time smaller. In sequential backward search, features are discarded one at a time based on the cost function. Irrelevant features are discarded. When p ¢< K, the search time is smaller with this method. Finally, bidirectional search is appropriate when p is unknown. In our case, we need to identify p before starting the feature selection process. However, results show that the obtained values of p are much smaller than K, therefore sequential forward search is preferred over backward search methods. 78 The last question that remains unanswered is what cost function is appropriate for our problem? In order to answer this question, we have explored the possibilities from [2]. In this work, they have classified existing measures for feature selection. Figure 7.10 shows the taxonomy for measures used for feature selection methods. Feature Selection Measures Accuracy—based Class Separability - based Classic Consistency lnforrnation Distance Dependence Figure 7.10. Classification scheme Of feature selection measures [2] Accuracy measures are those based on the accuracy of the classifier used for data clas- sification. This means that they are based on a specific classifier. This does not apply for our case since we are not having a classification problem. For our case, class separability methods are more appropriate. 
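The three questions above can be captured in a short search skeleton. The sketch below is a generic sequential forward search into which any subset cost function J, such as the entropy measure introduced later in this section, can be plugged; it is an outline under that assumption, not the implementation used in the case study, and whether J is minimized or maximized depends on the chosen measure.

def sequential_forward_search(all_metrics, J, p):
    """Greedily grow a subset of p metrics, at each step adding the metric
    that optimizes the cost function J(subset). Here J is minimized."""
    selected = []
    remaining = list(all_metrics)
    while remaining and len(selected) < p:
        best = min(remaining, key=lambda m: J(selected + [m]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy cost: prefer metrics whose index sum is small (a stand-in for a real measure).
metrics = list(range(10))
print(sequential_forward_search(metrics, J=sum, p=3))   # -> [0, 1, 2]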
These are based on how to separate the data into its natural groups. These are subdivided into classic measures and consistency measures. From those, consistency measures are also discarded. It refers to maintaining a minimum consistent set where all instances are classified as one class without any inconsistencies. It has to do with classification itself. We are left with three classic measures: information, distance, and dependence. All these three are appropriate measures for our data. Information measures select the subset based on those features which minimize uncertainty. Distance measures are those which try to separate classes as far as possible using a distance function. Depen- dence measures select features on association with interesting variables. We have selected information measure for our study. 79 The feature selection problem can be further subdivided into supervised or unsupervised selection. There is vast literature in supervised feature selection methods [75, 76, 74, 2, 73, 30, 79]. Supervised feature selection refers to methods devised when we have data to train the system, that is, we know instances where there is a class associated to each instance, therefore we know what the correct classification of an instance is. Supervised learning is subdivided into two different models according to the measures used to obtain the metrics: wrapper and filter model. The wrapper model [76] uses the classifier accuracy as cost function. The filter model [80] is independent of the classification method. Measures based on distance or information are used and the results are based on the data itself. Both filter and wrapper methods are composed of four parts: feature generation, feature evaluation, stopping criteria, and testing. Feature generation and evaluation have been previously discussed. Stopping criteria is deciding when will the search process will stop. Will it be based on a threshold, a criterion, or a specific number of features to select? Testing refers to how to evaluate the accuracy of the results. The wrapper method can be viewed as a machine learning approach and the filter method can be seen as a data mining approach [2]. Even if there is vast literature in supervised selection, our problem is classified into unsupervised selection. We do not have a specific class associated to performance data metrics. A possible class associated to the data would be: acceptable performance and not acceptable performance. But this is very subjective. According to Liu and Motoda in [2], there are two different methods for unsupervised learning: clustering and entropy based methods. In clustering, features are grouped together according to some measure. Ahn and Vetter in [15] have used clustering to identify which metrics are relevant. When clustering is used for feature selection, an ordered list of features based on relevance is not obtained. Methods based on entropy can rank features and are used for unsupervised feature selection. We have used the entropy based method described in [81] since it is based on the concept of information about the system. 80 Entropy based methods In his classic paper [82], Shannon described a communication system as composed of source of information, a transmission medium, and a receiver. According to Shannon, a signal is an entity which carries information. Information is anything that can be sent from one point to another in the physical world. He also introduced a measure of the amount of information contained in a message: entropy. 
Given its random nature, described above, an observable computing system is a source of information. We would like to reconstruct the message that the system is giving us in order to improve the system's performance. Let X denote a discrete random variable. The possible values of X are $x_0, x_1, \ldots, x_{n-1}$ with probability mass function $P_X(x_i)$, where $P_X(x_i) = P[X = x_i] = p_i$. Entropy measures the average amount of uncertainty of the random variable and is computed by

\[ H(X) = -\sum_{i=0}^{n-1} p_i \log p_i, \qquad (7.9) \]

where n is the number of possible values of the random variable [83]. When the log is taken base two, information is measured in bits. Notice that the more probable a message is, the less information it contains; an unusual message carries more information than a regularly received one. Relative entropy measures the distance between two probability distribution functions. The joint entropy of random variables X and Y with distribution function p(x,y) is defined as $H(X,Y) = -\sum_{x}\sum_{y} p(x,y)\log p(x,y)$. The conditional entropy for the same distribution is defined as $H(Y|X) = -\sum_{x}\sum_{y} p(x,y)\log p(y|x)$. Mutual information is a measure of the amount of information one variable has about another. It is defined as $I(X;Y) = \sum_{x,y} p(x,y)\log \frac{p(x,y)}{p(x)\,p(y)} = H(X) - H(X|Y)$; that is, the amount of information contained in X minus the information remaining in X once Y is known. Also, $I(X;Y) = H(X) + H(Y) - H(X,Y)$.

The measure of entropy used in this work is

\[ E = -\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \Bigl( S_{ij} \log S_{ij} + (1 - S_{ij}) \log (1 - S_{ij}) \Bigr), \qquad (7.10) \]

where $S_{ij}$ is the similarity value of two instances, defined as $S_{ij} = e^{-\alpha D_{ij}}$, and $D_{ij}$ is the Euclidean distance between instances $x_i$ and $x_j$. The parameter $\alpha$ is an empirical value computed from the data and is defined as $\alpha = -\ln(0.5)/\bar{D}$, where $\bar{D}$ is the average distance over all the metrics. This was the cost function used for our search: we selected sequential forward search as the search strategy and entropy as the cost function. Results are shown in Chapter 8.

Now a question arises: how many metrics are required to explain the behavior of the system? Feature selection methods assume that the number of features to select, or, equivalently, the number of clusters to identify, k, has been previously defined. To answer this question we have turned to data dimensionality estimation techniques.

Dimensionality

Dimensionality has been studied for a long time [31, 32, 84, 85]. The dimension of the data determines the number of features required to represent the data. There are two main definitions of dimension used in pattern recognition: spanning dimension and intrinsic dimension [86, 31]. Intrinsic dimension comes from linear algebra and is defined as the smallest set of vectors required to span the data set; it is also known as the embedding dimension. The spanning dimension corresponds to the smallest number of parameters required to model the data without degrading the data set [31]. Intrinsic dimension is appropriate for determining the number of features representing the data [31]. The difficulty resides in estimating the intrinsic dimension from a data set.

A series of classical methods have been used to estimate intrinsic dimension. These are based on principal component analysis of the data.

- Cumulative Percentage of Total Variance: The most popular method for determining intrinsic dimension is the cumulative percentage of total variation method, also known as the K-L algorithm [87, 68]. It is based on the computation of
It is based on the computation of 82 the eigenvalues of the correlation matrix of the data. Each eigenvalue contributes to a percentage of the total variance. Those eigenvalues whose eigenvectors explain most of the variance are selected. The number of eigenvalues required to reach certain threshold in terms of percentage are selected as k. A typical threshold is 95% of the total variance. 0 Kaiser-Guttman: The eigenvalues of the correlation matrix of the data are com- puted. Those eigenvalues greater than one are selected and the number of eigenvalues greater than one is k [87, 88]. o Scree test: The eigenvalues of the correlation matrix of the data set are sorted in descendent order and plotted. The point where the curve flattens is selected as the cutoff point, and this is the number of principal components to select. This estimates the intrinsic dimensionality of the data and it is the value k [87]. The first two methods are appropriate for automatic performance analysis since the computation of It occurs without the programmer’s intervention. In scree test, a visual inspection of the graph showing the eigenvalues of the correlation matrix is required, making the method not automatic. All three methods were used in our data analysis for reference purposes. There are additional methods for dimensionality estimation to explore for future work. In recent literature, Dy and Brodley studied the order identification problem from the wrapper model point of view. They wrap the search of k around the clustering algorithm for unsupervised learning [29, 77, 89], computing k as well as the feature subset at the same time. We have not used these methods since they are based on a classifier and we are working with unsupervised data. Once the important metrics describing the system are identified, we use ANOVA to analyze the results. 83 7.3.5 ANOVA When an experiment is designed using DOE techniques, causal relations can be established among independent factors and results [9]. Analysis of Variance (ANOVA) is a technique used in this cases to determine if the differences in the results obtained are due to chance or to significant effects of the controlled factors (see Section A.5.2). We proceed to explain AN OVA for the 4-way factorial design and the split-plot designs used in our case-study. We will explain ANOVA using one of our experimental results. In experiment four, we conducted a validation experiments using matrix-vector multiplication algorithms. We de- signed a full factorial experiment with four factors: problem size, algorithm, data structure, and compiler options. There were two problem sizes, three algorithms, two data structures, and four compiler options. We wanted to study the effects of these factors on the set of computer performance metrics obtained from our system. Since there are four factors, the four-way ANOVA model for this experiment is Yr = ,U + Ar + Bj + Ck + Dz + AiBj -I- AiCk + All); + BjCk + Ble + (7.11) +CkD1-I- AiBjCk + AiBle + .410sz + 33'0sz + AiBjCle + 6 (7.12) where A; represents the effect of problem size, 33- effect of algorithm for matrix-vector multiplication, Ck effect of data structure, and D; the effect of compiler options. Additional terms account for the interactions among factors. One of the important performance metrics selected by the feature selection method previously defined in section 7.3.4 for this experiment was memory/free. We used the software SAS for data analysis and we are showing its output for illustrative purpose. 
We compared memory/free based on the factors previously described and we obtained the following results: Dependent Variable: memory_free Sum of Source DF Squares Mean Square F Value Pr > F 84 Model 47 45030792685 958101972 1.14 0.3308 Error 48 40488348470 843507260 Corrected Total 95 85519141155 R-Square Coeff Var Root MSE memory_free Mean 0.526558 2.253901 29043.20 1288575 Source DF Type I SS Mean Square F Value Pr > F Size 1 2833428803 2833428803 3.36 0.0730 CompOpt 3 227278711 75759570 0.09 0.9653 Size*Comp0pt 3 4874495705 1624831902 1.93 0.1379 Alg 2 10491710221 5245855110 6.22 0.0040 Size*A1g 2 918326201 459163100 0.54 0.5838 Comprt*A1g 6 770134103 128355684 0.15 0.9877 Size*Comp0pt*Alg 6 5079982917 846663820 1.00 0.4340 DataStr 1 871925920 871925920 1.03 0.3144 Size*DataStr 1 167768175 167768175 0.20 0.6576 CompOpttDataStr 3 1191135735 397045245 0.47 0.7041 SizetCompOpt*DataStr 3 1136874217 378958072 0.45 0.7190 AlgtDataStr 2 2635759532 1317879766 1.56 0.2201 Size*A1g*DataStr 2 1809587849 904793924 1.07 0.3502 CompOpttAlgtDataStr 6 5622169555 937028259 1.11 0.3701 Size*Comp0*A1g*DataS 6 6400215040 1066702507 1.26 0.2914 We perform our analysis at a — level = 0.05. Here, results show that the algorithm is the only factor affecting the variable memory/free which accounts for the pages of RAM in KBytes that are accessible when a process needs memory. Using Duncan’s test on the algorithm factor as post hoc test we get the following results: t Tests (LSD) for memory_free 85 Alpha 0.05 Error Degrees of Freedom 48 Error Mean Square 8.4351E8 Critical Value of t 2.01063 Least Significant Difference 14599 Means with the same letter are not significantly different. t Grouping Mean N Alg A 1298664 32 1 A A 1292889 32 2 B 1274171 32 3 Using contrasts to analyze the algorithm factor we get: Dependent Variable: memory_free Contrast DF Contrast SS Mean Square P Value Pr > F Comp. Alg 1 & 2 1 533689623 533689623 0.63 0.4303 Comp. Alg 2 k 3 1 5605473681 5605473681 6.65 0.0131 Comp. Alg 1 & 2 with 3 1 9958020598 9958020598 11.81 0.0012 This states that algorithm one and two are not significantly different from each other at the 0.05 level for variable memory/free. Algorithm three is significantly worse than algorithm 1 or 2. 86 The previously mentioned procedure is used when a factorial design is analyzed. How- ever, sometimes a full factorial design cannot be done for feasibility constraints. This was our case when doing the experiments with the application code. Randomizing problem size from experimental run to experimental run would have caused excessive amount of time. We used a split-split plot design of experiment [25]. The linear model for this experiment differs from the previous one in that an additional factor, block, needs to be added to the model and considered into the error terms in the model. The following SAS code shows the error terms added to the model where the variable block was incorporated to the model. proc anova; class block Size Alg CompOpt; model memory_free = block | Size | Alg l CompOpt ; test h=Size e=block*Size; test h=A1g SizetAlg e=block*Size*Alg; test h=Comp0pt A1g*Comp0pt Size*Comp0pt Size*Alg*Comp0pt e=block*Size*Alg*Comp0pt; run; quit; In the first experiment we compared the effects of thirteen compiler options, three prob- lem sizes, and two different multiplication algorithms on the result which was assessed by a set of metrics we selected. We have decided to run three replicates of the experiment. This yields 234 experimental runs for one experiment. 
The number of iterations for obtaining the solution of the iterative solver has been fixed to remove the impact of reduced matrix conditioning. In each one of our three blocks we select at random the problem size, then for each of these, at random we select the matrix multiplication algorithm used to solve the problem. Then in each one of these subplots we randomly select the compiler options used to produce the executable code. This means we have more precision in looking at the effect 87 of compiler options and least precision for problem size effect. 7.4 Summary We have presented statistical methods as a powerful tool in the analysis of performance data. Depending on whether the data comes from traces or from summaries, we can classify them as an output from a random process or a random variable. A performance data matrix format was specified for applying statistical methods to the data. Preprocessing techniques typically used in pattern recognition were tested on our data set to verify which one was more apprOpriate. Dimension normalization turned out to be the most effective preprocessing technique. Some of the techniques used were: correlation analysis, feature subset selection, and ANOVA. Correlation analysis establishes the linear relationship among variables. Feature subset selection was used to determine how many and which important performance metrics should be looked at using an entropy cost function. ANOVA established which controlled factors were causing variations on the performance metrics selected by the feature selection method. Post hoc analysis and analysis of means can provide additional information on the results after the null hypothesis is rejected. 88 CHAPTER 8 Results In this chapter we present the results obtained in a case study to test the proposed method- ology. We show results from four different experiments. Experiment one and two are used to characterize the observable computing system (OCS). Experiments three and four are used to validate the methodology. Reviewing the pr0posed methodology and its details, we now present a summary of the steps: 1. Preliminary problem analysis The case study code used in this research is called Prism and it implements a finite element boundary-integral (FE~BI) numerical method for the analysis of conformal antennas. According to the input parameters, the iterative solver method is selected by the code and preconditioning is either enabled or disabled. This application runs on a Sun Enterprise 450 and profiling pointed to a dense matrix-vector multiplication subroutine as the most time consuming routine. Prism was parallelized using OpenMP directives. The design factors selected for experimentation were: compiler options, problem size, and algorithm. 2. Design of experiment Two different experiment designs were used. The first type was a split-split plot design and the second type was a fully-randomized full-factorial design. The experiments were: 89 0 Experiment 1: Parallel implementation of Prism Experiment 2: Serial implementation of Prism 0 Experiment 3: Inefficient memory access pattern in Prism, validation experiment. Experiment 4: Matrix-vector multiplication kernel, validation experiment. 3. Data Collection We used both software and operating system instrumentation. Software instrumenta- tion was done using the KAP / Pro statistical library. Operating system data collection was done using the unix commands sar, iostat, and vmstat. 4. 
Data Analysis Perl scripts extracted the data and converted it to a format used by two widely used statistical packages: Minitab 13 and SAS v8. The extracted data was normalized using dimension normalization with Euclidean norm. Correlation analysis was used to determine the most correlated metrics with execution time. We estimated the intrinsic dimension of the data using three commonly used estimators: scree test, KC, and cumulative percentage of variance. Sequential forward search with entropy cost function was used to determine the most important metrics. Once the important metrics were identified, ANOVA and post hoc comparisons were used to establish which factors affected important metrics and reach conclusions. Results of these experiments are presented in the following sections. 8.1 Experiment 1: Parallel Implementation of Prism The first experiment was used to characterize the interactions between our application and the system. Prism was parallelized using OpenMP constructs. As described in section 5.4.1, two different algorithms for matrix-vector multiplication were used, thirteen compiler options and three different problem sizes were tested. The experiment design was done 90 using a split—split plot design and the actual order of execution of each experimental run is shown in appendix D. We ran 234 experimental runs and 47 different values were measured. 8.1.1 Correlation Analysis. Those metrics most correlated with execution time, with correlation higher than 0.9, are shown in table 8.1, where the correlation was negative in all cases. Negative correlation is interpreted as follows: execution time increases when the metric value decreases. Table 8.1. Metrics with largest correlation with execution time in experiment 1. LOrder T Label 1 Description [ Category Correlation] 1 lwrit/s Accesses of system buffer Buffer Activity -0.965 cache to write 2 lread/s Accesses of system buffer Buffer Activity -0.965 cache to read 3 COtOdO/wps Write per second per disk I/ O -0.960 4 c0t0d0/util Percentage of disk utiliza- I/O -0.958 tion per disk 5 disk/SO Disk operations per second Disk -0.948 6 page/mf Minor faults in units per Paging activity -0.910 second 7 vflt/s Address translation page Paging activity -0.908 faults per second All correlations shown in table 8.1 are high and significant. The two most correlated metrics are access to buffer cache to read and to write. These report logical I / 0 requests and occur if a program opens device for I/O. Then the next two metrics are also related to I/O for a specific disk. These two measurements are writes to disk per second and percentage of disk utilization. Similarly, metric 5 is disk operations per second. As we can see from this pattern, it shows that this application’s bottleneck is I/O or disk access. The last two metrics are related to paging activity. Ffom inspecting the code, we understand that this is a typical behavior since a dense matrix-vector multiplication algorithm dominates the computation and the matrices involved in this application are extremely large. 91 8.1.2 ANOVA Three-way AN OVA at significance level a = 0.05 was performed. ANOVA analyzes the effect of qualitative factors on one dependent variable. Table 8.2 shows ANOVA results for those metrics obtained in table 8.1. This table shows whether the factors had an effect or not on the variable. Table 8.2. Effect of factors and interactions on the most correlated metrics with execution time for experiment 1. 
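A ranking such as Table 8.1 can be generated directly from the performance data matrix. The sketch below sorts metrics by the absolute value of their correlation with execution time and reports the signed coefficient; it is a schematic reconstruction with placeholder column names and synthetic data, not the script used for the experiment.

import numpy as np

def rank_by_correlation(M, metric_names, target_col, threshold=0.9):
    """Return metrics whose |correlation| with the target column exceeds threshold,
    sorted by decreasing |r|."""
    R = np.corrcoef(M, rowvar=False)
    rows = []
    for j, name in enumerate(metric_names):
        if j == target_col:
            continue
        r = float(R[target_col, j])
        if abs(r) >= threshold:
            rows.append((name, round(r, 3)))
    return sorted(rows, key=lambda t: -abs(t[1]))

# Placeholder data: columns are ExTime plus a few OS metrics.
names = ["ExTime", "lwrit/s", "lread/s", "page/mf"]
M = np.random.default_rng(1).random((20, 4))
M[:, 1] = -0.9 * M[:, 0] + 0.02 * M[:, 1]     # synthetic negative correlation with ExTime
print(rank_by_correlation(M, names, target_col=0))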
8.1.2 ANOVA

Three-way ANOVA at significance level α = 0.05 was performed. ANOVA analyzes the effect of qualitative factors on one dependent variable. Table 8.2 shows ANOVA results for the metrics listed in table 8.1; it indicates whether or not each factor had an effect on the variable.

Table 8.2. Effect of factors and interactions on the metrics most correlated with execution time for experiment 1.
  Item | Name           | Size (S) | Algorithm (A) | Compiler option (C) | S*A | S*C | A*C | S*A*C
  0    | execution time | Yes      | Yes           | Yes                 | No  | No  | Yes | No
  1    | lwrit/s        | No       | Yes           | Yes                 | No  | No  | No  | No
  2    | lread/s        | No       | Yes           | Yes                 | No  | No  | No  | No
  3    | c0t0d0/wps     | No       | Yes           | Yes                 | No  | Yes | No  | No
  4    | c0t0d0/util    | No       | Yes           | Yes                 | No  | No  | No  | No
  5    | disk/s0        | Yes      | Yes           | Yes                 | No  | Yes | No  | No
  6    | page/mf        | Yes      | Yes           | Yes                 | No  | Yes | No  | Yes
  7    | vflt/s         | Yes      | Yes           | Yes                 | No  | Yes | No  | Yes

Recall that all the metrics shown are correlated with execution time; therefore we also include an ANOVA analysis of execution time itself. It is interesting to notice that the choice of algorithm and of compiler options significantly affects every metric correlated with execution time. Following this analysis, we proceed to obtain the number of metrics required to describe the behavior of the system. We normalized the data using the Euclidean norm and then estimated the intrinsic dimension of the data.

8.1.3 Dimensionality

In section 7.3.4 we described three different methods to estimate the intrinsic dimension of the data set: cumulative percentage, Kaiser-Guttman (K-G), and the scree test. Figure 8.1 shows a plot of the eigenvalues of the resulting correlation matrix for this experiment; this graph illustrates the scree test and the Kaiser-Guttman criterion (eigenvalues greater than one).

[Figure 8.1. Eigenvalues of the correlation matrix in experiment 1, plotted against eigenvalue number.]

Notice the change in the slope of the curve at five eigenvalues and at eight eigenvalues. The scree curve may have two or three inflection points, and this is one of those cases. Notice also that nine eigenvalues are greater than one; this is the K-G criterion. Table 8.3 shows how many metrics should be kept to preserve the variability of the data according to the three estimation methods. For this data set, nine metrics can explain the variability of the data.

Table 8.3. Number of metrics to keep the variability of the current data according to three different criteria for experiment 1.
  Test                         | Estimated intrinsic dimension
  Scree test                   | 8
  Cumulative percentage (95%)  | 9
  K-G                          | 9
  Maximum of the three methods | 9

To validate these tests for intrinsic dimensionality estimation, we created a synthetic data set with a random number generator. Nine columns were generated at random; subsequent columns are multiples of the first nine columns with noise added to them. The columns of this matrix were then manipulated to have the same mean and variance as the data matrix obtained in this experiment. Using this matrix as input to the three estimators of dimensionality, all three methods found that nine is the dimension of the data set. Appendix I shows the code used for this test. Figure 8.2 shows the scree test and the K-G criterion for the synthetic data set.

[Figure 8.2. Eigenvalues of the correlation matrix for the synthetic data, plotted against eigenvalue number.]
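The three estimates can be computed from the eigenvalues of the correlation matrix of the (runs x metrics) data. The Python sketch below is a minimal illustration of that computation under the assumption that the data are already normalized; the scree elbow is left to visual inspection, as was done here.

import numpy as np

def dimension_estimates(data: np.ndarray, variance_kept: float = 0.95):
    corr = np.corrcoef(data, rowvar=False)              # metrics are columns
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]   # eigenvalues, descending
    kaiser_guttman = int(np.sum(eigvals > 1.0))         # K-G: eigenvalues greater than one
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    cum_percentage = int(np.searchsorted(cumulative, variance_kept) + 1)
    return eigvals, kaiser_guttman, cum_percentage

# eigvals can be plotted against their index to read off the scree elbow;
# the dimension used in this chapter is the maximum of the three estimates.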
8.1.4 Metric Selection

Sequential forward search (SFS) was applied to determine which subset of metrics preserves most of the data variability. Table 8.4 shows the metrics with the highest information content for this experiment, as selected by the SFS algorithm. The cost function used for the search was entropy, as described in [81].

Table 8.4. Metrics with highest information content in experiment 1.
  Item | Name        | Description                                                                | Category
  1    | memory/free | Usage of virtual and real memory; free size of the free list (Kbytes)     | Virtual memory statistics
  2    | cpu/sy      | Percentage of time in system mode                                          | CPU utilization
  3    | bwrit/s     | Writes per second of data from system buffers to disk                      | Buffer activity
  4    | %wcache     | Cache hit ratio for writes, as a percentage                                | Buffer activity
  5    | cpu/us      | Percentage of time in user mode                                            | CPU utilization
  6    | page/fr     | Paging activity in units per second; kilobytes freed                       | Paging
  7    | vflt/s      | Address translation page faults per second                                 | Paging activity
  8    | atch/s      | Page faults per second satisfied by reclaiming a page currently in memory  | Paging activity
  9    | %wio        | Portion of time running idle with some process waiting for block I/O       | CPU utilization

Notice that this method selected a variety of metrics rather than only I/O- or memory-related ones. This time we have measurements from virtual memory statistics, CPU utilization, buffer activity, and paging.

8.1.5 ANOVA

Once the metrics have been selected, we analyze which of the studied factors significantly affect them. The three factors studied in this case were problem size, algorithm, and compiler options. Table 8.5 shows the analysis of variance results for these metrics.

Table 8.5. ANOVA on the metrics shown in table 8.4. Main effects.
  Factor              | Metrics affected by the factor
  Size (S)            | cpu/sy, vflt/s
  Algorithm (A)       | cpu/sy, bwrit/s, vflt/s
  Compiler option (C) | memory/free, cpu/sy, bwrit/s, %wcache, cpu/us, vflt/s

We notice that even though kilobytes freed in paging activity, attaches per second, and portion of time idle waiting for I/O were selected as important metrics, none of the studied factors affects them; other factors may be affecting these metrics. On the other hand, percentage of time in system mode and address translation page faults were affected by all three factors.

8.1.6 Another Method for Subset Selection

The independence of metrics can be used as the cost function to explain the variability of the data, since it is related to the amount of information contained in the performance data matrix. In the subset selection method suggested by Vélez and Jiménez [90], the criterion of independence between columns is used as the measure for subset selection. The features that are most independent and explain the most variability are selected based on principal component analysis (PCA) and singular value decomposition (SVD). Principal component analysis is a method used to project the actual variables onto new, uncorrelated variables. The SVD of a matrix A is A = U Σ V^T, where U and V are orthogonal and Σ = diag(σ_1, σ_2, ..., σ_r) with σ_1 ≥ σ_2 ≥ ... ≥ σ_r ≥ 0. The σ_i are called the singular values. The algorithm proposed in that work has been used in the past for unsupervised feature subset selection applied to hyperspectral imagery.
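One possible reading of such an SVD-based criterion is sketched below in Python. It is not the published algorithm of [90], only an illustration of the idea: for each of the k leading right singular vectors, keep the metric with the largest absolute loading that has not yet been chosen.

import numpy as np

def svd_metric_subset(data: np.ndarray, names: list, k: int) -> list:
    # Column-centre so that the singular directions reflect variability, not means.
    centred = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    chosen = []
    for direction in vt[:k]:                        # k leading right singular vectors
        for idx in np.argsort(-np.abs(direction)):  # most-loaded metric not yet taken
            if names[idx] not in chosen:
                chosen.append(names[idx])
                break
    return chosen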
Table 8.6 shows the metrics with the highest variability for this experiment, as selected by the SVD algorithm.

Table 8.6. Metrics with highest information content selected by SVD for experiment 1.
  Item | Name        | Description                                                                | Category
  1    | memory/free | Usage of virtual and real memory; free size of the free list (Kbytes)     | Virtual memory statistics
  2    | pflt/s      | Page faults from protection errors per second (illegal access to page)    | Paging activity
  3    | page/re     | Paging activity in units per second; page reclaims                        | Paging
  4    | c0t1d0/wps  | Writes per second per disk                                                 | I/O
  5    | %wio        | Portion of time running idle with some process waiting for block I/O      | CPU utilization
  6    | page/sr     | Paging activity in units per second; pages scanned by the clock algorithm | Paging
  7    | page/pi     | Paging activity in units per second; kilobytes paged in                   | Paging
  8    | page/po     | Paging activity in units per second; kilobytes paged out                  | Paging
  9    | faults/cs   | Trap/interrupt rates per second; CPU context switches                     | Memory faults

The selected metrics describe activity that experts usually look for when tuning a program: paging activity, CPU utilization, memory faults, and virtual memory statistics. Table 8.7 shows the analysis of variance results for these metrics. Only two of the metrics selected by SVD were also selected by SFS.

Table 8.7. ANOVA on the metrics shown in table 8.6. Main effects.
  Factor              | Metrics affected by the factor
  Size (S)            | faults/cs
  Algorithm (A)       | faults/cs
  Compiler option (C) | memory/free, page/po, faults/cs

It stands out that CPU context switches are affected by all three studied factors, while algorithm and problem size do not affect any of the other metrics.

8.2 Experiment 2: Serial Implementation of Prism

Here the application used similar algorithms as in the parallel experiment but without OpenMP calls. All other factors remained the same. The two algorithms used in this experiment, identified as Algorithms D and E, are shown in Appendix C. The actual order of execution of each experimental run is shown in appendix E.

8.2.1 Correlation Analysis

Once again, the metrics most correlated with execution time were computed. Table 8.8 shows the metrics with correlation magnitude higher than 0.9 with execution time.

Table 8.8. Metrics with largest correlation with execution time for experiment 2.
  Rank | Label        | Description                              | Category        | Correlation
  1    | c0t0d0/wps   | Writes per second per disk               | I/O             | -0.985
  2    | disk/s0      | Disk operations per second               | Disk            | -0.985
  3    | lwrit/s      | Accesses of system buffer cache to write | Buffer activity | -0.981
  4    | c0t0d0/util  | Percentage of disk utilization per disk  | I/O             | -0.981
  5    | lread/s      | Accesses of system buffer cache to read  | Buffer activity | -0.980

Notice that all of these were also highly correlated with execution time in experiment 1. In this case, regardless of whether the application runs serially or in parallel, the metrics with a linear relation to execution time are the same.

8.2.2 ANOVA

For the metrics most correlated with execution time, ANOVA was used to determine which of the factors affect them. Table 8.9 shows the ANOVA results for these metrics.

Table 8.9. Effect of factors and interactions on the metrics most correlated with execution time in experiment 2.
  Item | Name           | Size (S) | Algorithm (A) | Compiler option (C) | S*A | S*C | A*C | S*A*C
  0    | execution time | Yes      | Yes           | Yes                 | No  | Yes | Yes | No
  1    | c0t0d0/wps     | No       | No            | No                  | No  | No  | No  | No
  2    | disk/s0        | No       | No            | No                  | No  | No  | No  | No
  3    | lwrit/s        | No       | Yes           | Yes                 | No  | No  | Yes | No
  4    | c0t0d0/util    | No       | Yes           | Yes                 | No  | No  | Yes | No
  5    | lread/s        | No       | Yes           | Yes                 | No  | No  | Yes | No

We notice that problem size does not affect any of the metrics correlated with execution time, and that disk writes per second (c0t0d0/wps) and disk operations per second (disk/s0) are not affected by any of the studied factors. As in experiment 1, execution time is affected by all three studied factors, but in this case there is an interaction between problem size and compiler options. This means that compiler options do not cause the same behavior as the problem size varies.
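The three-way ANOVA applied throughout this chapter can be reproduced for any single metric along the lines of the following Python sketch. The column names size, algorithm, and copt are illustrative assumptions about how the runs are stored; the actual analyses reported here were carried out in Minitab and SAS.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

def three_way_anova(runs: pd.DataFrame, metric: str) -> pd.DataFrame:
    # Full model: the three main effects plus all of their interactions.
    # Q() quotes the column name, since metric labels such as "cpu/sy" contain slashes.
    model = ols(f'Q("{metric}") ~ C(size) * C(algorithm) * C(copt)', data=runs).fit()
    table = sm.stats.anova_lm(model, typ=2)
    # A factor or interaction is declared significant when its p-value is below 0.05.
    table["significant"] = table["PR(>F)"] < 0.05
    return table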
8.2.3 Dimensionality

The three methods explained previously were used to estimate the intrinsic dimensionality of the data. Table 8.10 shows the number of metrics to keep in order to preserve the variability of the data; the maximum of the three estimates was used as the dimension of the data.

Table 8.10. Number of metrics to keep the variability of the data according to three different criteria in experiment 2.
  Test                         | Estimated intrinsic dimension
  Scree test                   | 7
  Cumulative percentage (95%)  | 6
  K-G                          | 8
  Maximum of the three methods | 8

8.2.4 Metric Selection

Using SFS and the results from Table 8.10, the metrics shown in Table 8.11 were obtained as the most relevant ones. The cost function used in this search is entropy.

Table 8.11. Metrics with highest information content in experiment 2.
  Item | Name        | Description                                                            | Category
  1    | memory/free | Usage of virtual and real memory; free size of the free list (Kbytes) | Virtual memory statistics
  2    | pswch/s     | Process switches                                                       | System swapping activity
  3    | %wcache     | Cache hit ratio for writes, as a percentage                            | Buffer activity
  4    | cpu/sy      | Percentage of time the system spent in system mode                     | CPU utilization
  5    | bwrit/s     | Writes per second of data from system buffers to disk                  | Buffer activity
  6    | faults/cs   | CPU context switches; interrupts per second                            | Memory faults
  7    | faults/in   | Non-clock device interrupts; interrupts per second                     | Memory faults
  8    | pgout/s     | Page-out requests per second                                           | Paging activity

Comparing this table with table 8.4, half of the metrics were also selected for the parallel code, but process switches, memory faults, and page-out requests are additional metrics in this case. Table 8.12 shows the ANOVA results for these metrics. From these results we observe that process switches, percentage of time in system mode, and CPU context switches are not affected by any of the studied factors.

Table 8.12. ANOVA on the metrics shown in table 8.11. Main effects.
  Factor              | Metrics affected by the factor
  Size (S)            | %wcache, bwrit/s
  Algorithm (A)       | faults/in, pgout/s
  Compiler option (C) | memory/free, bwrit/s, faults/in
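A minimal sketch of sequential forward search with an entropy-style cost, as used in sections 8.1.4 and 8.2.4, is given below in Python. It is not the estimator of [81]: each candidate subset is scored by the Shannon entropy of a coarse joint histogram of its (normalized) values, and the metric that raises the score the most is added until the estimated dimension is reached.

import numpy as np

def entropy_score(subset: np.ndarray, bins: int = 8) -> float:
    # Discretize each column into equal-width bins and count joint symbols;
    # counting tuples avoids allocating a full k-dimensional histogram.
    codes = []
    for col in subset.T:
        lo, hi = col.min(), col.max()
        if hi == lo:                                  # constant metric: a single symbol
            codes.append(np.zeros(col.shape[0], dtype=int))
            continue
        edges = np.linspace(lo, hi, bins + 1)
        codes.append(np.clip(np.digitize(col, edges[1:-1]), 0, bins - 1))
    _, counts = np.unique(np.stack(codes, axis=1), axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def sequential_forward_search(data: np.ndarray, names: list, k: int) -> list:
    selected, remaining = [], list(range(data.shape[1]))
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda j: entropy_score(data[:, selected + [j]]))
        selected.append(best)
        remaining.remove(best)
    return [names[j] for j in selected]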
Using the method presented in [90], the metrics shown in Table 8.13 were obtained as the most relevant ones. These metrics describe buffer and paging activity, virtual memory statistics, and CPU utilization. This time, execution time itself was selected as a relevant metric.

Table 8.13. Most important metrics for experiment 2 according to SVD.
  Item | Name           | Description                                                                                      | Category
  1    | atch/s         | Page faults per second satisfied by reclaiming a page currently in memory (attaches per second) | Paging activity
  2    | pflt/s         | Page faults from protection errors per second (illegal access to page)                          | Paging activity
  3    | bread/s        | Reads per second of data to system buffers from disk                                            | Buffer activity
  4    | memory/free    | Usage of virtual and real memory; free size of the free list (Kbytes)                           | Virtual memory statistics
  5    | page/po        | Kilobytes paged out per second                                                                   | Paging
  6    | pgin/s         | Page-in requests per second                                                                      | Paging activity
  7    | cpu/wt         | Percentage of time the system has spent waiting for I/O                                          | CPU utilization
  8    | execution time | Total execution time                                                                             | Overall

Table 8.14 shows the ANOVA results for these metrics. Notice that only execution time is affected by all three factors.

Table 8.14. ANOVA on the metrics shown in table 8.13. Main effects.
  Factor              | Metrics affected by the factor
  Size (S)            | execution time
  Algorithm (A)       | cpu/wt, execution time
  Compiler option (C) | atch/s, memory/free, cpu/wt, execution time

8.3 Experiment 3: Inefficient Memory Access Pattern Algorithm

In this experiment we test algorithms A, B, and C as described in appendix C. Algorithm C purposely implements an inefficient matrix-vector multiplication: it accesses rows and columns in reverse order, resulting in a reduction in data locality. This algorithm is used to validate the results by exposing metrics related to memory access. All other factors have the same levels as in the previous experiments. The actual order in which the experimental runs were executed is shown in appendix F.

8.3.1 Correlation Analysis

Table 8.15 shows the metrics with correlation magnitude higher than 0.9 with execution time. These are the same metrics found in experiment one. We perceive a pattern in the metrics most correlated with execution time, since they are approximately the same metrics; however, we cannot generalize, since all three examples come from the same code and application. We should study a different application to make a fair comparison.

Table 8.15. Metrics with largest correlation with execution time for experiment 3.
  Rank | Label    | Description                                | Category        | Correlation
  1    | lwrit/s  | Accesses of system buffer cache to write   | Buffer activity | -0.9656
  2    | lread/s  | Accesses of system buffer cache to read    | Buffer activity | -0.9612
  3    | page/mf  | Minor faults per second                    | Paging          | -0.9235
  4    | vflt/s   | Address translation page faults per second | Paging activity | -0.9214

8.3.2 ANOVA

ANOVA at a significance level of 0.05 was computed. Table 8.16 shows the ANOVA results for the metrics obtained in Table 8.15.

Table 8.16. Effect of factors and interactions on the metrics most correlated with execution time for experiment 3.
  Item | Name           | Size (S) | Algorithm (A) | Compiler option (C) | S*A | S*C | A*C | S*A*C
  0    | execution time | Yes      | Yes           | Yes                 | No  | No  | Yes | No
  1    | lwrit/s        | No       | Yes           | Yes                 | No  | No  | Yes | No
  2    | lread/s        | No       | Yes           | Yes                 | No  | No  | Yes | No
  3    | page/mf        | Yes      | Yes           | Yes                 | No  | Yes | Yes | Yes
  4    | vflt/s         | Yes      | Yes           | Yes                 | No  | Yes | Yes | Yes

As in experiment one, the selection of algorithm and compiler options affects the metrics most correlated with execution time. All three factors affect execution time and paging activity. Notice that buffer activity is not affected by problem size.

8.3.3 Dimensionality

Table 8.17 shows the dimension estimated by all three methods. Nine metrics were estimated as necessary for this data set.

Table 8.17. Estimate of the intrinsic dimension of this data set.
  Test                         | Estimated intrinsic dimension
  Scree test                   | 7
  Cumulative percentage (95%)  | 9
  K-G                          | 9
  Maximum of the three methods | 9

8.3.4 Metric Selection

The metrics selected by SFS with the entropy cost function are presented in Table 8.18.

Table 8.18. Metrics with highest information content in experiment 3.
  Item | Name        | Description                                                             | Category
  1    | memory/free | Usage of virtual and real memory; free size of the free list (Kbytes)  | Virtual memory statistics
  2    | bwrit/s     | Writes per second of data from system buffers to disk                   | Buffer activity
  3    | %wcache     | Cache hit ratio for writes, as a percentage                              | Buffer activity
  4    | lwrit/s     | Accesses of system buffer cache to write                                 | Buffer activity
  5    | cpu/sy      | Percentage of time in system mode                                        | CPU utilization
  6    | cpu/id      | Percentage of time the system has spent idling                           | CPU utilization
  7    | page/po     | Kilobytes paged out per second                                           | Paging
  8    | pflt/s      | Page faults from protection errors per second (illegal access to page)   | Paging activity
  9    | de/wps      | Writes per second per disk                                               | I/O

A variety of metrics appear, including virtual memory statistics, buffer activity, CPU utilization, paging, and I/O related metrics. Table 8.19 shows the ANOVA results for these metrics. This time, the selected algorithm has an effect on almost all of the metrics.
This is as expected, since we designed this experiment to contrast three matrix-vector multiplication algorithms with different behaviors, one of which has a bad memory access pattern.

Table 8.19. ANOVA on the metrics shown in table 8.18. Main effects.
  Factor              | Metrics affected by the factor
  Size (S)            | cpu/sy
  Algorithm (A)       | memory/free, bwrit/s, %wcache, lwrit/s, cpu/sy, cpu/id, pflt/s, de/wps
  Compiler option (C) | memory/free, bwrit/s, %wcache, lwrit/s, cpu/sy, cpu/id, page/po

Using the method presented in [90], the metrics shown in Table 8.20 were obtained as the most relevant ones. Likewise, the algorithm factor affects a large number of metrics, as expected from the validation experiments. Compare this with Table 8.7, where only one metric is affected by the algorithm.

Table 8.20. Most important metrics for experiment 3 according to SVD.
  Item | Name       | Description                                              | Category
  1    | pflt/s     | Page faults from protection errors per second            | Paging activity
  2    | disk/s2    | Disk operations per second                               | Disk
  3    | cpu/wt     | Percentage of time the system has spent waiting for I/O  | CPU utilization
  4    | page/sr    | Pages scanned by the clock algorithm                     | Paging
  5    | c0t0d0/rps | Reads per second per disk                                | I/O
  6    | bread/s    | Reads per second of data to system buffers from disk     | Buffer activity
  7    | page/po    | Kilobytes paged out per second                           | Paging
  8    | faults/cs  | CPU context switches                                     | Memory faults
  9    | de/wps     | Writes per second per disk                               | I/O

Table 8.21 shows the ANOVA results for these metrics.

Table 8.21. ANOVA on the metrics shown in table 8.20. Main effects.
  Factor              | Metrics affected by the factor
  Size (S)            | faults/cs
  Algorithm (A)       | pflt/s, disk/s2, c0t0d0/rps, faults/cs, de/wps
  Compiler option (C) | c0t0d0/rps, page/po, faults/cs

We also observe that, for this experiment, compiler options have an effect on many different metrics, in contrast to the previous experiment. This is indicative of an interaction between algorithm and compiler option. Examining Appendix F, we find three metrics with an algorithm interaction: lwrit/s, cpu/sy, and cpu/id.

8.4 Experiment 4: Matrix-Vector Multiplication Tests

In this experiment we test only matrix-vector multiplication algorithms. This time the design of experiment is a fully randomized full factorial. Four factors are used: problem size, algorithm, data structure, and compiler options. The actual order in which the experimental runs were executed is shown in appendix G.
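Generating a fully randomized run order for such a factorial design is straightforward; the Python sketch below shows the idea with placeholder factor levels (the actual levels and order are those of appendix G).

import itertools
import random

sizes = ["small", "medium", "large"]                  # placeholder levels, not the real ones
algorithms = ["A", "B", "C"]
compiler_options = [f"opt{i}" for i in range(1, 6)]
data_structures = ["row-wise", "column-wise"]

runs = list(itertools.product(sizes, algorithms, compiler_options, data_structures))
random.seed(1)        # fixed seed so the run order can be reproduced
random.shuffle(runs)  # fully randomized execution order

for run_id, (size, algorithm, copt, structure) in enumerate(runs, start=1):
    print(run_id, size, algorithm, copt, structure)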
8.4.1 Correlation Analysis

Table 8.22 shows the metrics with correlation magnitude higher than 0.6 with execution time. Notice that the correlations are much lower than in the previous experiments, and the number of metrics correlated with execution time is drastically reduced. One possible explanation of this behavior is that the algorithm exercises only the memory usage of the system, while the complete application uses different aspects of the system.

Table 8.22. Metrics with largest correlation with execution time.
  Rank | Label    | Description               | Category | Correlation
  1    | page/mf  | Minor faults per second   | Paging   | -0.6746
  2    | page/re  | Page reclaims per second  | Paging   | -0.6343

8.4.2 ANOVA

ANOVA at a significance level of 0.05 was obtained. Table 8.23 shows the ANOVA results for the metrics obtained in Table 8.22; we also include execution time in the analysis.

Table 8.23. Effect of factors on the metrics most correlated with execution time for experiment 4.
  Item | Name           | Size (S) | Algorithm (A) | Compiler option (C) | Data structure (D)
  0    | execution time | Yes      | Yes           | Yes                 | Yes
  1    | page/mf        | Yes      | Yes           | Yes                 | Yes
  2    | page/re        | No       | Yes           | Yes                 | Yes

8.4.3 Dimensionality

Table 8.24 shows the dimension estimated by all three methods. Seven metrics were estimated as necessary for this data set.

Table 8.24. Estimate of the intrinsic dimension in experiment 4.
  Test                         | Estimated intrinsic dimension
  Scree test                   | 6
  Cumulative percentage (95%)  | 7
  K-G                          | 6
  Maximum of the three methods | 7

8.4.4 Metric Selection

The metrics selected by SFS with the entropy cost function are presented in Table 8.25. As in experiments one through three, we still observe variation in the types of metrics selected; we obtain metrics related to CPU utilization, virtual memory statistics, buffer activity, and paging activity. In contrast to the previous experiments, here no single type of metric dominates the results: in experiment 1, paging activity related metrics were represented more than others; in experiment 2, metrics associated with memory faults were more visible; and in experiment 3, buffer activity related metrics were relevant.

Table 8.25. Metrics with highest information content for experiment 4.
  Item | Name        | Description                                                            | Category
  1    | memory/free | Usage of virtual and real memory; free size of the free list (Kbytes) | Virtual memory statistics
  2    | %sys        | Portion of time running in system mode                                 | CPU utilization
  3    | memory/swap | Amount of swap space currently available                               | Virtual memory statistics
  4    | bwrit/s     | Writes per second of data from system buffers to disk                  | Buffer activity
  5    | page/re     | Page reclaims per second                                               | Paging
  6    | cpu/sy      | Percentage of time the system has spent in system mode                 | CPU utilization
  7    | pgout/s     | Page-out requests per second                                           | Paging activity

Table 8.26 shows the ANOVA results for these metrics. This time ANOVA shows that the selected factors affect the metrics, as expected; recall that we purposely selected a variety of levels causing significant effects in order to validate the methodology.

Table 8.26. ANOVA on the metrics shown in table 8.25. Main effects.
  Factor              | Metrics affected by the factor
  Size (S)            | %sys, bwrit/s, cpu/sy, pgout/s
  Algorithm (A)       | memory/free, %sys, page/re, cpu/sy, pgout/s
  Compiler option (C) | bwrit/s, page/re, pgout/s
  Data structure (D)  | %sys, bwrit/s, page/re, cpu/sy, pgout/s

Using the method presented in [90], the metrics shown in Table 8.27 were obtained as the most relevant ones. Here execution time was selected as important. As with the previous method, there is a large variety of selected metrics.

Table 8.27. Most important metrics for experiment 4 according to SVD.
  Item | Name        | Description                                            | Category
  1    | ExecTime    | Execution time                                         | Overall
  2    | pgin/s      | Page-in requests per second                            | Paging activity
  3    | c1t1d0/util | Percentage of disk utilization                         | I/O
  4    | bwrit/s     | Writes per second of data from system buffers to disk  | Buffer activity
  5    | ppgout/s    | Pages paged out per second                             | Paging activity
  6    | faults/cs   | CPU context switches                                   | Memory faults
  7    | page/po     | Kilobytes paged out per second                         | Paging

Table 8.28 shows the ANOVA results for these metrics; the studied factors have an effect on the resulting metrics.

Table 8.28. ANOVA on the metrics shown in table 8.27. Main effects.
  Factor              | Metrics affected by the factor
  Size (S)            | execution time, bwrit/s, ppgout/s, faults/cs, page/po
  Algorithm (A)       | execution time, ppgout/s, faults/cs, page/po
  Compiler option (C) | execution time, bwrit/s, ppgout/s, faults/cs, page/po
  Data structure (D)  | execution time, bwrit/s, ppgout/s, faults/cs, page/po

8.5 Analysis of Results

The results show some interesting findings. First, three metrics are selected as important across the different experiments: memory/free, bwrit/s, and cpu/sy. These indicate usage of virtual memory, writes to disk, and percentage of time the system is in system mode. Their relevance in different variations of the experiment indicates that they should be observed when performing experiments. Moreover, bwrit/s also shows up in the work by Ahn and Vetter [15]. This is a surprising result, since their application uses MPI on a distributed memory system while we use OpenMP on a shared memory system.

Second, the scree test does not provide a reliable estimate of the dimension of the data set. The graph may have two or three inflection points, making it hard to determine the estimated dimensionality of the data.

We also analyzed the percentage of metrics kept by the dimensionality estimator.

Table 8.29. Percentage of metrics kept for the analysis.
  Experiment | Orig. no. of metrics | Estimated dimension | % of metrics retained
  Exp 1      | 47                   | 9                   | 19.15%
  Exp 2      | 43                   | 8                   | 18.60%
  Exp 3      | 52                   | 9                   | 17.31%
  Exp 4      | 36                   | 7                   | 19.44%

Table 8.29 shows that approximately 18% of the metrics are retained as important.

When comparing SVD and SFS with an entropy cost function, SFS provided metrics more in accordance with what our experience would indicate as important than the SVD method did. This might be due to the use of an entropy cost function, which represents the amount of information contained in the data.

Once we know that a factor is affecting a metric, we can apply post hoc comparisons and analysis of means to study the causes. In experiment one, we noticed that execution time is affected by the compiler options. We used analysis of means along with the least significant difference (LSD) post hoc comparison to classify the compiler options. We obtained the following result in SAS:

t Tests (LSD) for ExecTime
Means with the same letter are not significantly different.

t Grouping     Mean     N   Comp Opt
A            1162.00    2   1
A            1160.00    2   3
B             663.00    2   13
B             662.50    2   2
B             623.50    2   4
B             616.50    2   11
B             615.50    2   6
B             602.00    2   5
B             601.50    2   8
B             601.50    2   7
B             600.00    2   10
B             599.50    2   12
B             598.00    2   9

Analyzing these results carefully, we observe that compiler options one and three correspond to combinations of flags not containing the -fast flag, while the remaining options contain the -fast flag. Therefore, in this particular case, the -fast flag is the only significant one. Compiler option 9 gives the best execution time. We conclude that some flags are more important than others and that the execution time depends on the compiler options used.
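An analogous grouping can be computed outside SAS; the Python sketch below uses Tukey's HSD from statsmodels as a stand-in for the LSD test, and the column names are assumptions about how the runs are stored.

import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def compare_compiler_options(runs: pd.DataFrame):
    # Pairwise comparisons of mean execution time across compiler options.
    result = pairwise_tukeyhsd(endog=runs["exec_time"], groups=runs["copt"], alpha=0.05)
    # Pairs whose interval excludes zero differ significantly; options that never
    # differ significantly from one another end up in the same group.
    return result.summary()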
8.6 Scientific Programmer Actions

Once the interpretation or evaluation of the observable values is available, this knowledge can be converted into suggestions on how to improve the system-software interactions. The methodology we have described is very general and applies to any observable computing system. An additional step is required for automatic performance tuning; however, this step is system dependent. We suggest one possible implementation for this action. It involves signal classification and knowledge from the analysis incorporated into a knowledge-based system.

As explained earlier in section 7.3.4, feature subset selection is used to obtain the relevant metrics describing the observable computing system. From this set of metrics, application signatures can be extracted describing the trajectory of the metrics. An application signature is a piecewise linear fit of the curve obtained from the trajectory of any given metric in the value-time plane. In the work by Lu and Reed [91], the authors suggest using a performance contract in which the observed signature is compared to a model signal to adaptively control a grid application; the comparison is made through a degree-of-similarity measure. Since the degree of similarity of application signatures can point to signatures that indicate possible sources of problems, we suggest using the information obtained from the degree-of-similarity algorithm together with the information from the statistical analysis of the proposed methodology to classify the type of problem in the system and prescribe a solution to the diagnosis. This information can be given to a knowledge-based system with a set of rules to prescribe a solution at the high level. Performance evaluation systems that contain a knowledge-based system for prescribing a solution include Kappa-Pi [92] and KOJAK [93], so their model can be followed.

8.7 Summary

This chapter has presented the results obtained for the four experiments performed. Correlations, dimension estimation, feature subset selection using sequential forward search with an entropy cost function, subset selection using SVD, ANOVA, and post hoc comparisons were presented. The validation experiments showed that the method indeed points out the causes of the variations introduced in the code.

CHAPTER 9

Conclusion

9.1 Research Summary

The efficient implementation of an application on an advanced architecture requires a tight integration between software and hardware. This task is particularly difficult to achieve due to the growing complexity of today's systems and the large number of factors that may affect performance. For instance, execution time is determined by factors such as programming style, programming paradigm, language, compiler, libraries, architecture, and algorithms [94]. These factors are typically selected by the application programmer without knowing their actual effect until the implementation is complete. After this initial step, a tuning process is initiated in which the implementation is improved until an acceptable level of performance is achieved.

A widely accepted tuning methodology is shown in Figure 9.1. It incorporates the application programmer's knowledge and expertise into the loop, causing several problems that hinder the widespread use of performance evaluation tools. The first problem is that scientific programmers are required to interact with instrumentation, analysis, and evaluation tools. Most of the time, they are experts in their respective fields but not in performance evaluation. If the converse is true, then the performance analyst might not have enough insight into the application to understand the relations between performance data and code.
Moreover, scientific programmers also need experience and in-depth knowledge of the particular computer system to tune their application.

[Figure 9.1. Typical analysis flow for tuning an application: high-level code (programming style, paradigm, languages, libraries, algorithm) runs on the computer system and is observed by instrumentation tools; the programmer uses analysis and evaluation tools on the performance data and modifies the code, a loop that burdens the programmer with tool experience, in-depth knowledge of the computer system, and an understanding of the relations between performance data and code.]

The expertise level required to tune an application to a particular platform limits the acceptance of performance tools in the scientific community. An alternative tuning methodology is proposed to overcome some of these problems. The key point of this methodology is to obtain the appropriate information. A diagram of the proposed alternative is shown in Figure 9.2. The main goal of this research was to obtain relevant information to improve the process of tuning applications on advanced architectures.

[Figure 9.2. Proposed approach for application tuning: experimentation over the high-level code, the computer system, and the instrumentation tools produces performance data; statistical analysis and a knowledge-based system turn that data into suggestions for the programmer within a problem solving environment. The dashed line marks the part of the tuning methodology addressed by this research.]

9.2 Contributions

The contributions of this work can be summarized as follows.

9.2.1 A Methodology for Obtaining Relevant Performance Information

A methodology for determining the relation between high-level factors and performance data was developed [95, 96]. This methodology is novel in two main aspects:

- The integration of statistical methods is used to establish relations in the mapping process.
- It removes from the scientific programmer the burden of interpreting performance data.

First, this methodology is the only one that combines several statistical procedures to relate factors to response variables in the performance data analysis problem. No other method has approached the problem from this perspective, in which associations are obtained through statistics. In previous literature, approaches to relating high-level factors to performance information either suggest to tool developers the type of information to be collected from the system at each level in the mapping process to establish relations [10], or try to estimate the information lost in the mapping process by incorporating knowledge about the behavior of the software and hardware [8]. These methods are not appealing because they still have portability problems and depend on user expertise.

Second, the scientific programmer is not required to interpret performance information from tools. The methodology obtains unbiased information and relates it to high-level factors. The methodology is composed of four steps: problem analysis, design of experiments, data collection, and data analysis, as illustrated in Figure 9.3.

[Figure 9.3. Proposed methodology to extract information in an observable computing system (OCS): preliminary problem analysis, design of experiments, data collection, and data analysis. Integration is the key to obtaining the information in an unbiased manner.]
A computational electromagnetics case study was used to illustrate the usefulness of this methodology. As an example, one of the experiments demonstrated that for this particular application, memory usage related metrics were important and were automatically selected by the method. Moreover, when execution time was examined, the analysis of compiler options showed that only the -fast flag had a significant effect on execution time. The proposed methodology may be incorporated into future automatic performance evaluation tools.

9.2.2 The Use of Design of Experiments for Performance Analysis Experimentation

We have identified a systematic way of performing the design of experiments (DOE) for performance analysis. Design of experiments refers to the planning of experiments to extract the most information with the minimum effort; it concerns the way in which treatments are administered to the subjects in a study. A correct design minimizes the effects of uncontrollable factors and determines whether variations in the response are significant or due to the random nature of the process. Several designs are available to the experimenter, from which we selected two appropriate for the performance evaluation problem:

- full-factorial design
- split-split plot design

When DOE is used in the experimentation step of the methodology, analysis of variance (ANOVA) can be used for data analysis. The combination of ANOVA with DOE allows conclusions to be reached about the effect of high-level factors on the performance metrics obtained by instrumentation tools. These conclusions are unbiased, based on probability, and removed from subjective judgments. The use of DOE also minimizes the effects of factors not considered during experimentation. Previous work applying DOE and ANOVA to the performance analysis problem was limited in the type of performance data considered, either execution time or CPI (cycles per instruction) [3, 11, 12, 19, 20], and of those, only the work by Alabdulkareem et al. analyzed large parallel codes. This research has shown that the use of screening experiments along with the proper design of experiment limits the number of factors in the experiment, making the entire experimentation more feasible.

9.2.3 The Usage of Data Reduction and Statistical Analysis

The last step in the proposed methodology is data analysis. Performance data collected during experimentation will not yield useful information until it is carefully analyzed. Our contribution to data analysis is the combination of dimensionality estimation, feature subset selection, and ANOVA to obtain information relevant to performance analysis when mapping algorithms to advanced architectures. The use of these techniques assists in locating, in an unbiased manner, sources of performance improvement. In data analysis each metric is considered a feature. Statistical methods are the basis for data analysis. Specifically, four statistical techniques were used in the selected case study:

- correlation analysis
- normalization
- dimension estimation
- feature subset selection

This is illustrated in Figure 9.4.

[Figure 9.4. Summary of statistical analysis techniques used for extracting information about performance outcomes: raw performance data are converted to a data matrix, normalized, and passed through correlation analysis, dimension estimation, and subset selection; ANOVA and post hoc comparisons then yield the information.]
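As a small illustration of the normalization step named above, the following Python sketch divides every metric (column) vector of the performance data matrix by its Euclidean norm before any other statistic is computed.

import numpy as np

def dimension_normalize(data: np.ndarray) -> np.ndarray:
    # One norm per metric; metrics are the columns of the (runs x metrics) matrix.
    norms = np.linalg.norm(data, axis=0)
    norms[norms == 0] = 1.0          # leave all-zero metrics unchanged
    return data / norms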
Correlation

The degree of association between two variables, if any exists, is obtained through correlation, which measures the linear association among variables. This measure was used to extract the metrics most associated with execution time and also to remove the collinearity problem present in software instrumentation data. Correlation analysis revealed that software instrumentation metrics exhibit collinearity. This implies redundant information content in the data, limiting the set of statistical methods applicable for its analysis.

Normalization

Most statistical techniques are biased by the magnitude or order of the data to be processed. We identified the need for data normalization before applying statistical methodologies, to minimize this bias [97]. Three different normalization schemes were tested on performance data:

- log normalization
- min-max normalization
- dimension normalization

Dimension normalization, in which each metric vector is divided by its Euclidean norm, was selected for performance data analysis based on a separability criterion.

Dimensionality Estimation

Dimensionality estimation along with sequential forward feature subset selection was used to identify which metrics are the most important among a large set of performance data [97]. Three methods for estimating intrinsic dimensionality were tested: cumulative percentage of total variance, Kaiser-Guttman, and the scree test. Intrinsic dimensionality estimation and unsupervised feature subset selection identified the metrics containing the most performance information. On average, only 18% of the metrics were found to be important.

Feature Subset Selection

Sequential forward search was used, and an entropy-based cost function was selected as the most appropriate for the type of data we are working with [96]. Entropy measures the amount of information content in the data. Multidimensional analysis methods have been used in the past to reduce the dimension of performance data. In their work [14], Vetter and Reed used statistical projection pursuit, a multidimensional projection technique, to identify "interesting" performance metrics from a monitoring system. In [15], Ahn and Vetter used several multivariate statistical techniques on hardware performance metrics to characterize high-performance computing systems; they specifically evaluated principal component analysis (PCA), clustering, and factor analysis for extracting performance information. None of these works used sequential forward search for the selection of important metrics or evaluated metrics based on the amount of information present in the data. Moreover, our work is the only one that studied the effect of different normalization schemes for performance data analysis and that estimated the dimensionality of the data before discarding relevant data.

9.3 Validation

In order to validate our results, two experiments were designed. In the first, an algorithm with an inefficient memory access pattern was used. This was purposely introduced to visualize the effect of algorithms on the metric values. The results of this experiment showed that metrics associated with disk writes and memory access were selected as the most important ones. In contrast to the previous experiments, the ANOVA results showed that most of the metrics were affected by the selected algorithm. This demonstrated that the chosen algorithm affects the most important metrics of the system.
The second validation experiment was designed as a fully randomized, full factorial test of matrix-vector multiplication algorithms. Four factors were studied: problem size, compiler options, algorithm, and data structure. Using screening experiments, we studied the compiler options to select those having a dissimilar effect on execution time. A large variety of metrics were selected as important by the proposed methodology, in contrast to the other experiments, for which metrics associated with memory access were the most important ones. This outcome assured that the subset selection mechanism was performing appropriately.

To validate the intrinsic dimensionality estimation tests, a synthetic data set was created with a random number generator. Nine columns were generated at random; subsequent columns were set as multiples of the first nine columns with noise added to them. The columns of this matrix were then manipulated to have the same mean and variance as the data matrix obtained in experiment one. When this synthetic performance data matrix was used as input to the three dimensionality estimators, all three methods concurred, indicating that the dimension of the data set was nine. Figure 8.2 shows the scree test and the K-G criterion for this synthetic data set. This outcome shows that the evaluated dimensionality estimators were consistent.

9.4 Conclusions

In summary, the application of the proposed methodology reveals that a detailed problem study preceding a systematic design of experiments yields useful data on which appropriate statistical tools can provide unbiased information about the application-system interactions. Moreover, the information obtained from this methodology can be converted into appropriate suggestions, observations, and guidelines for the scientific computing expert to tune applications to a particular computing system.

9.5 Future Work

The next step in the development of an automated performance evaluation system would be the design of a knowledge-based system with a set of rules to provide suggestions to scientific programmers. To obtain additional information about this data set, we might consider the assignment of classes to performance outcomes. Examples of classes that might be appropriate for this purpose include good and bad memory accesses, excessive idle time, large communication overhead, and so on. Once classes are assigned to particular experimental runs, a classifier might be designed for classifying incoming sets of performance metrics. Here, metric space reduction would be particularly useful to improve the accuracy of the classifier. Moreover, subset selection may be wrapped around the classification criteria.

Additional research ideas that have emerged from this work include:

- Establishing a comparison between different entropy estimators for performance data evaluation. This could improve the accuracy of the metric subset selection method.
- Evaluating the use of sequential backward search and oscillating methods for metric subset selection. This evaluation could also lead to improved accuracy in selecting important metrics.
- Establishing a comparison between hardware and software metrics for performance evaluation. Although this research has been based on software metrics, hardware metrics could provide alternate information content about the observable computing system.
- Comparing results between different architectures and programming paradigms.
This could highlight differences and similarities between them, providing additional insight into the automated performance evaluation problem.

APPENDICES

APPENDIX A

Foundations of Computational Science and Engineering

A.1 Mathematical Preliminaries

In this section we present the basic mathematical concepts which, together with other fundamental concepts in computer science, serve as the basis for the theoretical framework formulated throughout this work. We start by describing concepts such as information, signal, function, and vector space, and then continue with mathematical concepts associated with the scientific and engineering applications treated in this thesis.

In the most general sense, in this work, the concept of information is defined as anything which can be sent from one point to another in the physical world. A signal is defined as the entity which carries information; there cannot be a transfer of information from a given point to another without an associated signal. It is important to point out that we define a high performance computing machine, in the most general sense, as a computational structure with a well defined number of computing processors or nodes and an associated network topology.

Definition A.1 Cartesian Product of Two Sets. Let A and B be any two arbitrary sets. The Cartesian or direct product of the set A times the set B is a new set, denoted by A x B, defined as follows:

A \times B = \{ (a_k, b_l) : a_k \in A,\ b_l \in B \}.    (A.1)

The above expression is read as follows: A x B is the set formed by all ordered pairs (a_k, b_l) such that a_k belongs to the set A and b_l belongs to the set B. In general, the Cartesian product of N sets A_0, A_1, A_2, ..., A_{N-1} is a new set defined as follows:

A_0 \times A_1 \times \cdots \times A_{N-1} = \{ (a_{k_0}, a_{k_1}, \ldots, a_{k_{N-1}}) : a_{k_0} \in A_0,\ a_{k_1} \in A_1,\ \ldots,\ a_{k_{N-1}} \in A_{N-1} \}.    (A.2)

Definition A.2 Relation. Let A x B be the Cartesian product of the sets A and B. A relation p defined on this set is a proper subset of A x B, that is, p \subset A x B. We call a set G a proper subset of a set H if G is contained in H and G is neither the null set nor the set H itself. If p \subset A x B and (a_k, b_l) \in p, we say that a_k is related to b_l.

Definition A.3 Function. Let A x B be the Cartesian product of the sets A and B. A function f defined from A to B is a relation such that the first entry of every pair in the relation is unique; that is, it appears once and only once among the pairs of the relation. If the relation f \subset A x B is a function, we call the set A the domain of the function and the set B the co-domain of the function. We use the following notation to describe a given function f \subset A x B:

f : A \to B, \quad a_k \mapsto b_l = f(a_k).    (A.3)

Definition A.4 Natural Indexing Set. We define the set Z_N = \{0, 1, 2, \ldots, N-1\} as the natural indexing set of N objects.

Definition A.5 Mathematical Signal. A mathematical signal is defined as any mathematical function used to represent a physical signal. Not all physical signals admit a mathematical representation, and not all mathematical functions can be associated with the physical world (see Figure A.1).

[Figure A.1. Venn diagram of mathematical signals: the set of mathematical signals is the overlap between the set of mathematical functions and the set of physical signals.]

In this work we are interested in physical signals that admit mathematical representations. Most of the signals used in this work are of a random or statistical nature. A signal is called real or complex if its co-domain is the set of real numbers or the set of complex numbers, respectively.
A signal is said to be continuous if its domain is a continuous subset of the set of real numbers. We call a signal a discrete signal if its domain is in one-to-one correspondence with a subset of the set of integers. Finally, a signal is said to be a digital signal if its co-domain is a finite set.

Definition A.6 Metric Space. A metric space (X, d) is a set X with a map d : X \times X \to \mathbb{R}^{+} \cup \{0\} such that

1. d(x, y) = 0 \iff x = y
2. d(x, y) = d(y, x)
3. d(x, z) \leq d(x, y) + d(y, z) for all x, y, z.

The function d is called a metric on X [98].

Let A be the set of possible states in the computer system. A function f : A \to \mathbb{R} is called a measurement. A measurement describes a physical characteristic of the system under study.

A.1.1 Other Terms

The following terminology allows us to describe unequivocally the context of our work and its scope. A model is a mathematical expression describing the behavior of a system which can predict the observation based on an error measure; a model is good depending on the criterion selected to determine the modeling error. A system is a set of objects and their interrelationships according to a prescribed set of rules. An observable computing system, or OCS, is any given computing system with a defined set of observable measures. An observable is the physical manifestation of a given quantity or variable; an observable is capable of exchanging information between an observer and a system.

A stochastic process is an indexed family of random variables over the same sample space; the index is typically time. A stochastic process is mean-square ergodic in the mean if the corresponding time average converges to the ensemble average in the mean-square sense. A random sequence X[n] converges in the mean-square sense to the random variable X if E\{|X[n] - X|^2\} \to 0 as n \to \infty, where E\{\cdot\} denotes the expected value. The expected value of a discrete random variable is defined as

E\{X\} = \sum_i x_i P_X(x_i),    (A.4)

where P_X(x_i) denotes the probability mass function of the discrete random variable X and is defined as P_X(x_i) = P[X = x_i] [99].

A.2 Application

Our case study uses finite element analysis for a computational electromagnetic application. It uses an iterative solver with a diagonal preconditioner to find the solution.

A.2.1 Finite Element Analysis

In engineering, problems are sometimes not easily solved using analytical methods. Finite element analysis (FEM) is a numerical method used to solve problems involving differential equations in areas such as aerospace, automotive, civil, mechanical, and electrical engineering. These equations are transformed into a finite dimensional space for solution purposes. The general procedure of this method is to take a large system under study and divide it into smaller elements of finite dimensions, called finite elements. These elements are joined together to form the larger system through "nodes". Equations for individual elements are formulated and solved taking boundary conditions into consideration. The use of finite elements converts the problem into the solution of a system of linear equations.

A.2.2 Iterative Solvers

The solution of a large system of linear equations can be found using either direct or iterative solvers. Direct solvers determine the solution in a finite number of steps, while iterative solvers begin with an initial guess of the solution and iteratively improve it until a good enough solution is obtained. Iterative methods can be either stationary or nonstationary.
In stationary methods, the computation of the next step is based on a matrix-vector multiplication operation plus a vector addition; these operators do not vary from iteration to iteration. In nonstationary methods, the information required for the next approximation varies with each iteration [100]. Stationary methods include Jacobi and Gauss-Seidel. Nonstationary methods include the conjugate gradient (CG), generalized minimal residual (GMRES), biconjugate gradient (BiCG), conjugate gradient squared (CGS), and biconjugate gradient stabilized (Bi-CGSTAB) methods. The iterative methods implemented in the target code are BiCG, CGS, and Bi-CGSTAB.

The rate at which iterative methods converge to the solution depends on the eigenvalues of the coefficient matrix. The convergence rate can be improved through preconditioning, which transforms the system into an equivalent one with the same solution but different eigenvalues [100, 101]. A diagonal preconditioner is commonly used for this purpose. If A denotes the coefficient matrix in the linear system, the matrix for a diagonal preconditioner is formed by

c_{ij} = \begin{cases} a_{ii} & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}    (A.5)

A matrix-vector multiplication algorithm is the basis of the iterative method used in our case study. We now proceed to explain matrix-vector multiplication schemes.

A.2.3 Matrix-Vector Multiplication

Operations on vectors and matrices are the basis of our work. Let N be the set of natural numbers. For m, n \in \mathbb{N}, a rectangular array A of m x n elements belonging to a field F is called a matrix and is represented as

A = (a_{ij}) = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix},    (A.6)

where a_{ij} \in F for all i = 1, 2, ..., m and j = 1, 2, ..., n. The parameter m represents the number of rows and n the number of columns [102].

Special matrices. Let \mathbb{R}^n be the vector space of real n-vectors, so that x \in \mathbb{R}^n if x = [x_1, ..., x_n]^T with x_i \in \mathbb{R}; a column vector x is an n x 1 matrix with n components [55]. Let \mathbb{C}^n be the vector space of complex n-vectors; then x \in \mathbb{C}^n is a complex vector. A matrix is called square if its number of rows equals its number of columns. Let A be an n x n real matrix; A is symmetric if a_{ij} = a_{ji} [103]. Let B be an m x m complex matrix; B is called complex symmetric if b_{ij} = b_{ji}.

Definition A.7 Matrix-Vector Multiplication. Let A \in \mathbb{R}^{m \times n} and x \in \mathbb{R}^n, that is, A is an m x n real matrix and x is an n-element vector. The matrix-vector product is the m-element vector y = [y_i] with y_i = \sum_{j=1}^{n} a_{ij} x_j.

The matrix-vector multiplication operation can be viewed as an inner product or as a linear combination of vectors [104]. Consider the matrix A as a stack of row vectors:

A = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{bmatrix}.    (A.7)

Then the matrix-vector multiplication can be viewed as an inner product of each row vector a_i with the vector x. The following algorithm performs the multiplication as an inner product:

y = 0
for i = 1 to m
    for j = 1 to n
        y(i) = y(i) + a(i,j)*x(j)
    end
end

If the matrix A is stored by rows, as in the language C, this type of algorithm is favored. On the other hand, the operation can be viewed as a linear combination of the columns of A. Consider A to be a collection of column vectors:

A = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}.    (A.8)

Then the matrix-vector multiplication can be performed by multiplying each column vector a_j by the element x_j. The following algorithm performs the multiplication in this fashion:

y = 0
for j = 1 to n
    for i = 1 to m
        y(i) = y(i) + a(i,j)*x(j)
    end
end

This scheme favors matrices stored by columns, as in Fortran's convention.
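For illustration, the two orderings can also be written in Python with NumPy, whose arrays are row major (C order) by default: the i-outer form walks each row contiguously, while the j-outer form strides across rows, which is the access-pattern difference discussed in the next paragraph. This is a sketch for exposition, not the Fortran kernels used in Prism.

import numpy as np

def matvec_by_rows(a: np.ndarray, x: np.ndarray) -> np.ndarray:
    m, n = a.shape
    y = np.zeros(m)
    for i in range(m):            # inner product of row a_i with x
        for j in range(n):
            y[i] += a[i, j] * x[j]
    return y

def matvec_by_columns(a: np.ndarray, x: np.ndarray) -> np.ndarray:
    m, n = a.shape
    y = np.zeros(m)
    for j in range(n):            # linear combination of the columns of A
        for i in range(m):
            y[i] += a[i, j] * x[j]
    return y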
From the computational point of view, the organization of the matrix in memory and the scheme used to access memory locations affect the solution time. The effect of the size of cache memory on performance is of great importance: an example of Sun platform performance in [50] shows that a 96% cache hit rate may cause a program to run at half of its potential speed. When programming matrix-vector multiplication algorithms, the effect of the architecture is an important consideration, since different memory access patterns will favor or degrade performance.

A.3 Advanced Architectures

The computing power requirements of scientific applications have led to the development of different approaches for meeting the demands of processing speed, memory size and latency, and data input/output rates. Increases in performance have come from several advances [105]. Some of these advances relate to the following:

- Microprocessors have become faster through the use of instruction-level parallelism, multilevel caches, and faster clock speeds.
- Different schemes have been developed for effective interconnection between processors and memory.
- Users and compiler developers have learned how to use multiple processors and deep memory hierarchies.
- Software tools have been improved.

Technology changes fast, and processor architectures change in months [106]. In high-performance systems, efficient memory access schemes are needed. Multilevel caches are used to speed up data and instruction accesses. Instruction reordering and data prefetching are used to avoid latency caused by slow memories. This leaves compilers with the task of generating efficient code to take advantage of the hardware. Memory access in shared memory systems has to incorporate cache coherence mechanisms to avoid accessing invalid data from cache.

Processor architectures for high performance systems derive speed mainly from high clock rates, deep cache memory hierarchies, superscalar and superpipelined designs, out-of-order execution, and branch prediction schemes [107]. One example of a microprocessor for high performance is the Intel Itanium 2. The Itanium 2 is a 64-bit processor running at 1.3, 1.4, or 1.5 GHz, with three levels of cache: L1 is 32 KB, L2 is 256 KB, and L3 is 3, 4, or 6 MB. It is based on the EPIC (Explicitly Parallel Instruction Computer) architecture, which allows programmers or compilers to explicitly indicate parallelism to the processor [108]. The Itanium 2 has six arithmetic logic units and four memory ports, allowing two integer loads and two integer stores per cycle [109].

Interconnection networks and system architectures are two important issues for achieving high performance in computing systems. Among the architectures currently used for advanced systems are shared memory, distributed memory, distributed-shared memory, and grid computing. In shared memory systems, parallelism is implemented when processors write to and read from a global shared address space; cache coherence mechanisms are required to prevent problems with data consistency. Distributed memory, distributed-shared memory, and grid computing, on the other hand, are implemented through the use of interconnection networks. In distributed memory, processors collaborate on the solution of a problem by communicating via a local area network. One example of this type of architecture is a cluster.
Distributed-shared memory is a hybrid, combining shared memory with a distributed memory system. Each node is composed of a shared-memory collection of processors, and the nodes are then interconnected as a cluster.

Grid computing is another approach to obtaining high-performance systems. The idea behind grid computing is to build a highly powerful distributed system out of physically distributed computers, so that the best resources available can be brought together to jointly solve a problem. The system should be transparent to the users, who will concentrate on the solution of their respective problems and not on the computational requirements of the problem [110]. There are several issues in the development of such a global computing platform, such as software compatibility, high-performance networks, security, and user-friendly interfaces. Grid computing involves interconnecting high-performance networks, implementing a distributed file system, coordinating user access to different computational structures, and making the environment easy to use and transparent to the user.

A.4 Languages and Environments

Two main programming styles are used for programming in parallel. One follows the shared memory model and the other, the distributed memory model.

A.4.1 Shared Memory

OpenMP is an application program interface designed to support shared-memory multithreaded parallel programming. It is based on the fork-join model of parallelism, where multiple threads share the workload. OpenMP is a standard developed by a group of hardware, software, and application vendors, and it consists of a set of compiler directives and a library for shared memory parallelism. These can be used with C, C++, and Fortran. The advantage of OpenMP over other parallel libraries is its simplicity of use. Parallelism can be incorporated incrementally into a program that was originally designed to run in serial mode. In other models, such as message passing, the introduction of parallel constructs is more complex and time consuming.

A.4.2 Message Passing

Traditionally, message passing libraries have been used for parallel programming. These consist of a set of routines and libraries for point-to-point communication among processors. The standard for message passing libraries is called MPI. MPI implements a distributed memory paradigm. The two most widely used implementations of MPI are MPICH and LAM. MPI is used with C, C++, or Fortran. It is based on point-to-point communication between processors and provides both blocking and non-blocking communication operations.

A.4.3 Problem Solving Environments

A problem solving environment (PSE) is an integrated computing system for developing and running applications in a particular domain, with the goal of improving the productivity of research scientists by providing a natural interface for constructing applications [111]. PSEs have been cited in [112] as one of the key technologies required for enabling petascale computing on real applications by 2010.
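As a minimal illustration of the blocking point-to-point operations mentioned above, the following Fortran sketch sends one value from process 0 to process 1 with MPI_SEND and MPI_RECV. It is a sketch only: the program name and the data sent are made up, and it assumes an MPI implementation such as MPICH or LAM providing the mpif.h include file; it would be launched on two processes (for example, with mpirun -np 2).

    program ping
      implicit none
      include 'mpif.h'
      integer :: ierr, rank, status(MPI_STATUS_SIZE)
      double precision :: x

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

      if (rank .eq. 0) then
         x = 1.0d0
         ! Blocking send: returns once the buffer x may safely be reused
         call MPI_SEND(x, 1, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
      else if (rank .eq. 1) then
         ! Blocking receive: waits until the matching message has arrived
         call MPI_RECV(x, 1, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, status, ierr)
      end if

      call MPI_FINALIZE(ierr)
    end program ping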
A.5 Performance Measurement

Collecting information about the tasks performed during the execution of a program is done through instrumentation and performance metrics. Instrumentation is the process of generating a trace of the execution of the program, either through software or hardware [113]. Performance metrics are the data collected by the instrumentation system [114], which provide information about the status of the code at different times while the code is executing. Some of the performance metrics are summaries of statistics, while others contain detailed information on the status of the system at different times. A trace is a series of measurements that provide information on the status of a system over a period of time.

Performance instrumentation can be inserted at different levels in the system, that is, at the hardware, system software, run-time software, and application code levels [115]. Instrumentation calls for the application code can be inserted at different points in the software life cycle: in the source code, at compile time, in the object code while linking the libraries, and in the running executable file. The conventional method used for performance optimization is to find the most time-consuming kernel of a program and optimize it [116].

A.5.1 Tools

A great effort has been placed on the development of tools for high-performance computing [117, 118]. Given the high complexity of interactions among different components in a high-performance system, tools are needed to aid application programmers in developing their applications. Tools for the development of serial code are very mature: there are a variety of debuggers and profiling tools to aid programmers in code development and performance monitoring. Tools for parallel code development do not follow a standard; these tools take different approaches, use different data formats, and employ different display techniques [119].

There are three steps in performance analysis: data collection, data transformation, and data visualization and rendering. One data collection technique is the use of a profiler, which records the amount of time spent in different parts of the program. Gprof is an example of such a tool for the Unix operating system. Data transformation and visualization modify and present performance data so that it is comprehensible and useful to the user. It is important that tools provide appropriate visualization techniques to display information. Two basic principles for the visualization of performance data are given in [120]. The first states that the displays should be linked directly to the performance model. The second suggests that visualization techniques should be designed and applied in an integrated environment. Therefore, the selection of appropriate models for showing performance information to the user is extremely important.

Paradyn, ParaGraph, AIMS, VAMPIR, Pablo, Scalea, and KAP/PRO are examples of performance tools for the evaluation of parallel programs [121, 122, 123]. ParaGraph, AIMS, and VAMPIR are tools for the visualization of message-passing parallel programs. Pablo and Scalea support both message-passing and data-parallel programming models. KAP/PRO supports shared memory parallel programming.

Paradyn is one of the most successful tools for performance diagnosis, and it was the first tool to follow an automated search approach [124]. When it was first introduced, it combined several novel technologies, two of which were unique in its class: dynamic instrumentation and automated bottleneck search. Dynamic instrumentation is the process of instrumenting executable code while it is executing. Paradyn performs its instrumentation through the use of "trampolines": points in the code where the instrumentation is inserted and the code is diverted from its normal flow to an alternate path to collect information.
Since Paradyn does not require either compilation or linking with source code, it is one of the few tools available that can instrument proprietary code. The second technology Paradyn introduced is automated bottleneck search. Paradyn uses a set of general hypotheses about why, where, and when there is a performance bottleneck, and it uses instrumentation to collect information on whether each hypothesis is true or false. A set of thresholds predefined by the user is used for hypothesis testing. Paradyn is used for long-running codes of hours or days.

Scalea is another performance analysis tool for parallel programs [125]. It is composed of an instrumentation system, a runtime system, a performance repository, and a performance analysis and visualization system. It supports OpenMP, MPI, HPF, and OpenMP/MPI programs. Two novel features of Scalea are the classification of performance overheads and the support for multiple experiments.

Once the application is targeted to a specific platform, statistics may be used to study the behavior of the system.

A.5.2 Statistical Terms

In this section we present some statistical terminology useful in our work.

Accuracy refers to how close a measurement is to the real or actual value of a physical quantity. Precision determines how close measurements are to one another, independent of whether or not a measurement is accurate. Precision indicates the reproducibility of repeated measurements under the same conditions.

A datum is an observation obtained from our system; more than one observation is collected as data. There are two different types of association among data observations: descriptive and experimental. Descriptive relations are those involving data not controlled by any means: we observe the system without controlling any factors and establish relationships just from the observations. Experimental relations, on the other hand, are those in which an experiment is conducted and data are collected; in this case, causal relations can be established [9].

We call the entity producing data an object. A variable is a feature of the system with two or more values. The values of a variable are represented on one of four different scales: nominal, ordinal, interval, or ratio. A nominal scale refers to variables whose values cannot be ranked in any order; the values differ by kind or category only. An ordinal scale is one in which the values can be ranked in order; they are represented by numbers that express a hierarchy. Both nominal and ordinal scales are considered non-metric scales. Third, in the interval scale, equal differences between values have meaning, but ratios have no meaning. Finally, the ratio scale carries the most information: here ratios have meaning and there is a zero point in the scale. Interval and ratio scales are considered metric scales.

A statistical hypothesis is a statement about the characteristics of a population. The claim initially believed to be true is called the null hypothesis, or prior belief, and is denoted by H0. The alternative hypothesis is a statement contradictory to the null hypothesis and is denoted H1. For example, one experiment may test the effect of problem size on execution time for two different problem sizes. If it is believed that problem size does not affect execution time, then
$$
H_0: \mu_1 = \mu_2, \qquad H_1: \mu_1 \neq \mu_2, \qquad (A.9)
$$
where $\mu_1$ is the mean execution time for problem size one and $\mu_2$ is the mean execution time for problem size two.
Hypothesis testing is a method that uses sample data to decide whether or not the null hypothesis should be rejected. A method based on sample data to decide whether or not to reject H0 is called a test procedure. It is composed of a test statistic and a rejection region. A test statistic is a function computed from sample data and used to reach a decision about H0. A rejection region is the set of values of the test statistic for which H0 will be rejected. There are two different types of errors in hypothesis testing: type I and type II [126].

• Type I error: the error of selecting the alternative hypothesis when the null hypothesis is true.
• Type II error: the error of selecting the null hypothesis when it is false.

The probability of a type I error is denoted by α (also called the alpha value). Typical values of α used to make a decision are 0.10, 0.05, and 0.01; the most commonly used value of α is 0.05.

The F-ratio is the ratio of two sampling variances. The p-value, or significance level, is the probability of obtaining a statistic value as contradictory to the null hypothesis as the resulting one, assuming that the null hypothesis is true [126]. It is determined by the F-ratio and the degrees of freedom associated with each sampling variance. The degrees of freedom are the number of values in the data that are allowed to vary when the statistic is computed. The p-value tells us the probability of obtaining the resulting value of the F-ratio, or a larger one, by chance [127]. The smaller the p-value, the more contradictory the data are to the null hypothesis. If p-value ≤ α, the null hypothesis is rejected at level α; if p-value > α, the null hypothesis is not rejected at level α.

The methodological design of an experiment is done to obtain the most information possible with the smallest number of tests. A factor is an independent variable influencing the results and may have two or more levels. Factors are classified as design, held-constant, and allowed-to-vary factors [25]. Design factors are controlled in the experiment, held-constant factors are kept at a specific level during experimentation, and allowed-to-vary factors are ignored and not controlled. We may also have nuisance factors, which are factors not considered in the experiment. An experiment is a study in which changes are made to controlled inputs to the system in order to observe and understand effects on the output. A replication is a repetition of the same experiment. Randomization refers to the arrangement of individual runs of the experiment in random order. A set of levels of controllable factors administered to an experimental unit is called a treatment. Design of experiments (DOE) refers to the way in which the experiment is arranged, specifically, the way in which the treatments will be administered to the subjects in the study. A correct design will minimize the effect of uncontrollable factors and will determine whether variations in the output are random or significant effects. An experimental unit, or experimental run, is the basic unit to which a treatment is applied. Experimental error refers to a measure of the variations among observations with the same treatment.

The process of designing an experiment involves seven steps [25]:

1. Problem statement: establishment of the goals of the experiment.
2. Selection of factors and levels: classification of factors into design, held-constant, or allowed-to-vary factors, and selection of the levels to be tested for each factor.
3. Decision on the output variable: identification of the response variable for the experiment. In our case, a set of multiple responses is used as output variables.
4. Selection of an experimental design: selection of the number of replicates, the number of samples to take, the order of runs for experimental units, and the randomization scheme to be used.
5. Performance of the experiment: the actual experimental runs.
6. Analysis of data: selection of a statistical model for the response variable and testing of the model.
7. Conclusion: formulation of the conclusions drawn from the experiment.

There are three basic criteria to consider in an experiment: replication, randomization, and blocking. Replication refers to a treatment being applied more than once in an experiment [128]. It improves precision and allows the calculation of an estimate of the experimental error. We require at least two replicates of an experiment [25]. Randomization refers to the order in which experimental runs are executed. Statistical methods require that the measurements be independent and identically distributed (iid) random variables. This is ensured by randomization, which averages out the effects of nuisance factors. Blocking is the allocation of experimental runs into homogeneous conditions to improve comparisons. Blocking restricts complete randomization, since factors are only randomized within a block.

Some types of experiment designs are: simple, factorial, and full-factorial. A simple design is the most basic experiment; it refers to an experiment with only one factor [26]. If the factor has only two levels, that is, we are comparing two treatments, it is called a simple comparative study. We can use this type of experiment for screening purposes, but for complex interactions such as those encountered in high-performance computing systems, it is limited in the information it can provide. In a factorial design, two or more factors are tested simultaneously. This allows interactions among factors to be detected. When the response to a factor depends on the level of another factor, we say there is interaction between them.

For illustrative purposes we present an example of interaction. Assume we are studying the effect of problem size and machine type on the execution time of a particular application. We can plan two different experiments: one to test problem size and another to test the type of machine. In the first experiment we select machine A and vary the problem size from size 1 to size 2. In the second experiment we select problem size 1 and vary the type of machine: A and B. Assume we get the results shown in Figure A.2. From these graphs we might conclude that problem size does not affect execution time while machine type does. However, had the test of problem size been done on machine B, we might have obtained a graph like that in Figure A.3. In this second graph we can definitely see that problem size affects execution time. When we examine Machine A, execution time stays almost constant when varying problem size, but when Machine B is used, execution time varies as problem size changes. There is interaction between problem size and machine type because variations in problem size have a different effect under different levels of the machine-type factor.

Figure A.2. Experiment illustrating execution time for two simple comparative studies (execution time versus problem size on Machine A, and execution time versus machine type for size 1).

Figure A.3. Execution time when Machine B is used in the study (execution time versus problem size on Machine B).
A full-factorial design involves studying every combination of the levels of all factors [129] at the same time. If we let $F$ denote the number of factors and $l_k$ denote the number of levels of factor $k$, the total number of experimental runs for one repetition of the experiment is
$$
\mathrm{TotExp} = \prod_{k=0}^{F-1} l_k. \qquad (A.10)
$$
A full-factorial design involving only two levels for each factor is called a $2^F$ factorial design.

The randomization scheme is important when deciding on a specific design of experiments. In a completely randomized design, the order in which experimental runs are arranged is randomly allocated. When, in a factorial experiment, we are unable to completely randomize the order of the runs, a split-plot design may be used. In this design, one factor is selected for a treatment, and the order in which the treatments of this factor are applied is chosen either at random or in blocks. Next, a second factor is selected and, keeping the order of experimental runs selected for the first factor, a randomization scheme is chosen for the second factor. This can be repeated successively. When a third factor follows the same restrictions, the design is called a split-split plot design [25]. Partial randomization of experiments causes a higher experimental error and a more complex analysis, so a split-split plot design is suggested only when a completely randomized design is not feasible.

We will illustrate the concept with an example. Imagine we have two different servers and we want to measure system response time under three different types of workload. Suppose there is a restriction that we can only experiment on the servers at certain times of the day, and not simultaneously. We have two factors, server type and workload, with two and three levels, respectively:

• Factor A: Server 1 (S1) and Server 2 (S2)
• Factor B: Workload 1 (W1), Workload 2 (W2), and Workload 3 (W3)

The Yates, or standard, order used to list experiments is the following: first, factors are listed in alphabetical order, and then the levels are listed from lowest to highest. This does not correspond to the run order; the order of running experiments should be randomized to minimize the influence of nuisance factors. In this example, the standard order is as shown in Table A.1. For a fully randomized experiment, the last column shows a typical order for the experimental runs, that is, S2W2, S1W3, S2W1, S2W3, S1W2, and S1W1.

Table A.1. Order of experiments for a fully randomized experiment.

    Standard Order   Factor A (Server)   Factor B (Workload)   Fully Randomized
    1                S1                  W1                    6
    2                S1                  W2                    5
    3                S1                  W3                    2
    4                S2                  W1                    3
    5                S2                  W2                    1
    6                S2                  W3                    4

As can be noticed from this order, the runs switch back and forth between the two servers. This is impractical given the existing restrictions on the use of the servers. A more appropriate design would be to randomly select the server to run the experiment on and then select the workload, for example:

    Server 2, Workload 2
    Server 2, Workload 1
    Server 2, Workload 3
    Server 1, Workload 1
    Server 1, Workload 3
    Server 1, Workload 2

This is called a split-plot design. The whole-plot is the server type and the subplot is the workload.
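A randomized run order such as the one in the last column of Table A.1 can be produced with a uniform random number generator. The following Fortran sketch shows one possible way to do this and is not the procedure actually used in this work; it assumes the intrinsic random_number subroutine and a Fisher-Yates shuffle, and the variable names are illustrative.

    program randomize_runs
      implicit none
      integer, parameter :: n = 6        ! six treatment combinations, as in Table A.1
      integer :: order(n), i, j, tmp
      real :: u

      order = (/ (i, i = 1, n) /)        ! start from the Yates (standard) order
      call random_seed()

      ! Fisher-Yates shuffle driven by a uniform random number generator
      do i = n, 2, -1
         call random_number(u)           ! u is uniform on [0,1)
         j = 1 + int(u*real(i))          ! uniform integer between 1 and i
         tmp = order(i)
         order(i) = order(j)
         order(j) = tmp
      end do

      print *, 'Randomized run order:', order
    end program randomize_runs

For a split-plot design, the same idea would be applied in two stages: first shuffle the whole-plot factor (the server), then shuffle the subplot factor (the workload) within each server.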
Data obtained from a designed experiment can readily be analyzed with ANOVA procedures.

ANOVA

Analysis of variance (ANOVA) is a statistical procedure for analyzing the response of an experiment in order to identify the cause of variations in the obtained data. In ANOVA, the goal is to determine whether different treatments have an effect on a population. The null hypothesis tested by ANOVA is that no factor influences the response and that there is no interaction between any factors. Once the alpha level for the ANOVA test is selected, a set of test statistics is computed and a conclusion on whether the null hypothesis is probable or not is reached. In our case, ANOVA at a level of 0.05 will be used to establish relationships among factors and performance metrics. The ANOVA test is used when there are more than two treatments to compare.

There are three assumptions for the ANOVA test. First, the treatments are independent of each other; this is assured by the use of randomization in the experiment. Second, the distribution of the sample means should be normal; this is ensured by having a large enough group of samples. Finally, the variances of the groups should be approximately equal; this is known as the homogeneity of variance assumption. ANOVA is robust to deviations from these assumptions if the design is balanced, that is, if the number of samples from each population is the same.

A typical ANOVA procedure is summarized in the following steps:

• Assume the null hypothesis H0 is that all means are equal.
• Assume the alternative hypothesis H1 is that at least one mean is different.
• Assume that the treatments are independent, that the treatments follow a normal distribution, and that the homogeneity of variance assumption holds.
• Set the α level, that is, the allowed type I error on the results.
• Determine the F-ratio.
• Determine the p-value and conclude based on whether p-value > α. If p-value > α, the null hypothesis is assumed true; otherwise, we conclude that there are significant differences among the means of the populations.

If the null hypothesis is true, the factors do not affect the results. For example, if we obtain a p-value > 0.05 in our previous example, this means that, at this significance level, neither the server type nor the workload significantly affects the execution time; variations in execution time are due to the random nature of the measurement. In contrast, if the null hypothesis is false, at least one of the means is statistically different from the others. This does not determine which ones are different or how significantly different they are; multiple comparisons are used to determine this. Two types of comparisons are used: a priori comparisons and a posteriori comparisons [130]. A priori comparisons are planned before the experiment and are based on a previous theory we might have about the data itself. A posteriori, or post hoc, comparisons are not planned and are done after the data are collected in order to propose a hypothesis [130]. A typical a priori comparison used in statistical analysis is orthogonal contrasts, in which specific comparisons are studied [25]. Some methods used for post hoc comparisons are the Least Significant Difference (LSD), Student-Newman-Keuls, Tukey, and Scheffé tests. In most research situations, the outcome is the same regardless of the test used.
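To make the F-ratio computation concrete, consider a small artificial data set that is not taken from the experiments in this dissertation: three treatments with three observations each, namely (2, 3, 4), (4, 5, 6), and (6, 7, 8), with group means 3, 5, and 7 and grand mean 5. With $k = 3$ groups and $N = 9$ observations,
$$
SS_{\text{between}} = 3\,[(3-5)^2 + (5-5)^2 + (7-5)^2] = 24, \qquad
SS_{\text{within}} = 3\,[(-1)^2 + 0^2 + 1^2] = 6,
$$
$$
F = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)} = \frac{24/2}{6/6} = 12.
$$
Since 12 exceeds the tabulated critical value $F_{0.05}(2,6) \approx 5.14$, the p-value is smaller than $\alpha = 0.05$ and the null hypothesis of equal means is rejected; a post hoc comparison would then be used to determine which treatments differ.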
A.6 Summary

This chapter has presented the foundations for this dissertation. First, the mathematical foundations were presented. A mathematical signal was defined as a mathematical function used to represent a physical signal. The response of our system can be regarded as a physical signal. Finite elements and iterative solvers are the basis of our case study. These were presented along with matrix-vector multiplication algorithms, which constitute the kernel routine in our case study. Advanced architectures were described; important characteristics of advanced architectures are high clock rates, deep memory hierarchies, superscalar-superpipelined designs, out-of-order execution, and branch prediction algorithms. We continued by describing languages for parallel programming and tools for improving performance. The two most widely used languages for parallel programming are OpenMP and MPI. Performance tools collect information about the system and present it in a useful way; two performance tools presented as examples are Paradyn and Scalea. Some important statistical terms were defined. We use hypothesis testing to determine whether the results are due to chance or to a significant effect of a factor. Design of experiments (DOE) refers to the careful arrangement of experimental treatments in a systematic study. ANOVA is used for the analysis of the response of an experiment, and post hoc comparisons are used for classifying differences in the data.

APPENDIX B

Glossary

Abstraction: The act of leaving out of consideration one or more properties of a complex object so as to attend to others, in order to analyze or classify it. The process of formulating general concepts by abstracting common properties of instances; a general concept formed by extracting common features from specific examples. A generalization that ignores or hides details in order to capture some kind of commonality between different instances. Each abstraction has a number of primitive elements and composition rules [131].

Accuracy: How close a measurement is to the real or actual value of a physical quantity.

Algorithm: A well-defined procedure to solve a problem in a finite number of steps.

Algorithm Performance: A measurement of the computer performance of the implementation of an algorithm.

Conceptualization: The process of developing a new idea to solve a problem.

Correlation: A statistic representing how closely two variables co-vary; it can range from -1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation). A measure of how strongly related two variables x and y are in a sample [126]. A coefficient indicating the linear relationship between one variable and another; a correlation coefficient close to one indicates high correlation. It is computed as the covariance between the two variables divided by the product of the standard deviations of the variables.

Computational Electromagnetics: The application of numerical methods for the solution of partial differential equations and integral equations arising in the application of electromagnetics to areas such as guided waves, antennas, and scattering [39].

Computer performance characterization: A detailed description of the operation of a computer executing a set of instructions. It includes information on hardware and software execution.

Implementation: The task of turning an algorithm into a computer program.

Instantiation: The description of an idea as a series of steps to solve a problem. An instantiation is a collection of algorithms to solve a problem.

Instrumentation: The group of modules used to collect and manage data from a program while it runs on a parallel or distributed system [56].
Mapping Abstraction: A rule of correspondence established between sets that associates each element of a vocabulary describing a high-level abstraction with an element in the set of concrete architectures [6].

Metric: The valuations of observable quantities of a target computing system, stored in the form of variables.

Observable: The physical manifestation of a given quantity or variable.

Observable Computing System or OCS: Any given computing system with a defined set of observable measures.

Operation: A process or an action, such as addition, substitution, transposition, or differentiation, performed in a specified sequence and in accordance with specific rules.

Parallel Programming: The decomposition of a program for execution on multiple processing units at the same time.

Precision: How close measurements are to one another, independent of whether or not a measurement is accurate.

Random Variable: A function whose domain is the sample space of an experiment and whose range is a subset of the real line.

Relation: A subset of the product of two sets, R ⊆ A × B. If (a, b) is an element of R, then we write a R b, meaning a is related to b by R. A relation may be reflexive (a R a), symmetric (a R b ⇒ b R a), or transitive (a R b and b R c ⇒ a R c).

System: A set of objects and their interrelationships according to a prescribed set of rules.

APPENDIX C

Matrix-Vector Multiplication Algorithms

C.1 Algorithm A

This is the original matrix-vector multiplication algorithm used in the application, modified for OpenMP. This algorithm is less efficient in serial mode than the original serial algorithm, but in parallel it avoids thread interaction and allows implementation using OpenMP. It causes convergence in the code both when running serially and in parallel.

      Subroutine BiMATVECCav(vector,product)
      Implicit NONE
      Include 'prism.inc'
      Complex*16 vector(*),product(*)
c
c     Local variables
c
      Integer row,col,index
      Complex matEntry
c
c     Do the MATVEC
c
!$OMP PARALLEL PRIVATE(index, col, matEntry)
!$OMP DO
      Do row = 1,apUnk
        Do col = 1,apUnk
          If(row .LT. col) Then
            index = BIrowEndPoint(row)+col
          Else
            index = BIrowEndPoint(col)+row
          EndIf
          matEntry = Ybi(index)
          product(row) = product(row) + matEntry*vector(col)
        EndDo
      EndDo
!$OMP END DO
!$OMP END PARALLEL
      Return
      End

C.2 Algorithm B

This algorithm is similar to Algorithm A but removes the If condition by using two different loops, which split the matrix at the diagonal. There is no thread interaction, so it causes convergence both when running serially and in parallel.

      Subroutine BiMATVECCav(vector,product)
      Implicit NONE
      Include 'prism.inc'
      Complex*16 vector(*),product(*)
      Integer row,col,index
      Complex matEntry
c
c     Do the MATVEC
c
!$OMP PARALLEL PRIVATE(index, col, matEntry)
!$OMP DO
      Do row = 1,apUnk
        Do col = 1,row
          index = BIrowEndPoint(col)+row
          matEntry = Ybi(index)
          product(row) = product(row) + matEntry*vector(col)
        EndDo
      EndDo
!$OMP END DO
!$OMP END PARALLEL
!$OMP PARALLEL PRIVATE(index, col, matEntry)
!$OMP DO
      Do row = 1,apUnk
        Do col = row+1,apUnk
          index = BIrowEndPoint(row)+col
          matEntry = Ybi(index)
          product(row) = product(row) + matEntry*vector(col)
        EndDo
      EndDo
!$OMP END DO
!$OMP END PARALLEL
      Return
      End

C.3 Algorithm C

Algorithm C is the most inefficient matrix-vector multiplication algorithm we have implemented. It was used to verify the observability of metrics indicating a bad memory access pattern. It prevents convergence of the code when running in parallel.
      Subroutine BiMATVECCav(vector,product)
      Implicit NONE
      Include 'prism.inc'
      Complex*16 vector(*),product(*)
      Integer row,col,index,k
      Complex matEntry
c
c     Do the MATVEC
c     Note: elements of product are updated by multiple threads with no
c     synchronization, which is why convergence fails in parallel.
c
!$OMP PARALLEL PRIVATE(index, k, row, matEntry)
!$OMP DO
      Do col = apUnk,1,-1
        k = apUnk - 1
        index = col
        Do row = 1,col-1
          matEntry = Ybi(index)
          product(row) = product(row) + matEntry*vector(col)
          product(col) = product(col) + matEntry*vector(row)
          index = index + k
          k = k-1
        EndDo
        product(col) = product(col) + Ybi(index)*vector(col)
      EndDo
!$OMP END DO
!$OMP END PARALLEL
      Return
      End

C.4 Algorithm D

This is the original matrix-vector multiplication algorithm used in the application, similar to Algorithm A. It is serial and causes convergence in the application code.

      Subroutine BiMATVECCav(vector,product)
      Implicit NONE
      Include 'prism.inc'
      Complex*16 vector(*),product(*)
      Integer row,col,index
      Complex matEntry
c
c     Do the MATVEC
c
      Do row = 1,apUnk
        Do col = 1,apUnk
          If(row .LT. col) Then
            index = BIrowEndPoint(row)+col
          Else
            index = BIrowEndPoint(col)+row
          EndIf
          matEntry = Ybi(index)
          product(row) = product(row) + matEntry*vector(col)
        EndDo
      EndDo
      Return
      End

C.5 Algorithm E

This algorithm is similar to Algorithm B but runs serially. It causes convergence in the application code.

      Subroutine BiMATVECCav(vector,product)
      Implicit NONE
      Include 'prism.inc'
      Complex*16 vector(*),product(*)
      Integer row,col,index
      Complex matEntry
      Do row = 1,apUnk
        Do col = 1,row
          index = BIrowEndPoint(col)+row
          matEntry = Ybi(index)
          product(row) = product(row) + matEntry*vector(col)
        EndDo
      EndDo
      Do row = 1,apUnk
        Do col = row+1,apUnk
          index = BIrowEndPoint(row)+col
          matEntry = Ybi(index)
          product(row) = product(row) + matEntry*vector(col)
        EndDo
      EndDo
      Return
      End

C.6 Algorithm F

Matrix-vector multiplication algorithm described on page 22 of [55], used for a validation experiment. It was implemented in parallel using OpenMP directives. To avoid threads overwriting the information, an ATOMIC directive is included.

      Subroutine MatVectMult2(ProblSize,DenseMatrix,VectorIn,product)
c     Golub & Van Loan algorithm
      Integer ProblSize
      Complex*16 DenseMatrix(*),VectorIn(*),product(*)
      Integer row,col,index
c
c     Do the MATVEC multiplication
c
!$OMP PARALLEL PRIVATE(index, row)
!$OMP DO
      Do col = 1, ProblSize
        Do row = 1, col-1
          index = (row-1)*ProblSize - row*(row-1)/2 + col
!$OMP ATOMIC
          product(row) = product(row)+DenseMatrix(index)*VectorIn(col)
        EndDo
        Do row = col, ProblSize
          index = (col-1)*ProblSize - col*(col-1)/2 + row
!$OMP ATOMIC
          product(row) = product(row)+DenseMatrix(index)*VectorIn(col)
        EndDo
      EndDo
!$OMP END DO
!$OMP END PARALLEL
c
      Return
      End

C.7 Algorithm G

Matrix-vector multiplication algorithm modified to read the data in reverse. It was implemented in parallel using OpenMP directives. We expect it to have poor performance, and it is used for validation purposes.
      Subroutine MatVectMult3(ProblSize,DenseMatrix,VectorIn,product)
      Integer ProblSize
      Complex*16 DenseMatrix(*),VectorIn(*),product(*)
      Integer row,col,index,k
      Complex*16 matEntry
c
c     Do the MATVEC multiplication
c
!$OMP PARALLEL PRIVATE(index, k, row, matEntry)
!$OMP DO
      Do col = ProblSize, 1, -1
        k = ProblSize - 1
        index = col
        Do row = 1, col-1
          matEntry = DenseMatrix(index)
!$OMP ATOMIC
          product(row) = product(row) + matEntry*VectorIn(col)
          product(col) = product(col) + matEntry*VectorIn(row)
          index = index + k
          k = k-1
        EndDo
!$OMP ATOMIC
        product(col) = product(col) + DenseMatrix(index)*VectorIn(col)
      EndDo
!$OMP END DO
!$OMP END PARALLEL
      Return
      End

APPENDIX D

Experiment 1

D.1 Order of Execution of Experimental Runs for Experiment 1

The design of the experiment was randomized as a split-split plot design. The main plot was the repetition, where three repetitions were done. The subplots were selected at random, where the problem size and the matrix-vector multiplication algorithm were chosen. In each of these subplots, compiler options for generating the executable files were selected at random using a uniform-distribution random number generator. The following table contains the actual order in which the experimental runs were performed given this randomization scheme.

Table D.1. Order of execution of experiments

    Experimental Run   Size (N)   Algorithm   Compiler Options
    1                  6033       B           -fast -WGstats
    2                  6033       B           -unroll=2 -fast -xcrossfile -WGstats
    3                  6033       B           No flags -WGstats
    4                  6033       B           -xcrossfile -fast -WGstats
    5                  6033       B           -fast -xcrossfile -WGstats
    6                  6033       B           -fast -xcrossfile -unroll=2 -WGstats
    7                  6033       B           -xcrossfile -unroll=2 -fast -WGstats
    8                  6033       B           -unroll=2 -fast -WGstats

continued on next page

Table D.1 (cont'd).
Experimental Run Size (N) Algorithm Compiler Options 34 13857 B —fast -xcrossfile -WGstats 35 13857 B -unroll=2 -fast ~WGstats 36 13857 B -xcrossfile -fast -unroll=2 -WGstats 37 13857 B -fast -unroll=2 -xcrossfile -WGstats 38 13857 B -unroll=2 -fast -xcrossfile -WGstats 39 13857 B ~unroll=2 -WGstats 40 13857 A -unroll=2 -fast -xcrossfile -WGstats 41 13857 A -unroll=2 -WGstats 42 13857 A -xcrossfi1e -unroll=2 -fast -WGstats 43 13857 A -fast -unroll=2 -xcrossfile -WGstats 44 13857 A -fast -xcrossfile -unroll=2 -WGstats 45 13857 A -fast -xcrossfile -WGstats 46 13857 A -xcrossfile -fast ~unroll=2 -WGstats 47 13857 A -xcrossfile -fast -WGstats 48 13857 A -fast -WGstats 49 13857 A -unroll=2 -fast -WGstats 50 13857 A No flags -WGstats 51 13857 A -fast -unroll=2 -WGstats 52 13857 A ~unroll=2 -xcrossfile -fast -WGstats 53 6337 B -unroll=2 -WGstats 54 6337 B -fast -unroll=2 -WGstats 55 6337 B ~unroll=2 -xcrossfile -fast -WGstats 56 6337 B -fast -xcrossfile -unroll=2 -WGstats 57 6337 B No flags -WGstats 58 6337 B -fast -xcrossfile -WGstats continued on next page 159 Table D.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 59 6337 B -xcrossfile -unroll=2 -fast -WGstats 60 6337 B -unroll=2 ~fast -xcrossfile -WGstats 61 6337 B -unroll=2 -fast -WGstats 62 6337 B -fast -WGstats 63 6337 B -xcrossfile -fast -WGstats 64 6337 B -xcrossfile -fast -unroll=2 -WGstats 65 6337 B -fast -unroll=2 -xcrossfile -WGstats 66 6337 A -fast -unroll=2 -xcrossfile -WGstats 67 6337 A -xcrossfile -unroll=2 -fast -WGstats 68 6337 A -fast -WGstats 69 6337 A -unroll=2 -xcrossfile -fast -WGstats 70 6337 A -xcrossfile -fast -WGstats 71 6337 A —fast -unroll=2 -WGstats 72 6337 A -fast -xcrossfile -unroll=2 -WGstats 73 6337 A -unroll=2 -fast -WGstats 74 6337 A -unroll=2 -WGstats 75 6337 A -unroll=2 -fast -xcrossfile -WGstats 76 6337 A -fast -xcrossfile -WGstats 77 6337 A No flags -WGstats 78 6337 A -xcrossfile -fast -unroll=2 -WGstats 79 13857 A No flags -WGstats 80 13857 A -fast -unroll=2 -WGstats 81 13857 A -xcrossfile -unroll=2 -fast -WGstats 82 13857 A -fast -WGstats 83 13857 A -xcrossfile -fast -unroll=2 -WGstats continued on next page 160 Table D.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 84 13857 A -unroll=2 -fast -WGstats 85 13857 A -unroll=2 -WGstats 86 13857 A -unroll=2 -xcrossfile -fast -WGstats 87 13857 A -fast -unroll=2 -xcrossfile -WGstats 88 13857 A -xcrossfile -fast -WGstats 89 13857 A -fast -xcrossfile -WGstats 90 13857 A -unroll=2 -fast -xcrossfile -WGstats 91 13857 A -fast -xcrossfile -unroll=2 -WGstats 92 13857 B -xcrossfile -fast -unroll=2 -WGstats 93 13857 B -xcrossfile -fast -WGstats 94 13857 B -unroll=2 -WGstats 95 13857 B -fast -unroll=2 -xcrossfile -WGstats 96 13857 B -fast -xcrossfile -WGstats 97 13857 B -fast -unroll=2 -WGstats 98 13857 B -unroll=2 -fast -WGstats 99 13857 B -unroll=2 -fast -xcrossfile -WGstats 100 13857 B -fast -WGstats 101 13857 B -fast -xcrossfile -unroll=2 -WGstats 102 13857 B -xcrossfile -unroll=2 -fast -WGstats 103 13857 B -unroll=2 -xcrossfile -fast -WGstats 104 13857 B No flags -WGstats 105 6033 A -unroll=2 -WGstats 106 6033 A -fast -xcrossfi1e -WGstats 107 6033 A -unroll=2 -fast -WGstats 108 6033 A -fast -unroll=2 -WGstats continued on next page 161 Table D.1 (cont’d). 
Experimental Run Size (N) Algorithm Compiler Options 109 6033 A -xcrossfile -unroll=2 -fast -WGstats 110 6033 A -xcrossfile -fast -unroll=2 -WGstats 111 6033 A —unroll=2 -xcrossfile -fast -WGstats 112 6033 A -fast -xcrossfile -unroll=2 -WGstats 113 6033 A -fast -WGstats 114 6033 A -fast -unroll=2 -xcrossfile —WGstats 115 6033 A -unroll=2 -fast -xcrossfile -WGstats 116 6033 A No flags -WGstats 117 6033 A -xcrossfile -fast -WGstats 118 6033 B -xcrossfile -fast -WGstats 119 6033 B -fast ~xcrossfile -WGstats 120 6033 B -xcrossfile -unroll=2 -fast -WGstats 121 6033 B -fast -xcrossfile -unroll=2 -WGstats 122 6033 B -fast -unroll=2 -xcrossfile -WGstats 123 6033 B -unroll=2 -fast -xcrossfile -WGstats 124 6033 B -unroll=2 -WGstats 125 6033 B -fast -unroll=2 —WGstats 126 6033 B -unroll=2 -fast -WGstats 127 6033 B -xcrossfile -fast -unroll=2 -WGstats 128 6033 B -unroll=2 —xcrossfile -fast -WGstats 129 6033 B -fast -WGstats 130 6033 B No flags —WGstats 131 6337 A -fast -unroll=2 -WGstats 132 6337 A -unroll=2 -fast -xcrossfile -WGstats 133 6337 A -xcrossfile -unroll=2 -fast -WGstats continued on next page 162 Table D.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 134 6337 A -fast -unroll=2 -xcrossfile —WGstats 135 6337 A No flags -WGstats 136 6337 A -unroll=2 -WGstats 137 6337 A -fast -WGstats 138 6337 A -unroll=2 -xcrossfile -fast -WGstats 139 6337 A -fast —xcrossflle -WGstats 140 6337 A -xcrossfile -fast -unroll=2 -WGstats 141 6337 A -xcrossfile -fast -WGstats 142 6337 A -unroll=2 -fast -WGstats 143 6337 A -fast -xcrossfile -unroll=2 -WGstats 144 6337 B -fast ~WGstats 145 6337 B -fast -xcrossfile -unroll=2 -WGstats 146 6337 B -unroll=2 -WGstats 147 6337 B -xcrossfile -unroll=2 -fast -WGstats 148 6337 B -xcrossfile -fast -WGstats 149 6337 B -xcrossfile -fast -unroll=2 ~WGstats 150 6337 B -fast -unroll=2 -xcrossf'ile -WGstats 151 6337 B -unroll=2 -fast -WGstats 152 6337 B ~fast -xcrossfile -WGstats 153 6337 B -unroll=2 -fast -xcrossfile —WGstats 154 6337 B No flags -WGstats 155 6337 B -fast -unroll=2 -WGstats 156 6337 B -unroll=2 -xcrossfile -fast -WGstats 157 6337 B -fast -xcrossfile -unroll=2 -WGstats 158 6337 B -xcrossfile -unroll=2 -fast -WGstats continued on next page 163 Table D.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 159 6337 B -fast -xcrossfile -WGstats 160 6337 B -xcrossfile -fast -WGstats 161 6337 B -unroll=2 -xcrossfile -fast -WGstats 162 6337 B -fast -unroll=2 -WGstats 163 6337 B ~unroll=2 -fast -WGstats 164 6337 B No flags -WGstats 165 6337 B -unroll=2 -WGstats 166 6337 B -unroll=2 -fast -xcrossfile -WGstats 167 6337 B -fast -unroll=2 -xcrossfile -WGstats 168 6337 B -fast -WGstats 169 6337 B ~xcrossfile -fast -unroll=2 -WGstats 170 6337 A -fast -unroll=2 ~xcrossfile -WGstats 171 6337 A No flags -WGstats 172 6337 A -unroll=2 -WGstats 173 6337 A -xcrossfile -fast -unroll=2 -WGstats 174 6337 A —fast -xcrossfile -unroll=2 -WGstats 175 6337 A -fast -unroll=2 -WGstats 176 6337 A -xcrossfile -fast -WGstats 177 6337 A -fast -WGstats 178 6337 A -unroll=2 -fast -xcrossfile -WGstats 179 6337 A -fast -xcrossfile -WGstats 180 6337 A -unroll=2 -xcrossfile -fast -WGstats 181 6337 A -unroll=2 -fast -WGstats 182 6337 A -xcrossfile -unroll=2 -fast -WGstats 183 6033 B -xcrossfile -fast -unroll=2 -WGstats continued on next page 164 Table D.1 (cont’d). 
Experimental Run Size (N) Algorithm Compiler Options 184 6033 B -fast —xcrossfile -unroll=2 -WGstats 185 6033 B -fast -WGstats 186 6033 B —unroll=2 -xcrossfile -fast -WGstats 187 6033 B -unroll=2 -fast -WGstats 188 6033 B -unroll=2 -WGstats 189 6033 B -fast -xcrossfile -WGstats 190 6033 B No flags -WGstats 191 6033 B -xcrossfile -fast -WGstats 192 6033 B -unroll=2 -fast -xcrossfile -WGstats 193 6033 B -xcrossfile -unroll=2 -fast -WGstats 194 6033 B -fast -unroll=2 -WGstats 195 6033 B -fast -unroll=2 -xcrossfile -WGstats 196 6033 A -xcrossfile -fast —unroll=2 -WGstats 197 6033 A -fast -WGstats 198 6033 A -xcrossfile -fast —WGstats 199 6033 A -fast -xcrossfile ~unroll=2 -WGstats 200 6033 A -fast -unroll=2 -xcrossfile -WGstats 201 6033 A -unroll=2 -fast -xcrossfile -WGstats 202 6033 A ~xcrossfile -unroll=2 -fast ~WGstats 203 6033 A -unroll=2 -fast -WGstats 204 6033 A No flags -WGstats 205 6033 A -fast -xcrossfile -WGstats 206 6033 A -unroll=2 -WGstats 207 6033 A -unroll=2 -xcrossfile -fast -WGstats 208 6033 A -fast -unroll=2 -WGstats continued on next page 165 Table D.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 209 13857 B —xcrossfile -unroll=2 -fast -WGstats 210 13857 B -fast -xcrossfile -unroll=2 -WGstats 211 13857 B -fast -unroll=2 -xcrossfile -WGstats 212 13857 B -unroll=2 -fast -WGstats 213 13857 B -fast -WGstats 214 13857 B -fast -xcrossfile -WGstats 215 13857 B -xcrossfile -fast -unroll=2 -WGstats 216 13857 B -xcrossfile -fast -WGstats 217 13857 B -unroll=2 -xcrossfile -fast -WGstats 218 13857 B -unroll=2 -fast -xcrossfile -WGstats 219 13857 B No flags -WGstats 220 13857 B -fast -unroll=2 -WGstats 221 13857 B -unroll=2 -WGstats 222 13857 A -unroll=2 -fast -xcrossfile -WGstats 223 13857 A -fast -unroll=2 -xcrossfile -WGstats 224 13857 A —fast -WGstats 225 13857 A -xcrossfile -unroll=2 -fast -WGstats 226 13857 A -xcrossfile -fast -unroll=2 -WGstats 227 13857 A -fast -xcrossfile -unroll=2 -WGstats 228 13857 A -fast -unroll=2 -WGstats 229 13857 A -unroll=2 —WGstats 230 13857 A -unroll=2 -fast -WGstats 231 13857 A -fast -xcrossfile -WGstats 232 13857 A No flags -WGstats 233 13857 A -xcrossfile -fast -WGstats continued on next page 166 Table D.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 234 13857 A —unroll=2 -xcrossfile -fast -WGstats D.2 Anova on the metrics obtained in Experiment 1 Table D.2: AN OVA Problem Algorithm Compiler Interactions Label Name Size (S) (A) Option (C) S * A S t C A a: C S t A a: C m0 Exec. Time Yes Yes Yes No No Yes No m1 bread /s No No N o No No No No m2 lread/s No Yes Yes No No No No m3 %rcache N o No No N o No No No m4 bwrit /s N 0 Yes Yes No Yes No No m5 lwrit /s N 0 Yes Yes No No No No m6 %wcache N o No Yes No No No No m7 pgout /s No No Yes No Yes No No m8 ppgout /s No N 0 Yes No Yes No No m9 pgfree/s No No No N o No No No mlO pgscan/s No No No No No No No m1 1 atch/s No No N o No No No No m12 pgin/s No No No No No No No m13 ppgin/s No No No No No N o No In 14 pflt /s No No No No Yes No No m15 vflt/s Yes Yes Yes No Yes No Yes m16 %usr No N 0 Yes Yes No Yes No continued on next page 167 Table D.2: (cont’d). 
Problem Algorithm Compiler Interactions Label Name Size (S) (A) Option (C) S * A S a: C A * C S a: A at C m17 %sys Yes Yes Yes No No Yes No m18 ‘70in No No No No No No No m19 %idle No Yes Yes Yes N 0 Yes N o m20 pswch/s Yes Yes Yes N o No Yes No m21 c0t0d0/rps No No N o No No No No m22 c0t0d0/ wps No Yes Yes N 0 Yes No N o m23 c0t0d0/ util No Yes Yes No N o N o No m24 c0t1d0/rps No No N o N o No No N o m25 c0t1d0/wps No No No No No No No m26 c0t1d0/util No N o N o No No No No m27 cpu / us No N 0 Yes Yes No Yes N 0 m28 cpu /sy Yes Yes Yes No N 0 Yes No m29 cpu / wt No No No No No No No m30 cpu / id No Yes Yes Yes N 0 Yes No m31 memory / swap No N 0 Yes No No N o N o m32 memory / free N o No Yes N o No No No m33 page / re No No N o No No N o No m34 page / mf Yes Yes Yes N 0 Yes No Yes m35 page / pi N o No N o No No N o No m36 page / po No No Yes No Yes No No m37 page / fr No N o No No No No No m38 page / sr N o N o No No N o No N o m39 disk / 30 Yes Yes Yes N 0 Yes No No continued on next page 168 Table D.2: (cont’d). Problem Algorithm Compiler Interactions Label Name Size (S) (A) Option (C) S a: A S a: C A * C S * A a: C m40 disk / 31 No No No No No No No m4] faults / in No No Yes No No No N o m42 faults / sy Yes Yes Yes N o N 0 Yes No m43 faults / cs Yes Yes Yes No N 0 Yes No m44 cpu/us_1 No No Yes Yes No Yes No m45 cpu/sy_1 Yes Yes Yes No No Yes No m46 cpu/id-l No Yes Yes Yes No Yes No Yes implies the hypothesis is rejected at alpha level 0.05. 169 APPENDIX E Experiment 2 El Order of Execution of Experimental Runs for Experi- ment 2 This is a split-split plot design. The main plot was the repetition, where three repetitions were done. The subplots were selected at random where problem size and matrix-vector multiplication algorithm was selected. In each of these subplots, compiler options for gen- erating the executable files were selected at random using an uniform distribution random number generator. The following table contains the actual order in which the experimental runs were performed following the randomization scheme described above. There were a total of 234 experimental runs in this experiment. Table E.1. Order of execution of experiments Experimental Run Size (N) Algorithm Compiler Options 1 6337 D -fast -unroll=2 -WGstats 2 6337 D -unroll=2 -fast -xcrossfile -WGstats 3 6337 D -xcrossfile -fast -unroll=2 —WGstats 4 6337 D -xcrossfile -fast -WGstats 5 6337 D —unroll=2 -WGstats 6 6337 D -xcrossfile -unroll=2 -fast -WGstats 7 6337 D -fast -unroll=2 -xcrossfile -WGstats continued on next page 170 Table E.1 (cont’d). 
Experimental Run Size (N) Algorithm Compiler Options 8 6337 -fast -xcrossflle -unroll=2 -WGstats 9 6337 D -unroll=2 -xcrossfile -fast -WGstats 10 6337 D No flags -WGstats 11 6337 D -fast -xcrossfile -WGstats 12 6337 D -unroll=2 -fast -WGstats 13 6337 D -fast —WGstats 14 6337 E -xcrossfile -fast -unroll=2 -WGstats 15 6337 E No flags -WGstats 16 6337 E -xcrossfile -fast -WGstats 17 6337 E -unroll=2 ~WGstats 18 6337 E -fast -xcrossfile -unroll=2 -WGstats 19 6337 E -unroll=2 -fast -xcrossfile -WGstats 20 6337 E -xcrossflle -unroll=2 -fast -WGstats 21 6337 E -unroll=2 -fast -WGstats 22 6337 E -fast -WGstats 23 6337 E -fast -unroll=2 -WGstats 24 6337 E -fast -xcrossfile -WGstats 25 6337 E -fast ~unroll=2 -xcrossflle ~WGstats 26 6337 E -unroll=2 -xcrossfile -fast -WGstats 27 6033 D -xcrossfile -fast -unroll=2 -WGstats 28 6033 D ~unroll=2 -fast -WGstats 29 6033 D -unroll=2 -xcrossfile -fast -WGstats 30 6033 D -fast -WGstats 31 6033 D -xcrossfile -unroll=2 -fast -WGstats 32 6033 D No flags -WGstats continued on next page 171 Table E.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 33 6033 D -xcrossfile -fast -WGstats 34 6033 D -fast -unroll=2 -WGstats 35 6033 D -unroll=2 -WGstats 36 6033 D -fast -xcrossfile -unroll=2 -WGstats 37 6033 D -fast -xcrossflle -WGstats 38 6033 D -unroll=2 -fast -xcrossfile -WGstats 39 6033 D -fast -unroll=2 -xcrossfile -WGstats 40 6033 E -unroll=2 -xcrossfile —fast -WGstats 41 6033 E -unroll=2 -fast -xcrossfile -WGstats 42 6033 E -unroll=2 -fast -WGstats 43 6033 E -xcrossfile -fast -WGstats 44 6033 E -fast -WGstats 45 6033 E -xcrossfile -unroll=2 -fast -WGstats 46 6033 E -xcrossfile -fast -unroll=2 -WGstats 47 6033 E No flags -WGstats 48 6033 E -fast -xcrossfile -unroll=2 -WGstats 49 6033 E -unroll=2 -WGstats 50 6033 E -fast -unroll=2 -xcrossfile -WGstats 51 6033 E -fast -xcrossflle -WGstats 52 6033 E -fast -unroll=2 -WGstats 53 13857 D -unroll=2 -fast -xcrossfile -WGstats 54 13857 D -fast -xcrossfile -unroll=2 -WGstats 55 13857 D No flags -WGstats 56 13857 D -xcrossfile -fast -WGstats 57 13857 D -xcrossfile -fast -unroll=2 -WGstats continued on next page 172 Table E.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 58 13857 -fast -unroll=2 -xcrossflle ~WGstats 59 13857 -xcrossfile -unroll=2 -fast -WGstats 60 13857 D -fast -WGstats 61 13857 D -unroll=2 -xcrossfile -fast -WGstats 62 13857 D -fast -xcrossfile -WGstats 63 13857 D -fast -unroll=2 -WGstats 64 13857 D -unroll=2 -WGstats 65 13857 D -unroll=2 -fast -WGstats 66 13857 E -unroll=2 -fast -WGstats 67 13857 E -fast -xcrossflle -WGstats 68 13857 E -fast -unroll=2 -WGstats 69 13857 E -xcrossfile -fast -WGstats 70 13857 E -unroll=2 -WGstats 71 13857 E -fast -WGstats 72 13857 E -fast -unroll=2 -xcrossfile -WGstats 73 13857 E No flags -WGstats 74 13857 E -xcrossfi1e -unroll=2 -fast -WGstats 75 13857 E -xcrossfile -fast -unroll=2 -WGstats 76 13857 E -unroll=2 -xcrossfile -fast -WGstats 77 13857 E -fast -xcrossfile -unroll=2 -WGstats 78 13857 E -unroll=2 -fast -xcrossfile -WGstats 79 6337 D -unroll=2 -xcrossfile -fast -WGstats 80 6337 D -fast -unroll=2 -xcrossfile -WGstats 81 6337 D -xcrossfile -unroll=2 -fast -WGstats 82 6337 D -fast -xcrossflle -unroll=2 -WGstats continued on next page 173 Table E.1 (cont’d). 
Experimental Run Size (N) Algorithm Compiler Options 83 6337 -fast -xcrossfile -WGstats 84 6337 D -unroll=2 -WGstats 85 6337 D -unroll=2 -fast -xcrossfile -WGstats 86 6337 D No flags -WGstats 87 6337 D -fast -WGstats 88 6337 D -xcrossfile -fast -unroll=2 -WGstats 89 6337 D -xcrossfile -fast -WGstats 90 6337 D -unroll=2 -fast -WGstats 91 6337 D -fast -unroll=2 -WGstats 92 6337 E -fast -unroll=2 -WGstats 93 6337 E No flags -WGstats 94 6337 E -unroll=2 -fast -xcrossflle -WGstats 95 6337 E -fast -unroll=2 -xcrossfile -WGstats 96 6337 E -xcrossflle -fast -unroll=2 -WGstats 97 6337 E -xcrossfile -fast -WGstats 98 6337 E -unroll=2 -WGstats 99 6337 E -fast -xcrossfile -WGstats 100 6337 E -fast -xcrossfile -unroll=2 -WGstats 101 6337 E -unroll=2 -fast -WGstats 102 6337 E -fast -WGstats 103 6337 E -unroll=2 -xcrossfile -fast -WGstats 104 6337 E -xcrossfile -unroll=2 -fast -WGstats 105 13857 D -fast -unroll=2 -xcrossfile -WGstats 106 13857 D -unroll=2 -WGstats 107 13857 D -fast -xcrossfile -unroll=2 -WGstats continued on next page 174 Table 13.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 108 13857 -fast -xcrossfile -WGstats 109 13857 D ~unroll=2 -fast -WGstats 110 13857 D -fast -unroll=2 -WGstats 111 13857 D -unroll=2 -xcrossfile -fast -WGstats 112 13857 D -xcrossfile -fast -unroll=2 -WGstats 113 13857 D No flags -WGstats 114 13857 D -xcrossfile -unroll=2 -fast ~WGstats 115 13857 D -fast -WGstats 116 13857 D -xcrossfile -fast -WGstats 117 13857 D -unroll=2 -fast -xcrossfile -WGstats 118 13857 E -fast -unroll=2 -xcrossflle -WGstats 119 13857 E -unroll=2 -WGstats 120 13857 E -fast -unroll=2 -WGstats 121 13857 E No flags -WGstats 122 13857 E -fast -xcrossfile -WGstats 123 13857 E -xcrossfile -fast -WGstats 124 13857 E -unroll=2 ~fast -xcrossfile -WGstats 125 13857 E -unroll=2 -xcrossfile -fast -WGstats 126 13857 E -xcrossfile -fast -unroll=2 -WGstats 127 13857 E -fast -WGstats 128 13857 E -fast -xcrossfile -unroll=2 -WGstats 129 13857 E -unroll=2 -fast -WGstats 130 13857 E -xcrossfile -unroll=2 -fast -WGstats 131 6033 E -xcrossfile -fast -WGstats 132 6033 E -fast -unroll=2 -WGstats continued on next page 175 Table E.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 133 6033 E -unroll=2 -fast -xcrossfile -WGstats 134 6033 E -xcrossfile -unroll=2 -fast -WGstats 135 6033 E No flags -WGstats 136 6033 E -fast -WGstats 137 6033 E -unroll=2 -fast -WGstats 138 6033 E -unroll=2 -WGstats 139 6033 E -unroll=2 -xcrossfile -fast -WGstats 140 6033 E -fast -xcrossflle -unroll=2 -WGstats 141 6033 E -fast -unroll=2 -xcrossfile -WGstats 142 6033 E -fast ~xcrossfile -WGstats 143 6033 E -xcrossfile -fast -unroll=2 -WGstats 144 6033 D -fast -unroll=2 -xcrossfile -WGstats 145 6033 D -unroll=2 -xcrossfile -fast -WGstats 146 6033 D -fast -xcrossflle -unroll=2 -WGstats 147 6033 D No flags -WGstats 148 6033 D -xcrossfile -fast -WGstats 149 6033 D -unroll=2 -WGstats 150 6033 D -xcrossfile -fast -unroll=2 ~WGstats 151 6033 D -xcrossfile -unroll=2 -fast ~WGstats 152 6033 D ~fast -xcrossfile -WGstats 153 6033 D -fast -unroll=2 -WGstats 154 6033 D -unroll=2 -fast -xcrossflle -WGstats 155 6033 D -fast -WGstats 156 6033 D -unroll=2 -fast -WGstats 157 13857 D -unroll=2 -WGstats continued on next page 176 Table E.1 (cont’d). 
Experimental Run Size (N) Algorithm Compiler Options 158 13857 D -unroll=2 -fast -WGstats 159 13857 D -fast -xcrossfile -unroll=2 -WGstats 160 13857 D -fast -unroll=2 -xcrossfile —WGstats 161 13857 D -unroll=2 -xcrossfile -fast -WGstats 162 13857 D -xcrossfile -unroll=2 -fast -WGstats 163 13857 D -xcrossflle -fast -unroll=2 -WGstats 164 13857 D -xcrossflle -fast -WGstats 165 13857 D -fast -unroll=2 -WGstats 166 13857 D -fast -WGstats 167 13857 D No flags -WGstats 168 13857 D -unroll=2 ~fast -xcrossfile -WGstats 169 13857 D -fast -xcrossfile -WGstats 170 13857 E -fast -unroll=2 -WGstats 171 13857 E -xcrossflle -fast -WGstats 172 13857 E No flags -WGstats 173 13857 E -xcrossfile —fast ~unroll=2 -WGstats 174 13857 E -unroll=2 -WGstats 175 13857 E -fast -xcrossfile -WGstats 176 13857 E -unroll=2 -fast -xcrossfile -WGstats 177 13857 E -fast -WGstats 178 13857 E -unroll=2 -fast -WGstats 179 13857 E -unroll=2 -xcrossfile -fast -WGstats 180 13857 E -fast -xcrossfile -unroll=2 -WGstats 181 13857 E -fast -unroll=2 -xcrossflle -WGstats 182 13857 E -xcrossfile -unroll=2 -fast -WGstats continued on next page 177 Table E.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 183 6033 E -fast -unroll=2 -WGstats 184 6033 E -unroll=2 -fast -xcrossfile -WGstats 185 6033 E -unroll=2 -xcrossfile -fast -WGstats 186 6033 E No flags -WGstats 187 6033 E -xcrossflle -fast -WGstats 188 6033 E -fast -xcrossfile -WGstats 189 6033 E -unroll=2 -fast -WGstats 190 6033 E -unroll=2 -WGstats 191 6033 E -xcrossfile -unroll=2 -fast -WGstats 192 6033 E -fast -unroll=2 -xcrossfile -WGstats 193 6033 E -fast -xcrossfile -unroll=2 -WGstats 194 6033 E -fast -WGstats 195 6033 E -xcrossfile ~fast -unroll=2 -WGstats 196 6033 D -unroll=2 -WGstats 197 6033 D -fast -xcrossfile -unroll=2 -WGstats 198 6033 D -xcrossfile -fast -WGstats 199 6033 D -fast -WGstats 200 6033 D -fast -unroll=2 -WGstats 201 6033 D -unroll=2 -fast -WGstats 202 6033 D -xcrossfile -fast -unroll=2 -WGstats 203 6033 D -fast -unroll=2 -xcrossfile -WGstats 204 6033 . D -xcrossfile -unroll=2 -fast -WGstats 205 6033 D -unroll=2 -xcrossfile -fast -WGstats 206 6033 D No flags -WGstats 207 6033 D ~unroll=2 -fast -xcrossfile -WGstats continued on next page 178 Table E.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 208 6033 D -fast -xcrossflle -WGstats 209 6337 D -fast -xcrossfile -WGstats 210 6337 D -unroll:2 -WGstats 211 6337 D -xcrossfile -fast -unroll=2 -WGstats 212 6337 D No flags ~WGstats 213 6337 D ~unroll=2 ~fast -xcrossfile -WGstats 214 6337 D -fast ~unroll=2 -xcrossfile -WGstats 215 6337 D -unroll=2 -fast -WGstats 216 6337 D -xcrossfile -fast -WGstats 217 6337 D -unroll=2 -xcrossfile -fast -WGstats 218 6337 D -fast -WGstats 219 6337 D -fast -xcrossflle -unroll=2 -WGstats 220 6337 D -fast -unroll=2 -WGstats 221 6337 D -xcrossflle -unroll=2 -fast -WGstats 222 6337 E —unroll=2 -fast -WGstats 223 6337 E -fast -xcrossflle —unroll=2 -WGstats 224 6337 E -unroll=2 -fast -xcrossfile -WGstats 225 6337 E -unroll=2 -WGstats 226 6337 E -fast -xcrossfile -WGstats 227 6337 E -xcrossfile -fast -unroll=2 -WGstats 228 6337 E -xcrossfile -unroll=2 -fast -WGstats 229 6337 E -fast -unroll=2 -WGstats 230 6337 E -xcrossfile -fast -WGstats 231 6337 E No flags -WGstats 232 6337 E -fast -unroll=2 -xcrossfile -WGstats continued on next page 179 Table E.1 (cont’d). 
Experimental Run Size (N) Algorithm Compiler Options 233 6337 E -unroll=2 -xcrossfile -fast -WGstats 234 6337 E -fast -WGstats E.2 Anova on the metrics obtained in Experiment 2 Table E.2: AN OVA Problem Algorithm Compiler Interactions Label Name Size (S) (A) Option (C) S at A S :o: C A a: C S * A at C p0 execution time Yes Yes Yes N 0 Yes Yes No p1 bread / s No N o No No No No No p2 lread / s No Yes Yes No No Yes No p3 %rcache N 0 Yes N 0 Yes No No No p4 bwrit /s Yes No Yes N o N o N o No p5 lwrit / s No Yes Yes N o No Yes No p6 %wcache Yes No No No N o N o No p7 pgout /s No Yes No No No No No p8 ppgout/s No No No No No No No p9 pgfree/s No No N o N o No No No p10 pgscan/s No N o No No No No No pl 1 atch/s No N 0 Yes No No No N 0 p12 pgin/s No No No No No No No p13 ppgin/s N o No No N o No No No p14 pflt/s No No No No No No No p15 vflt/s Yes Yes Yes No No No N o continued on next page 180 Table E.2: (cont’d). Problem Algorithm Compiler Interactions Label Name Size (S) (A) Option (C) S t A S a: C A a: C S a: A a: C p16 %usr No No Yes N o No No No p17 %sys No Yes Yes No No No No p18 ‘70in No No N o No No No No p19 %idle No N o No No No No No p20 pswch/s No No No No N o No Yes p21 c0t0d0/rps No No N o N o No No No p22 c0t0d0/wps N o N o No No No No No p23 c0t0d0/util No Yes Yes N o No Yes No p24 cpu / us No N o No No No No No p25 cpu/sy No No No No No No No p26 cpu / wt No Yes Yes No No No No p27 cpu / id No Yes Yes No No No No p28 memory / swap No No No No No No No p29 memory / free No No Yes No No No No p30 page / re No No Yes N o No N o No p31 page / mf N o N o No N o No No No p32 page / pi Yes Yes Yes No No No No p33 page / po No No No No No No No p34 page / fr No No No No No No No p35 page / sr No No No No No No No p36 disk / $0 No No No N o No No No p37 faults / in No Yes Yes No No Yes No p38 faults / sy No Yes Yes N o N o No N o continued on next page 181 Table E.2: (cont’d). Problem Algorithm Compiler Interactions Label Name Size (S) (A) Option (C) S a: A S * C A a: C S at A t C p39 faults / cs No No N o N o No No N 0 p40 cpu/us_1 No No No No No No No p41 cpu/sy-1 N o No No No No No No p42 cpu/id-l No Yes Yes No No No N o Yes implies the hypothesis is rejected at alpha level 0.05. 182 APPENDIX F Experiment 3 El Order of Execution of Experimental Runs for Experi- ment 3 This is a split-split plot design. The main plot was the repetition, where three repetitions were done. The subplots were selected at random where problem size and matrix-vector multiplication algorithm was selected. In each of these subplots, compiler Options for gen- erating the executable files were selected at random using an uniform distribution random number generator. The following table contains the actual order in which the experimental runs were performed following the randomization scheme described above. There were a total of 351 experimental runs in this experiment. Table F.1. Order of execution of experiments Experimental Run Size (N) Algorithm Compiler Options 1 6033 B -fast -WGstats 2 6033 B -unroll=2 -fast -xcrossfile ~WGstats 3 6033 B No flags -WGstats 4 6033 B -xcrossfile -fast -WGstats 5 6033 B -fast -xcrossfile -WGstats 6 6033 B -fast -xcrossfile -unroll=2 -WGstats 7 6033 B -xcrossfile -unroll=2 -fast -WGstats continued on next page 183 Table F.1 (cont’d). 
Experimental Run Size (N) Algorithm Compiler Options 8 6033 B -unroll=2 -fast -WGstats 9 6033 B -unroll=2 -WGstats 10 6033 B ~unroll=2 -xcrossfile -fast -WGstats 11 6033 B -fast -unroll=2 -WGstats 12 6033 B ~xcrossfile -fast -unroll=2 -WGstats 13 6033 B -fast -unroll=2 -xcrossfile -WGstats 14 6033 A -fast -unroll=2 -WGstats 15 6033 A -unroll=2 -fast -xcrossfile -WGstats 16 6033 A -unroll=2 -xcrossfile -fast -WGstats 17 6033 A -xcrossfile -fast -WGstats 18 6033 A -xcrossfile -fast -unroll=2 —WGstats 19 6033 A -fast -xcrossfile -WGstats 20 6033 A -fast -xcrossfile -unroll=2 -WGstats 21 6033 A -unroll=2 -WGstats 22 6033 A No flags -WGstats 23 6033 A -fast -unroll=2 -xcrossfile -WGstats 24 6033 A —unroll=2 -fast -WGstats 25 6033 A -xcrossfile -unroll=2 -fast -WGstats 26 6033 A -fast -WGstats 27 13857 B No flags -WGstats 28 13857 B -unroll=2 -xcrossfile -fast -WGstats 29 13857 B -fast -unroll=2 -WGstats 30 13857 B -fast -WGstats 31 13857 B -fast -xcrossfile -unroll=2 -WGstats 32 13857 B -xcrossfile -fast -WGstats continued on next page 184 Table F .1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 33 13857 B -xcrossflle -unroll=2 -fast ~WGstats 34 13857 B -fast -xcrossfile -WGstats 35 13857 B -unroll=2 -fast -WGstats 36 13857 B -xcrossfile -fast -unroll=2 -WGstats 37 13857 B -fast -unroll=2 -xcrossfile -WGstats 38 13857 B -unroll=2 -fast -xcrossfile -WGstats 39 13857 B -unroll=2 -WGstats 40 13857 A -unroll=2 -fast —xcrossfile -WGstats 41 13857 A -unroll=2 -WGstats 42 13857 A -xcrossfile -unroll=2 -fast -WGstats 43 13857 A -fast ~unroll=2 -xcrossfile -WGstats 44 13857 A -fast -xcrossfile -unroll=2 -WGstats 45 13857 A -fast -xcrossfile -WGstats 46 13857 A -xcrossfile -fast -unroll=2 -WGstats 47 13857 A —xcrossfile -fast -WGstats 48 13857 A -fast -WGstats 49 13857 A -unroll=2 -fast -WGstats 50 13857 A No flags -WGstats 51 13857 A -fast -unroll=2 -WGstats 52 13857 A -unroll=2 -xcrossfile -fast -WGstats 53 6337 B -unroll=2 -WGstats 54 6337 B -fast -unroll=2 -WGstats 55 6337 B -unroll=2 -xcrossfile -fast -WGstats 56 6337 B -fast -xcrossfile -unroll=2 -WGstats 57 6337 B No flags -WGstats continued on next page 185 Table F.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 58 6337 B -fast -xcrossfile ~WGstats 59 6337 B ~xcrossfile -unroll=2 -fast -WGstats 60 6337 B -unroll=2 -fast -xcrossfile -WGstats 61 6337 B -unroll=2 -fast -WGstats 62 6337 B -fast -WGstats 63 6337 B —xcrossfile -fast -WGstats 64 6337 B -xcrossfile -fast -unroll=2 -WGstats 65 6337 B -fast -unroll=2 -xcrossfile -WGstats 66 6337 A -fast -unroll=2 -xcrossfile -WGstats 67 6337 A -xcrossfile -unroll=2 -fast -WGstats 68 6337 A -fast -WGstats 69 6337 A -unroll=2 -xcrossfile -fast -WGstats 70 6337 A -xcrossfile -fast -WGstats 71 6337 A -fast ~unroll=2 -WGstats 72 6337 A -fast -xcrossfile -unroll=2 -WGstats 73 6337 A -unroll=2 -fast -WGstats 74 6337 A -unroll=2 -WGstats 75 6337 A -unroll=2 -fast -xcrossfile -WGstats 76 6337 A -fast -xcrossfile -WGstats 77 6337 A No flags -WGstats 78 6337 A -xcrossfile -fast -unroll=2 -WGstats 79 13857 A No flags -WGstats 80 13857 A -fast -unroll=2 ~WGstats 81 13857 A -xcrossfile -unroll=2 -fast -WGstats 82 13857 A -fast -WGstats continued on next page 186 Table F.1 (cont’d). 
Experimental Run Size (N) Algorithm Compiler Options 83 13857 A -xcrossfile -fast -unroll=2 -WGstats 84 13857 A -unroll=2 -fast -WGstats 85 13857 A -unroll=2 -WGstats 86 13857 A -unroll=2 -xcrossfile -fast -WGstats 87 13857 A -fast -unroll=2 -xcrossfile -WGstats 88 13857 A -xcrossfile —fast -WGstats 89 13857 A -fast -xcrossfile -WGstats 90 13857 A -unroll=2 -fast -xcrossfile -WGstats 91 13857 A -fast -xcrossfile -unroll=2 -WGstats 92 13857 B -xcrossfile -fast -unroll=2 -WGstats 93 13857 B ~xcrossfile -fast -WGstats 94 13857 B -unroll=2 -WGstats 95 13857 B -fast -unroll=2 -xcrossfile -WGstats 96 13857 B -fast -xcrossfile -WGstats 97 13857 B —fast -unroll=2 -WGstats 98 13857 B -unroll=2 -fast -WGstats 99 13857 B -unroll=2 -fast -xcrossfile -WGstats 100 13857 B -fast -WGstats 101 13857 B -fast -xcrossfile -unroll=2 -WGstats 102 13857 B -xcrossfile -unroll=2 -fast -WGstats 103 13857 B -unroll=2 -xcrossfile -fast -WGstats 104 13857 B No flags -WGstats 105 6033 A -unroll=2 -WGstats 106 6033 A -fast -xcrossfile -WGstats 107 6033 A -unroll=2 -fast -WGstats continued on next page 187 Table F.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 108 6033 A -fast -unroll=2 -WGstats 109 6033 A -xcrossfile -unroll=2 -fast -WGstats 110 6033 A -xcrossfile -fast -unroll=2 -WGstats 111 6033 A -unroll=2 -xcrossfile -fast -WGstats 112 6033 A -fast -xcrossfile -unroll=2 -WGstats 113 6033 A -fast -WGstats 114 6033 A -fast ~unroll=2 -xcrossfile —WGstats 115 6033 A -unroll=2 -fast -xcrossfile -WGstats 116 6033 A No flags -WGstats 117 6033 A -xcrossflle -fast -WGstats 118 6033 B -xcrossfile -fast -WGstats 119 6033 B -fast -xcrossfile -WGstats 120 6033 B -xcrossfile -unroll=2 -fast -WGstats 121 6033 B -fast -xcrossfile —unroll=2 -WGstats 122 6033 B -fast -unroll=2 -xcrossfile -WGstats 123 6033 B -unroll=2 -fast -xcrossfile -WGstats 124 6033 B ~unroll=2 -WGstats 125 6033 B -fast -unroll=2 -WGstats 126 6033 B -unroll=2 -fast -WGstats 127 6033 B -xcrossfile -fast -unroll=2 -WGstats 128 6033 B -unroll=2 -xcrossflle -fast -WGstats 129 6033 B -fast -WGstats 130 6033 B No flags -WGstats 131 6337 A -fast -unroll=2 -WGstats 132 6337 A -unroll=2 -fast ~xcrossfile -WGstats continued on next page 188 Table F.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 133 6337 A -xcrossfile -unroll=2 -fast -WGstats 134 6337 A -fast -unroll=2 -xcrossfile -WGstats 135 6337 A No flags -WGstats 136 6337 A -unroll=2 -WGstats 137 6337 A -fast -WGstats 138 6337 A -unroll=2 -xcrossfile -fast -WGstats 139 6337 A -fast -xcrossfile -WGstats 140 6337 A -xcrossfile -fast -unroll=2 -WGstats 141 6337 A -xcrossfile -fast -WGstats 142 6337 A -unroll=2 -fast -WGstats 143 6337 A -fast -xcrossfile ~unroll=2 -WGstats 144 6337 B -fast -WGstats 145 6337 B -fast -xcrossfile -unroll=2 -WGstats 146 6337 B -unroll=2 -WGstats 147 6337 B -xcrossfile -unroll=2 -fast -WGstats 148 6337 B -xcrossfile -fast -WGstats 149 6337 B -xcrossfile -fast -unroll=2 -WGstats 150 6337 B -fast -unroll=2 -xcrossfile -WGstats 151 6337 B -unroll=2 -fast -WGstats 152 6337 B -fast -xcrossfile -WGstats 153 6337 B -unroll=2 -fast -xcrossfile -WGstats 154 6337 B No flags -WGstats 155 6337 B -fast -unroll=2 -WGstats 156 6337 B -unroll=2 -xcrossfile -fast -WGstats 157 6337 B -fast -xcrossfile -unroll=2 -WGstats continued on next page 189 1- “tflfl'. . Table F .1 (cont’d). 
Experimental Run Size (N) Algorithm Compiler Options 158 6337 B -xcrossfile -unroll=2 -fast -WGstats 159 6337 B -fast -xcrossfile -WGstats 160 6337 B -xcrossfile -fast -WGstats 161 6337 B —unroll=2 -xcrossfile -fast ~WGstats 162 6337 B -fast ~unroll=2 -WGstats 163 6337 B -unroll=2 -fast -WGstats 164 6337 B No flags -WGstats 165 6337 B -unroll=2 -WGstats 166 6337 B -unroll=2 -fast -xcrossfile -WGstats 167 6337 B -fast -unroll=2 -xcrossfile -WGstats 168 6337 B -fast -WGstats 169 6337 B -xcrossfile -fast -unroll=2 -WGstats 170 6337 A -fast -unroll=2 -xcrossfile -WGstats 171 6337 A No flags -WGstats 172 6337 A -unroll=2 -WGstats 173 6337 A -xcrossfile -fast -unroll=2 -WGstats 174 6337 A -fast -xcrossflle -unroll=2 -WGstats 175 6337 A -fast -unroll=2 —WGstats 176 6337 A -xcrossfile -fast -WGstats 177 6337 A -fast -WGstats 178 6337 A -unroll=2 -fast -xcrossfile -WGstats 179 6337 A -fast -xcrossfile -WGstats 180 6337 A -unroll=2 -xcrossfile -fast -WGstats 181 6337 A -unroll==2 -fast -WGstats 182 6337 A -xcrossfile -unroll=2 -fast -WGstats continued on next page 190 Table F.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 183 6033 B -xcrossflle -fast -unroll=2 -WGstats 184 6033 B -fast -xcrossfile -unroll=2 ~WGstats 185 6033 B -fast -WGstats 186 6033 B -unroll=2 -xcrossfile -fast -WGstats 187 6033 B -unroll=2 -fast -WGstats 188 6033 B -unroll=2 -WGstats 189 6033 B -fast -xcrossfile -WGstats 190 6033 B No flags -WGstats 191 6033 B -xcrossfile -fast -WGstats 192 6033 B -unroll=2 -fast ~xcrossfile -WGstats 193 6033 B -xcrossfile -unroll=2 -fast -WGstats 194 6033 B -fast -unroll=2 -WGstats 195 6033 B -fast -unroll=2 -xcrossfile -WGstats 196 6033 A -xcrossfile -fast -unroll=2 -WGstats 197 6033 A -fast -WGstats 198 6033 A -xcrossfile -fast -WGstats 199 6033 A -fast -xcrossfile -unroll=2 -WGstats 200 6033 A -fast -unroll=2 -xcrossfile -WGstats 201 6033 A -unroll=2 -fast -xcrossfile -WGstats 202 6033 A -xcrossfile -unroll=2 -fast -WGstats 203 6033 A -unroll=2 -fast -WGstats 204 6033 A No flags -WGstats 205 6033 A -fast -xcrossfile ~WGstats 206 6033 A ~unroll=2 -WGstats 207 6033 A -unroll=2 -xcrossfile -fast -WGstats continued on next page 191 Table F.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 208 6033 A -fast -unroll=2 -WGstats 209 13857 B -xcrossfile -unroll=2 ~fast -WGstats 210 13857 B -fast -xcrossfile -unroll=2 -WGstats 211 13857 B -fast -unroll=2 -xcrossfile -WGstats 212 13857 B -unroll=2 -fast -WGstats 213 13857 B —fast —WGstats 214 13857 B -fast -xcrossfile -WGstats 215 13857 B -xcrossfile -fast -unroll=2 -WGstats 216 13857 B -xcrossfile -fast -WGstats 217 13857 B -unroll=2 -xcrossfile -fast -WGstats 218 13857 B -unroll=2 -fast -xcrossfile -WGstats 219 13857 B No flags -WGstats 220 13857 B -fast ~unroll=2 -WGstats 221 13857 B -unroll=2 -WGstats 222 13857 A -unroll=2 ~fast -xcrossfile -WGstats 223 13857 A ~fast -unroll=2 ~xcrossfile -WGstats 224 13857 A -fast -WGstats 225 13857 A -xcrossfile -unroll=2 -fast -WGstats 226 13857 A -xcrossfile ~fast -unroll=2 -WGstats 227 13857 A -fast -xcrossfile -unroll=2 -WGstats 228 13857 A -fast -unroll=2 -WGstats 229 13857 A -unroll=2 -WGstats 230 13857 A -unroll=2 -fast -WGstats 231 13857 A -fast -xcrossfile -WGstats 232 13857 A No flags -WGstats continued on next page 192 Table F.1 (cont’d). 
Experimental Run Size (N) Algorithm Compiler Options 233 13857 A -xcrossfile -fast -WGstats 234 13857 A -unroll=2 -xcrossfile -fast -WGstats 235 6337 C -unroll=2 -xcrossfile ~fast -WGstats 236 6337 C -xcrossfi1e -fast -unroll=2 -WGstats 237 6337 C -fast -xcrossfile -unroll=2 -WGstats 238 6337 C -unroll=2 -fast -xcrossfile -WGstats 239 6337 C -xcrossfile -unroll=2 -fast -WGstats 240 6337 C -unroll=2 -fast -WGstats 241 6337 C -fast -WGstats 242 6337 C -fast -unroll=2 -xcrossflle -WGstats 243 6337 C -fast -unroll=2 -WGstats 244 6337 C -fast -xcrossfile -WGstats 245 6337 C No flags -WGstats 246 6337 C -xcrossfile -fast -WGstats 247 6337 C -unroll=2 -WGstats 248 6033 C No flags -WGstats 249 6033 C -fast -WGstats 250 6033 C -unroll=2 -fast -xcrossfile -WGstats 251 6033 C -xcrossfile -fast -unroll=2 -WGstats 252 6033 C -fast ~unroll=2 -xcrossfile -WGstats 253 6033 C -unroll=2 —fast -WGstats 254 6033 C -xcrossfile -unroll=2 -fast -WGstats 255 6033 C -unroll=2 -WGstats 256 6033 C -xcrossfile -fast -WGstats 257 6033 C -unroll=2 -xcrossfile -fast -WGstats continued on next page 193 Table F.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 258 6033 C —fast -xcrossfile -WGstats 259 6033 C -fast -unroll=2 -WGstats 260 6033 C -fast -xcrossfile -unroll=2 -WGstats 261 13857 C -unroll=2 -WGstats 262 13857 C -fast ~xcrossfile ~WGstats 263 13857 C No flags -WGstats 264 13857 C -unroll=2 -xcrossflle -fast -WGstats 265 13857 C -xcrossfile -fast -unroll=2 -WGstats 266 13857 C -fast -unroll=2 -WGstats 267 13857 C -xcrossfile -fast -WGstats 268 13857 C -fast -xcrossfile -unroll=2 -WGstats 269 13857 C -xcrossfile -unroll=2 -fast -WGstats 270 13857 C -unroll=2 -fast -WGstats 271 13857 C -fast -unroll=2 -xcrossfile -WGstats 272 13857 C -fast -WGstats 273 13857 C -unroll=2 -fast -xcrossfile -WGstats 274 6337 C -unroll=2 -fast -WGstats 275 6337 C -unroll=2 -xcrossfile -fast -WGstats 276 6337 C -fast -unroll=2 -xcrossflle -WGstats 277 6337 C -fast -xcrossfile -unroll=2 -WGstats 278 6337 C -xcrossfile -fast -unroll=2 -WGstats 279 6337 C -fast ~WGstats 280 6337 C -fast -unroll=2 -WGstats 281 6337 C -xcrossfile -unroll=2 -fast -WGstats 282 6337 C -xcrossfile -fast -WGstats continued on next page 194 Table F.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 283 6337 C -unroll=2 -WGstats 284 6337 C -unroll=2 -fast -xcrossflle -WGstats 285 6337 C No flags -WGstats 286 6337 C -fast -xcrossfile -WGstats 287 13857 C -unroll=2 -fast -WGstats 288 13857 C -fast -WGstats 289 13857 C -fast -xcrossfile -WGstats 290 13857 C -fast -xcrossfile -unroll=2 -WGstats 291 13857 C -xcrossfile -fast -WGstats 292 13857 C -unroll=2 -WGstats 293 13857 C -xcrossfile -fast -unroll=2 -WGstats 294 13857 C -fast -unroll=2 -WGstats 295 13857 C -unroll=2 -fast -xcrossfile -WGstats 296 13857 C -xcrossfile -unroll=2 -fast -WGstats 297 13857 C -unroll=2 -xcrossfile -fast -WGstats 298 13857 C No flags -WGstats 299 13857 C -fast -unroll=2 -xcrossfile -WGstats 300 6033 C -xcrossfile -fast -unroll=2 -WGstats 301 6033 C No flags -WGstats 302 6033 C -fast -xcrossfile -WGstats 303 6033 C -fast -WGstats 304 6033 C -xcrossflle -fast -WGstats 305 6033 C -xcrossfile -unroll=2 -fast -WGstats 306 6033 C -fast -unroll=2 -xcrossfile -WGstats 307 6033 C -unroll=2 -fast -xcrossfile -WGstats continued on next page 195 Table F.1 (cont’d). 
Experimental Run Size (N) Algorithm Compiler Options 308 6033 C «fast -unroll=2 -WGstats 309 6033 C -unroll=2 -WGstats 310 6033 C -fast -xcrossfile -unroll=2 -WGstats 311 6033 C -unroll=2 -fast -WGstats 312 6033 C -unroll=2 -xcrossfile -fast -WGstats 313 6033 C -fast -unroll=2 -xcrossfile -WGstats 314 6033 C -xcrossfile -fast —WGstats 315 6033 C -fast -WGstats 316 6033 C -fast -unroll=2 -WGstats 317 6033 C -unroll=2 -xcrossfile -fast -WGstats 318 6033 C No flags -WGstats 319 6033 C -unroll=2 -fast -WGstats 320 6033 C -fast -xcrossfile -unroll=2 ~WGstats 321 6033 C -unroll=2 -fast -xcrossfile -WGstats 322 6033 C -fast -xcrossfile -WGstats 323 6033 C -xcrossfile -unroll=2 -fast -WGstats 324 6033 C -xcrossfile -fast -unroll=2 -WGstats 325 6033 C -unroll=2 -WGstats 326 6337 C -fast -unroll=2 -WGstats 327 6337 C -xcrossfile -fast -unroll=2 -WGstats 328 6337 C -fast -xcrossfile -WGstats 329 6337 C -xcrossflle -unroll=2 -fast -WGstats 330 6337 C -unroll=2 -fast -WGstats 331 6337 C No flags -WGstats 332 6337 C -unroll=2 -WGstats continued on next page 196 Table F.1 (cont’d). Experimental Run Size (N) Algorithm Compiler Options 333 6337 C -xcrossfile -fast -WGstats 334 6337 C -fast -WGstats 335 6337 C -unroll=2 -fast -xcrossfile -WGstats 336 6337 C -unroll=2 -xcrossfile -fast -WGstats 337 6337 C -fast -unroll=2 -xcrossfile -WGstats 338 6337 C -fast -xcrossfile -unroll=2 -WGstats 339 13857 C -unroll=2 -xcrossfile -fast -WGstats 340 13857 C No flags -WGstats 341 13857 C -fast -xcrossfile -unroll=2 -WGstats 342 13857 C ~unroll=2 -fast -xcrossfile -WGstats 343 13857 C -unroll=2 -WGstats 344 13857 C -xcrossfile -fast -unroll=2 -WGstats 345 13857 C -unroll=2 -fast -WGstats 346 13857 C -fast -xcrossfile -WGstats 347 13857 C -fast -unroll=2 -xcrossfile -WGstats 348 13857 C -xcrossfile -unroll=2 -fast -WGstats 349 13857 C -fast -unroll=2 -WGstats 350 13857 C -xcrossfile -fast -WGstats 351 13857 C -fast -WGstats 197 F.2 Anova on the metrics obtained in Experiment 3 Table F.2: ANOVA Problem Algorithm Compiler Interactions Label Name Size (S) (A) Option (C) S It A S a: C A at C S a: A It C q0 execution time Yes Yes Yes No No Yes No ql bread / s N o No No No No No No q2 lread /s No Yes Yes No No Yes No q3 %rcache N o N o N o No N o N o No q4 bwrit /s No Yes Yes N 0 Yes N o No q5 lwrit/s No Yes Yes N o No Yes No q6 %wcache N 0 Yes Yes No N o No N o q7 pgout /s No No Yes N 0 Yes No No q8 ppgout /s No No Yes No Yes N o No q9 pgfree/s No No No No No No No q10 pgscan/s No No No No No No No qll atch /s No Yes Yes No No Yes No q12 pgin/s No No No No No No No q13 ppgin/s No No No No No No No q14 pflt/s N 0 Yes No No No No No q15 vflt/s Yes Yes Yes N 0 Yes Yes Yes q16 %usr No Yes Yes No No Yes No q17 %sys Yes Yes Yes N o No Yes No Q18 ‘70in No No No No No No No q19 %idle No Yes Yes No No Yes No q20 pswch / 3 Yes Yes Yes N o No Yes N o continued on next page 198 Table F.2: (cont’d). 
Problem Algorithm Compiler Interactions Label Name Size (S) (A) Option (C) S * A S at C A * C S t A at C Q21 de/wps No Yes No No No No No Q22 de/util N 0 Yes No No No No N o Q23 cOtOdO/rps No No No No N o No No Q24 c0t0d0/wps No Yes Yes No Yes Yes Yes q25 c0t0d0/util No Yes Yes No No Yes No q26 c0t1d0/rps N o No No N o No No No q27 c0t1d0/wps No No No No No No No Q28 c0t1d0/util N o No No No No No No Q29 c1t6d0/wps Yes Yes Yes Yes No Yes Yes Q30 c1t6d0/util Yes Yes Yes Yes No Yes Yes Q31 cpu / us No Yes Yes N o No Yes No q32 cpu /sy Yes Yes Yes No N 0 Yes N o Q33 cpu / wt No No No No No No No q34 cpu/ id No Yes Yes No N 0 Yes No q35 memory / swap N 0 Yes Yes No N 0 Yes No Q36 memory / free No Yes Yes N o No N o N o Q37 page / re No Yes Yes No No Yes No Q38 page / mf Yes Yes Yes No Yes Yes Yes Q39 page / pi No N o No No No No N o Q40 page / po No No Yes No Yes N o No Q41 page / fr No No No No No No N o Q42 page / sr No No No No No N o No Q43 disk /sO Yes Yes Yes No Yes Yes No continued on next page 199 Table F.2: (cont’d). Problem Algorithm Compiler Interactions Label Name Size (S) (A) Option (C) S * A S =0: C A 2k C S a: A * C Q44 disk / 31 No No No N o No No No Q45 disk / 32 N 0 Yes No No No N o No Q46 faults / in No Yes Yes Yes No Yes No Q47 faults /sy Yes Yes Yes No N 0 Yes No Q48 faults / cs Yes Yes Yes No No Yes No Q49 cpu/usl N 0 Yes Yes No No Yes No Q50 cpu /sy1 Yes Yes Yes No No Yes N o Q51 cpu / id 1 No Yes Yes No N 0 Yes No Yes implies the hypothesis is rejected at alpha level 0.05. 200 APPENDIX C Experiment 4 G.1 Order of Execution of Experimental Runs for Experi- ment 4 This is a fully randomized full-factorial design. The following table contains the actual order in which the experimental runs were performed following a fully randomized scheme. There were a total of 96 experimental runs in this experiment. Table C.1. Order of execution of experiments Experimental Run Size Compiler Option Algorithm Data Structure 1 2 -fast -O5 G Col-by-col 2 1 No flags G Col-by-col 3 1 No flags F Row-by-row 4 1 -fast G Col-by-col 5 2 -fast G Row-by-row 6 2 -05 A Row-by-row 7 2 No flags A Row-by-row 8 1 -05 G Col-by—col 9 2 -O5 G Row-by-row 10 1 -fast -05 F Row-by-row 1 l 2 -fast A Row- by-row continued on next page 201 Table G.1 (cont’d). Experimental Run Size Compiler Option Algorithm Data Structure 12 1 -O5 A Col-by-col 13 2 No flags G Row-by-row 14 1 -fast ~05 F Col-by-col l5 2 No flags F Row-by-row l6 1 N 0 flags A Col-by-col 17 2 -05 F Col-by-col 18 2 No flags F Col-by-col 19 2 No flags F Col-by-col 20 2 -O5 G Row-by-row 21 2 -fast G Row-by-row 22 2 No flags A Col-by-col 23 2 -fast G Col-by-col 24 1 No flags F Col-by-col 25 1 No flags A Col-by-col 26 2 -O5 A Row-by-row 27 2 —fast -O5 A Col-by-col 28 1 -fast F Row-by-row 29 2 -fast -05 F Col-by-col 30 1 -05 F Row-by-row 31 1 -fast F Row-by-row 32 1 -fast -05 G Row-by-row 33 1 -OS F Col-by-col 34 2 -fast -05 G Row-by-row 35 1 N 0 flags G Row-by-row 36 1 -fast -05 A Col-by-col continued on next page 202 Table G.1 (cont’d). 
Experimental Run Size Compiler Option Algorithm Data Structure 37 1 -fast -O5 F Row-by-row 38 1 -fast F Col-by-col 39 1 -fast -O5 A Col-by-col 40 2 -O5 G Col-by-col 41 1 -fast A Row-by-row 42 1 -fast G Col—by-col 43 2 -fast -05 F Col-by-col 44 2 -05 F Row-by-row 45 2 No flags G Col-by-col 46 1 -05 A Row-by-row 47 1 -05 A Row—by-row 48 2 No flags A Col-by-col 49 1 -fast -O5 A Row-by-row 50 2 -O5 F Col-by-col 51 2 No flags A Row-by-row 52 1 -fast A Col-by-col 53 1 -fast A Col-by-col 54 2 -fast F Col-by-col 55 1 -05 G Col-by-col 56 2 ~fast G Col-by-col 57 1 -fast -05 G Col-by-col 58 1 No flags A Row-by-row 59 2 -fast -05 A Col—by-col 60 2 -fast A Col-by-col 61 2 -fast -O5 A Row-by-row continued on next page 203 Table G.1 (cont’d). Experimental Run Size Compiler Option Algorithm Data Structure 62 2 -fast F Row-by-row 63 2 -fast -05 A Row-by-row 64 1 -fast —O5 F Col-by-col 65 2 N 0 flags G Row-by-row 66 1 -fast -O5 G Row-by—row 67 1 No flags F Row-by-row 68 1 No flags G Row-by-row 69 1 -fast F Col-by-col 70 1 -fast A Row-by-row 71 2 -fast A Row-by-row 72 1 -05 G Row-by-row 73 2 -O5 A Col-by-col 74 1 —O5 F Row-by-row 75 2 -fast ~O5 G Row—by-row 76 2 N 0 flags G Col-by-col 77 2 -O5 A Col-by-col 78 1 No flags A Row-by-row 79 1 -fast G Row-by-row 80 2 -fast A Col-by-col 81 1 -fast -O5 A Row-by-row 82 2 -fast F Row-by-row 83 1 -fast G Row-by-row 84 1 -05 A Col-by-col 85 1 -fast -O5 G Col-by-col 86 2 -fast -05 F Row-by—row continued on next page 204 Table G.1 (cont’d). Experimental Run Size Compiler Option Algorithm Data Structure 87 1 ~05 G Row-by-row 88 1 No flags F Col-by-col 89 2 -fast F Col-by-col 90 1 -O5 F Col-by-col 91 1 No flags G Col-by-col 92 2 -fast -05 F Row-by-row 93 2 -O5 F Row-by-row 94 2 -fast -05 G Col-by-col 95 2 -05 G Col-by-col 96 2 N 0 flags F Row-by-row G.2 Anova on the metrics obtained in Experiment 4 Table G.2: ANOVA - main factors effect in experiment 4 Problem Compiler Algorithm Data Label Name Size (S) Option (C) (A) Structure (D) n0 Execution time Yes Yes Yes Yes n1 lread/s No N o No No n2 bwrit/s Yes Yes No Yes n3 lwrit/s No No N o N 0 n4 %wcache No N 0 Yes Yes n5 pgout/s Yes Yes Yes Yes n6 ppgout/s Yes Yes Yes Yes n7 pgfree/s Yes Yes Yes Yes continued on next page 205 Table G.2: (cont’d). Problem Compiler Algorithm Data Label Name Size (S) Option (C) (A) Structure (D) n8 atch /s Yes No Yes Yes n9 pgin /s No No No No n10 ppgin/s No N o No No n 1 1 pflt /s Yes No Yes Yes n 1 2 vflt /s Yes No Yes Yes n13 ‘76 usr Yes Yes Yes Yes n 14 %sys Yes No Yes Yes 11 15 % wio Yes No Yes Yes 1116 %idle Yes Yes Yes Yes n17 pswch / 8 Yes N 0 Yes Yes 11 18 c0t0d0/ Wps Yes Yes Yes Yes n19 c0t0d0/ util Yes Yes Yes Yes 1120 c1t1d0/wps N o No Yes No n21 cltldO/util No No No No 1122 memory / swap No No No No n23 memory / free No No Yes No n24 page / re No Yes Yes Yes n25 page / mf Yes Yes Yes Yes n26 page / pi No No No No n27 page / po Yes Yes Yes Yes n28 page / fr Yes Yes Yes Yes n29 disk / 30 Yes Yes Yes Yes n30 faults / in N o N 0 Yes No continued on next page 206 Table G.2: (cont’d). Problem Compiler Algorithm Data Label Name Size (S) Option (C) (A) Structure (D) n31 faults / sy Yes Yes Yes Yes n32 faults / cs Yes Yes Yes Yes n33 cpu / us Yes Yes Yes Yes n34 cpu / sy Yes No Yes Yes n35 cpu / id Yes Yes Yes Yes Yes implies the hypothesis is rejected at alpha level 0.05. Table G.3. 
ANOVA - two term interaction effect in experiment 4 Label Name S*C S*A S*D C*A C*D A*D n0 Execution time Yes Yes Yes Yes Yes Yes n1 lread/s No No N o No No N 0 n2 bwrit /s Yes Yes Yes Yes N 0 Yes n3 lwrit/s No No No No No No n4 %wcache Yes Yes N o N o N 0 Yes n5 pgout/s N 0 Yes Yes Yes No Yes n6 ppgout /s No Yes Yes No Yes Yes n7 pgfree/ s N 0 Yes Yes N 0 Yes Yes n8 atch /s Yes Yes Yes No N 0 Yes n9 pgin /s No No N o No No No n10 ppgin/s N o No No No No No n1 1 pflt /s Yes Yes Yes N o No Yes n12 vflt/s Yes Yes Yes No No Yes continued on next page 207 Table G.3 (cont’d). Label Name S*C S*A S*D C*A C*D A*D n13 %usr Yes Yes Yes Yes Yes Yes n 14 %sys No Yes No No No No n15 ‘70in Yes Yes Yes No No Yes n 16 %idle Yes Yes Yes Yes Yes Yes n17 pswch /s N 0 Yes Yes N o No Yes n 18 c0t0d0/ WpS Yes Yes Yes Yes Yes Yes n19 c0t0d0/util Yes Yes Yes Yes Yes Yes n20 c1t1d0/wps N 0 Yes No No No Yes n21 cltldO/util No No No No No No n22 memory / swap N 0 Yes No N o N o No n23 memory / free N o No No No No No n24 page / re Yes No No Yes Yes Yes n25 page / mf Yes Yes N 0 Yes Yes Yes n26 page / pi No No No No No No n27 page / p0 Yes Yes No N o N 0 Yes n28 page / fr Yes Yes N o N o No Yes n29 disk / 30 Yes Yes Yes Yes Yes Yes n30 faults / in N o N o No No No No n31 faults / sy Yes Yes Yes Yes Yes Yes n32 faults / cs Yes Yes Yes Yes Yes Yes n33 Cpu / us Yes Yes Yes Yes Yes Yes n34 cpu/sy N o N o No No N 0 Yes n35 cpu / id Yes Yes Yes Yes Yes Yes Yes implies the hypothesis is rejected at alpha level 0.05. 208 Table G.4. AN OVA - three and four term interaction effect in experiment 4 Label Name S*C*A S*C*D C*A*D S*C*A*D n0 Execution time Yes Yes Yes Yes n1 lread/s No No N o No n2 bwrit /s Yes Yes Yes Yes n3 lwrit/s No No No No n4 %wcache Yes No No N 0 n5 Pgout/s Yes N o N 0 Yes n6 ppgout /s Yes Yes Yes Yes n7 pgfree/s Yes Yes Yes Yes n8 atch / 3 Yes Yes No Yes 119 pgin/s No No No No n10 ppgin/s No No No No n11 pflt/s Yes Yes No Yes n12 vflt /s Yes Yes N 0 Yes n13 %usr Yes Yes Yes Yes n14 %sys No No N 0 Yes n15 %Wio Yes Yes Yes Yes n16 %idle Yes Yes Yes Yes n17 pswch/s No No N 0 Yes n18 c0t0d0/wps Yes Yes Yes Yes n19 c0t0d0/util Yes Yes Yes Yes n20 c1t1d0/wps No No No No n21 c1t1d0/util No No No No n22 memory / swap N o N o N o No continued on next page 209 Table G.4 (cont’d). Label Name S*C*A S*C*D C*A*D S*C*A*D n23 memory / free No No No No n24 page / re Yes Yes Yes Yes n25 page / mf Yes Yes No Yes n26 page / pi No N 0 N o No n27 page / p0 Yes Yes No Yes n28 page / fr Yes Yes No Yes n29 disk / 50 Yes Yes Yes Yes n30 faults / in No No No No n3 1 faults / sy Yes Yes Yes Yes n32 faults /cs Yes Yes Yes Yes n33 cpu / us Yes Yes Yes Yes n34 cpu/sy No Yes No No n35 cpu / id Yes Yes Yes Yes Yes implies the hypothesis is rejected at alpha level 0.05. 210 APPENDIX H Additional Fortran files H.1 Program to test new routines This is a Fortran 77 program used to test the correctness of new routines used to perform matrix-vector multiplication. In this particular program, a routine presented in by Golub and Van Loan in their book Matrix Computations was tested. 
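For orientation before the listing: both routines below operate on the same packed storage scheme, in which the upper triangle of the symmetric N-by-N matrix is stored row by row in the one-dimensional array Ybi, so the array holds N(N+1)/2 entries (20100 for N = 200, matching the declaration). As can be read off the index computations in the code, the entry in row i and column j with i <= j sits at position

    index(i, j) = (i - 1)N - i(i - 1)/2 + j.

BiMATVECCav1 reaches the same positions through the precomputed row offsets in BIrowEndPoint, while BiMATVECCav2 evaluates the formula directly inside a column-oriented loop, following Golub and Van Loan.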
      Program Testing
c     Testing BiMATVECCav with a matrix read from a file.
c     Doing a matrix-vector multiplication where the matrix is
c     symmetric and the upper triangle is saved row by row.
      Integer apUnk,i
      Integer MaxapUnk,tot
      Complex*16 temp1(200),temp2(200),bestGuess(200)
      Complex*16 error(200), Ybi(20100)
c
c     Ybi holds the packed triangle of the symmetric matrix
c
      apUnk = 200
      MaxapUnk = 30000
      Open(unit=27,file='matrixIn',status='old')
      tot=0
      Do i = 1 , MaxapUnk
         Read(27,fmt=*,end=99)Ybi(i)
         tot = tot + 1
c        Print*,tot,i,Ybi(i)
      EndDo
 99   close(unit=27)
c
c     Print the matrix
      Do i = 1, tot
         Print*,i,Ybi(i)
      EndDo
      Open(unit=28,file='vectorIn',status='old')
      tot=0
      Do i = 1 , MaxapUnk
         Read(28,fmt=*,end=79)bestGuess(i)
         tot = tot + 1
      EndDo
 79   close(unit=28)
c     Print the vector
      Do i = 1, tot
         Print*,i,bestGuess(i)
      EndDo
c     Calling the original version of BiMATVECCav
      Call BiMATVECCav1(apUnk,Ybi,bestGuess,temp1)
      Print*,'Temp1 Original'
      Do i = 1,apUnk
         Print*,temp1(i)
      EndDo
c     Calling the modified version of BiMATVECCav
      Call BiMATVECCav2(apUnk,Ybi,bestGuess,temp2)
      Print*,'Temp2 Modified'
      Do i = 1, apUnk
         Print*,temp2(i)
      EndDo
      Print*,'Error'
      Do i = 1, apUnk
         error(i)=temp1(i)-temp2(i)
         Print*,error(i)
      EndDo
      End
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
      Subroutine BiMATVECCav1(apUnk,Ybi,vector,product)
c     First algorithm to solve in parallel with OpenMP
      Integer apUnk
      Integer BIrowEndPoint(200)
      Complex*16 Ybi(*),vector(*),product(*)
c
c     Local variables
c
      Integer row,col,index
      Complex*16 matEntry
c
c     Load the BIrowEndPoint offset vector
c
      BIrowEndPoint(1) = 0
      Do row = 2,apUnk
         BIrowEndPoint(row) = BIrowEndPoint(row-1)+
     &                        (apUnk - row+1)
      EndDo
c
c     Do the MATVEC
c
      Do row = 1,apUnk
         Do col = 1,apUnk
            if (row .LT. col) then
               index = BIrowEndPoint(row)+col
            else
               index = BIrowEndPoint(col)+row
            endif
            matEntry = Ybi(index)
            product(row) = product(row) + matEntry*vector(col)
         EndDo
      EndDo
      Return
      End
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
      Subroutine BiMATVECCav2(apUnk,Ybi,vector,product)
c     Golub & Van Loan algorithm to solve in parallel with OpenMP
      Integer apUnk
      Complex*16 Ybi(*),vector(*),product(*)
c
c     Local variables
c
      Integer row,col,index
c
c     Do the MATVEC
c
      Do col = 1, apUnk
         Do row = 1, col-1
            index = (row-1)*apUnk - row*(row-1)/2 + col
c           Write(6,*)'index', index
            product(row) = product(row)+Ybi(index)*vector(col)
         EndDo
         Do row = col, apUnk
            index = (col-1)*apUnk - col*(col-1)/2 + row
c           Write(6,*)'index', index
            product(row) = product(row) + Ybi(index)*vector(col)
         EndDo
      EndDo
      Return
      End

APPENDIX I

Matlab Files

I.1 Program to compute order of experimental runs

This example Matlab code computes the order in which the experimental runs are executed. Since it uses a random number generator, each invocation produces a different run order.

% Program to generate order of experiments
% For the split-plot design where first the size is selected,
% then the algorithm, and last the compiler options.
% Cannot use Minitab 13 since max no. of levels allowed is 9.
% The output is a matrix with 3 columns where column 1 is size,
% column 2 is algorithm, and column 3 is compiler option.
% Nayda G. Santiago
% July 20, 2001
clear
r=input('What is the name of the output file? ','s');
diary(r);
rand('state',sum(100*clock));
experiments=[];
number_alg=1;    % Number of levels in algorithms
number_co=13;    % Number of levels in compiler options
number_si=3;     % Number of levels in size
number_rep=3;    % Number of repetitions of the experiment
% Number of experiments
number_exp=number_alg*number_si*number_rep;
for p=1:number_exp,
    a=200*rand(1,200);
    b=mod(a,number_co);
    c=ceil(b);
    exper(1)=c(1);
    k=2;
    for i=2:number_co,
        exper(i)=c(k);
        k=k+1;
        j=1;
        while j <= (i-1),
            if exper(i) == exper(j),
                exper(i)=c(k);
                k=k+1;
                j=1;
            else
                j=j+1;
            end
        end
    end
    experiments=[experiments;exper];
end
sizes=[];
for p=1:number_rep
    a=300*rand(1,200);
    b=mod(a,3);
    c=ceil(b);
    tamano(1)=c(1);
    k=2;
    for i=2:number_si,
        tamano(i)=c(k);
        k=k+1;
        j=1;
        while j <= (i-1),
            if tamano(i) == tamano(j),
                tamano(i)=c(k);
                k=k+1;
                j=1;
            else
                j=j+1;
            end
        end
    end
    sizes=[sizes;tamano];
end
algorth=[3];    % Algorithm C is 3
% Computing vector containing experiments for perl file
% Each row will be [Size, Algorithm, Compiler Option]
s=reshape(sizes',number_rep*number_si,1);  % convert size into a column
ord_exp=[];
for i=1:number_si*number_rep
    for k=1:number_co    % We only have one algorithm
        elem=[s(i) algorth experiments(i,k)];
        ord_exp=[ord_exp;elem];
    end
end
sizes
algorth
experiments
ord_exp
diary off

I.2 Routine to compute entropy cost function

function entropy=entropydash(A)
% Function to compute entropy as defined by Dash et al. in
% M. Dash, H. Liu, and J. Yao, "Dimensionality Reduction of
% Unsupervised Data," Proc. of the 9th IEEE Intl. Conference on
% Tools with Artificial Intelligence, pp. 532-539, Nov. 1997.
%
[N,M]=size(A);
normalization=max(A)-min(A);
for i=1:N
    normalizationMatrix(i,:)=normalization;
end
normalizedA=A./normalizationMatrix;   % Normalization
D=dist(normalizedA');
k=1;
for i=1:N
    for j=i+1:N
        v(k)=D(i,j);
        k=k+1;
    end
end
Daverage= mean(v);
alpha=-log(0.5)/Daverage;   % Alpha used in equation (2)
for i=1:N
    for j=1:N
        S(i,j)= exp(-(alpha*D(i,j)));
    end
end
for i=1:N
    for j=1:N
        if S(i,j) == 1
            H(i,j)=0;
        else
            % Equation 1 from paper
            H(i,j)=S(i,j)*log2(S(i,j))+(1-S(i,j))*log2(1-S(i,j));
        end
    end
end
entropy = -(sum(sum(H)));

I.3 Routine to show scree test and the Kaiser-Guttman criteria

function [total] = KG_scree(data,names);
% data - matrix of unnormalized data
%************************************
[rows,cols] = size(data);
% Normalize data using norm 2
for i=1:cols
    normalization(i)=norm(data(:,i));
end
normalizeddata =data./(ones(rows,1)*normalization);
A=normalizeddata;
% End Normalization
% Plot the eigenvalues of the correlation matrix
% Kaiser-Guttman method: eig > 1
correlacion= corrcoef(A);
EigValues=eig(correlacion);
EigValues=flipud(EigValues);
kg=1;
for i=1:cols
    if (EigValues(i) > 1)
        kg=i;
    end
end
disp(sprintf('There are %d eigenvalues larger than one.',kg));
w=20;   % How many eigenvalues to plot
if (w < cols)
    plot(EigValues(1:w),'b-');
    hold on
    line=ones(1,20);
    plot(EigValues(1:w),'ro');
    plot(line)
    ylabel('Value');
    title('Correlation Matrix Eigenvalues');
    grid
    hold off
else
    disp('You have requested too many eigenvalues')
end

I.4 Program to validate intrinsic dimensionality estimators

This Matlab code generates a synthetic data set with the same column means and standard deviations as the data set from Experiment 2. The first nine columns of the matrix are independent; every additional column is a multiple of one of the first nine plus a small amount of noise, which forces the matrix to have an intrinsic dimension of nine. The code then applies all three dimensionality estimation methods to the synthetic data set to check whether they recover the known dimension.
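To make the validation idea concrete, the following short sketch (written in Python with NumPy; it is not part of the original Matlab sources, and the row count and scaling constants are arbitrary) plants an intrinsic dimension of nine in the same way as the Matlab listing that follows and then applies one of the estimators used above, the Kaiser-Guttman count of correlation-matrix eigenvalues greater than one from Section I.3. The count should come back as nine.

# Illustrative sketch only: plant a known intrinsic dimension of nine and
# recover it with the Kaiser-Guttman criterion (eigenvalues of the column
# correlation matrix that exceed one).
import numpy as np

rng = np.random.default_rng(0)
rows = 234                                   # arbitrary number of observations
base = 250 * rng.random((rows, 9))           # nine independent columns
noise = 1e-4 * rng.standard_normal(base.shape)

# Dependent columns: noisy multiples of the nine base columns.
data = np.hstack([base, 2 * base + noise, 3 * base - noise])

corr = np.corrcoef(data, rowvar=False)       # correlation between columns
eigvals = np.linalg.eigvalsh(corr)           # eigenvalues in ascending order
kg_estimate = int(np.sum(eigvals > 1.0))     # Kaiser-Guttman count

print(f"Kaiser-Guttman estimate: {kg_estimate} (planted dimension: 9)")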
Z This file will create a synthetic data set with known Z dimension of 9. We will use the three different estimators Z used in our data set to estimate the intrinsic dimensionality Z of the data set. Z load Exp2 [rows,cols] = size(data); Z Synthetic data A=250*rand(rows,9); noise=0.0001*randn(size(A)); B=[A 2*A+noise 3*A-noise 4*A+2*noise 5*A-2tnoise 6*A(:,1:2)]; Z Estimate mean and variance of validation data. mean1=mean(data); sigma1=std(data); ZEstimate mean and variance of synthetic data. mean2=mean(B); sigma2=std(B); Z for i=1:rows for j=1:cols data2(i,j)=(((B(i,j)-mean2(j))/sigma2(j))*sigma1(j))+mean1(j); end end Z Normalize the data with norm 2 for covariance matrix for i=1:cols normalization(i)=norm(data2(:,i)); end normalizeddata =data2./(ones(rows,1)*normalization); 218 Z End Normalization Z Z Principal Component Analysis using the covariance matrix Z explainedcov contains the variance explained by each eigenvalue covdata=cov(normalizeddata); Z Compute the covariance matrix of the data [pccov,latentcov,exp1ainedcov]=pcacov(covdata); Z Z percentage=95; i=1; sum1=0; while sum1 < percentage; sum1=sum(explainedcov(1:i)); i=i+1; end numComponentsCov=i-1; disp(sprintf(’Retain Zd components from covariance matrix.’, numComponentsCov)); disp(’(95Z variance retained)’); ./.*****************Shh!************************** Z Plot the eigenvalues of the correlation matrix Z Kaiser-Guttman method eig > 1 C=normalizeddata; correlacion= corrcoef(C); Autovalores=eig(correlacion); Autovalores=flipud(Autovalores); kg=1; for i=1:cols if (Autovalores(i) > 1) kg=i; end end disp(sprintf(’There are Zd eigenvalues larger than one.’,kg)); w=20; Z How many eigenvalues to plot if (w < cols) plot(Autovalores(1:w),’b-’); hold on linea=ones(1,20); plot(Autovalores(1:w),’ro’); plot(1inea) ylabel(’Value’); title(’Correlation Matrix Eigenvalues’); grid hold off Z else 219 disp(’You have requested too many eigenvalues’) end 220 APPENDIX J Perl Script files .I.1 This script generates a summary file with all metrics. It determines how many lines of Script A: Generating Summary of Metrics metrics to read according to the time when the application finished running. #!/usr/bin/perl -w $file_timetrack="timetrack"; $file_time="time_out"; $file_sar="sar-out"; $file_iostat="iostat_out"; $file_vmstat="vmstat_out"; $file_mpstat="mpstat_out"; $file_summary="summary_output"; for ($k=1; $k<=123; $k++){ $directory = "E".$k; #$directory = "Etest"; print ’Directory ’.$directory."\n"; chdir($directory); # Get the name of makefile since the names are all # different (makefilei to makefile13) $file_make=‘ls I grep makef‘; chop $file_make; # Get the total number of unknowns to solve from the # file descr wich was created by prism $Number_unknowns=‘grep Total descr I grep unk‘; chop $Number_unknowns; # OPENING OUTPUT FILE open(FILE_OUT,">$file_summary“) or die "Cannot write to file \n"; print FILE_OUT ’Directory ’.$directory."\n"; print FILE_OUT "Sampling every 20 seconds, total of 474 iterations\n"; 221 print FILE_OUT $Number_unknowns."\n"; # READING MAKEFILE open(FILE_2,"<$fi1e_make") or die "Cannot read file \n"; # Find line with compiler options while (){ if($_ =' /‘\s*$/) {next;}; # Remove blank lines chomp; # Remove newline character @space_sep = split; # Split line at blank space if ($space_sep[0] eq "FOPT=") { } shift(@space_sep); # Eliminate Ist element of space_sep array print FILE_OUT "Compiler Options: "; foreach $0ption (@space_sep) { #Print all compiler options print FILE_OUT $option." 
"; I print FILE_OUT "\n"; A close(FILE_2); # READING TIMETRACK Open(FILE_3,"<$file_timetrack") or die "Cannot read file \n"; # Find line with ’real’ string and get the elapsed time while (){ if($_ =' /“\s*$/) {next;}; # Remove blank lines chomp; # Remove newline character @space_sep = split; # Split line at blank space if ($space_sep[O] eq "Start"){ $a=; chomp ($a); @time_line = Split(/\s+/,$a); # Split line at blank space $start_day = $time_line[0]; # First element is day $start_time = $time_line[3]; # Third element is time print FILE_OUT "Start: ".$start_day." ".$start_time."\n"; } if ($space_sep[0] eq "End"){ $a=; chomp ($a); 0time_line - split(/\8+/,$a); # Split line at blank space $end_day = $time_line[0]; # First element is day $end_time = $time_1ine[3]; # Third element is end time #Qtime_fields = split(/:/,$end_time); # Split at min print FILE_OUT "End: ".$end_day." ".$end_time."\n"; if (Sstart_day ne $end-day){ print FILE_OUT "Not same day\n"; } 222 } close(FILE_3); # READING TIME open(FILE_4,"<$fi1e_time") or die "Cannot read file \n"; # Find line with ’real’ string and get the elapsed time while (){ if($_ =' /“\s*$/) {next;}; # Remove blank lines chomp; # Remove newline character @space_sep = split; # Split line at blank space if ($space_sep[0] eq "real") { $elapsed_time = $space_sep[1]; # 2nd element is elapsed time @min_fields = Split(/:/,$elapsed_time); # Split at min $minutes = $min_fields[0]; # Get minutes $seconds_dec = $min_fields[1]; @sec_fields = split(/\./,$seconds_dec); # Split at secs $seconds = $sec_fields[0]; $total_time_prism = $minutes*60 + $seconds; print FILE_OUT "Prism time in sec: ".$total_time_prism."\n"; } close(FILE_4); # READING SAR $no_metrics = "yes"; # Flag: when we can start reading the metrics $count_metrics = O; # Only 28 of the metrics are relevant. We do not need those measured # by the -v flag. Initialize the cummulative sum to zero. for ($1=0; $l<=28; $l++){ $cum_metrics1[$l]=0; } open(FILE_5,"<$file_sar") or die "Cannot read file \n"; while(){ if($_ =’ /‘\s*$/) {next;}; # Remove blank lines chomp; # Remove newline character @space_sep = split; # Split line at blank space 8 Read the lines with metrics if ($no_metrics eq "no" ){ # We can start reading metrics Qmetrics = Qspace_sep; # Get the number of elements in the array Ometrics $num_of_e1ements = scalar(0metrics); if ($metrics[0] =‘ /:/) { # Match time with ’:’ character $11ne_index = $num_of_elements; # End 100p when the time matches the end sampling time # Remove comment for E123 since there is a change in date 223 # and ends before it can finish. last if ($metrics[0] gt $end_sampling_time);# End while loop for ($m = 1; $m<= $line-index-1; $m++) { $cum_metricsl[$m]=$cum_metrics1[$m] + $metrics[$m]; } $count_metrics = $count_metrics + 1; } elsif ($metrics[0] !~ /\//) { # Do not use line with # slash (/) character $old_line_index = $line_index; $11ne_index = $line_index + $num_of_elements; for ($m = $old_line_index; $m<= $line_index-1; $m++) { $cum_metrics1[$m]=$cum_metrics1[Smfl + $metrics[$m-$old_line_index]; } } } if ($space_sep[0] eq "/bin/sar") { $period = $space_sep[2]; # 2nd element is elapsed time $repetitions = $space_sep[3]; # 2nd element is elapsed time print FILE_OUT "Period: ".$period."\n"; print FILE_OUT "Repetitions: ".$repetitions."\n"; $total_time_sar=$period*$repetitions; print FILE_OUT "Sar time: ".$total_time_sar."\n"; if ($total_time_prism > $total_time_sar){ print FILE_OUT "ERROR taking sar metrics. 
Sar was short in time.\n"; } } if ($space_sep[0] eq "SunOS") { $a=; $a=; chomp ($a); @first_line split(/\s+/,$a); # Split line at blank space $start-time $first_line[0]; # 2nd element is elapsed time print FILE_OUT "Initial time: ".$start_time."\n"; @time_fields = split(/:+/,$start_time);# Split line at : symbol $start_hour = $time_fields[0]; $start_min = $time_fields[1]; $start_sec = $time_fields[2]; 0name_metrics = inrst-line; $a=; chomp ($a); 0name_metrics - (@name_metrics, split(/\s+/,$a)); # Split # line at blank space $a=; chomp ($a); Oname_metrics (@name_metrics, split(/\s+/,$a)); # Split # line at blank space 224 $a=; chomp ($a); @name_metrics = (@name_metrics, split(/\s+/,$a)); # Split # line at blank space $a=; # Remove the next two lines since I am not including # the metrics by -v flag into consideration #chomp ($a); #Qname_metrics = (©name_metrics, split(/\s+/,$a)); # Split # line at blank space $a=; chomp ($a); @name_metrics = (@name_metrics, split(/\s+/,$a)); # Split # line at blank space # # COMPUTING THE END SAMPLING TIME # # Sampling time is the time to take all samples while # prism is running $sampling_time = $tota1_time_prism - ($tota1_time_prist$period); print FILE_OUT "Sampling total time: ".$sampling_time."\n"; $mins_plus = O; $hours_plus = O; # SECONDS $secs = $sampling_timeZ60; $end_secs = $start_sec + $secs; if ($end_secs >= 60){ $end_secs = $end_secs - 60; $mins_plus = 1; } if ($end_secs <= 9){ $end_secs = "O".$end_secs; # MINUTES $mins = ($sampling_time - $secs) / 60; $end_mins = $start_min + $mins + $mins_plus; if ($end_mins >8 60){ $end_mins = Send_mins - 60; $hours_plus = 1; } if ($and-mins <= 9){ $end-mins = "O".$end_mins; } 225 # HOURS $hrs = ($3ampling_time - $secs - $mins¥60) / 60; $end_hour = $start_hour + $hrs + $hours_plus; if ($end_hour >= 24){ $end_hour = $end_hour - 24; } if ($end_hour <= 9){ $end_hour = "O".$end_hour; } # END SAMPLING TIME $end_sampling_time = $end_hour.":".$end_mins.":".$end_secs; print FILE_OUT "End sampling time: ".$end_sampling_time."\n"; $no_metrics = "no"; } } print FILE_OUT "ANALYSIS OF SAR\n"; print FILE_OUT "Count of metrics: ".$count_metrics."\n"; for ($1=O; $1<=28; $1++){ $avg_metrics[$l]=$cum_metrics1[$l]/$count_metrics; } # Remove the first element of the array shift(@name_metrics); shift(@avg_metrics); foreach $name_metric (@name_metrics){print FILE_OUT $name_metric," ";} print FILE_OUT "\n"; foreach $cum_metric (@avg_metrics) { print FILE_OUT $cum_metric, " ";} print FILE_OUT "\n"; close(FILE_5); # READING IOSTAT print FILE_OUT "ANALYSIS OF IOSTAT\n"; $read_metrics = "no"; # Flag, when we can start reading the metrics $lineNumber = O; $tot_lines = $count_metrics-1; # We have 18 metrics in iostat. 
for ($1=0; $l<=17; $l++){ $cum_metrics2[$l]=0; } open(FILE_6,"<$fi1e_iostat") or die "Cannot read file \n"; while(){ if($- =" /‘\s*$/) {next;}; # Remove blank lines chomp; # Remove newline character Ospace-sep = split; # Split line at blank space # READ METRICS DESCRIPTION if ($space_sep[0] eq "tty" && $read_metrics eq "no"){ # Next 226 # few lines contain metrics print FILE_OUT $_."\n"; # Print first line of description $a = ; chomp($a); print FILE_OUT $a."\n"; # Print second line of description $a = ; # Discard lst line of measurements # See description of IOSTAT command $read_metrics = "yes"; # May read metrics } # Here we can read the metrics if (($read_metrics eq "yes") && ($space_sep[0] ne "tty") && ($space_sep[0] ne "tin") ) { @metrics = @space_sep; # Get the number of elements in the array @metrics $num_of_elements = scalar(@metrics); last if ($lineNumber >= $tot_lines); # End main while loop for (Sm = O; $m<= $num_of_e1ements-1; $m++) { $cum_metrics2[$m]=$cum_metrics2[$m] + $metrics[$m]; } $lineNumber = $lineNumber + 1; } for ($l=0; $l<=$num_of_elements-1; $l++){ $avg_iostat[$1]=($cum_metrics2[$1]/$tot_lines); } foreach $cm (@avg_iostat) { print FILE_OUT $cm, " ";} print FILE_OUT "\n"; close(FILE_6); # READING VMSTAT print FILE_OUT "ANALYSIS OF VMSTAT\n"; $read_metrics = "no"; # Flag to know when we can start # reading the metrics $lineNumber = 0; # We have 22 metrics in iostat. for ($1=O; $1<=21; $1++){ $cum_metrics3[$l]=0; } open(FILE_7,"<$file_vmstat") or die "Cannot read file \n"; whi1e(){ if($_ =" /‘\s*$/) {next;}; # Remove blank lines chomp; # Remove newline character Ospace_sep = split; # Split line at blank space # READ METRICS DESCRIPTION if ($space_sep[O] eq "procs" && $read_metrics eq "no"){ # Next few lines contain metrics 227 print FILE_OUT $_."\n"; # Print first line of description $a = ; # Discard lst line of measurements # See description of VMSTAT command $read_metrics = "yes"; # Now we may read metrics } # Here we can read the metrics if (($read_metrics eq "yes") && ($space_sep[0] ne "procs") && ($space_sep[0] ne "r") ) { @metrics = @space_sep; # Get the number of elements in the array @metrics $num_of_elements = scalar(@metrics); last if ($lineNumber >= $tot_lines); # End main while loop for ($m = 0; $m<= $num_of_e1ements-1; $m++) { $cum_metrics3[$m]=$cum_metrics3[$mfl + $metrics[$m]; } $lineNumber = $lineNumber + 1; } for ($l=0; $l<=$num_of_elements-1; $1++){ $avg_vmstat[$l]=($cum_metric33[311/$tot_lines); } foreach $cm (@avg_vmstat) { print FILE_OUT $cm, " ";} print FILE_OUT "\n"; close(FILE_7); # READING MPSTAT print FILE_OUT "ANALYSIS OF MPSTAT\n"; $read_metrics = "no"; # Flag to know when we can start reading # the metrics $1ineNumber = 0; # We have 16 metrics in mpstat. for ($l=o; $1<=15; $1++){ $cum_metrics4_0[$l]=0; $cum_metrics4_1[$l]=0; $cum_metrics4_2[$l]=0; $cum_metrics4_3[$l]=0; } open(FILE_8,"<$file_mpstat") or die "Cannot read file \n"; while(){ if($- =' /“\s*$/) {next;}; # Remove blank lines chomp; # Remove newline character @space_sep = split; # Split line at blank space # READ METRICS DESCRIPTION 228 if ($space_sep[0] eq "CPU" && $read_metrics eq "no"){ # Next few lines contain metrics print FILE_OUT $_."\n"; # Print first line of description $a = ; # Discard 1st four lines of measurements $a = ; # See description of VMSTAT command $a = ; $a = ; $read_metrics = "yes"; # Now we may read metrics } # Here we can read the metrics. 
There are four lines, one # per processor if (($read_metrics eq "yes") && ($space_sep[0] ne "CPU")) { last if ($lineNumber >= $tot_lines); # End main while loop @metrics = @space_sep; # Get the number of elements in the array @metrics $num_of_elements = scalar(@metrics); $cpu_id = $metrics[0]; if ($cpu_id == 0) { for ($m = O; $m<= $num-of_elements-1; $m++) { $cum_metrics4_0[$m]=$cum_metrics4_0[$m] + $metrics[$m]; } } elsif ($cpu_id == 1) { for ($m = 0; $m<= $num_of_e1ements-1; $m++) $cum_metrics4_1[$m]=$cum_metrics4_1[$m] rH 4. $metrics[$m]; } } elsif ($cpu_id == 2) { for ($m = O; $m<= $num_of_elements-1; $m++) { $cum_metrics4_2[$m]=$cum_metrics4_2[$m] + $metrics[$m]; } } elsif ($cpu_id == 3) { for ($m = 0; $m<= $num_of_elements-1; $m++) $cum_metrics4_3[$m]=$cum_metrics4-3[$m] rH + $metrics[$m]; } $lineNumber = $lineNumber + 1; } else { die "Cannot identify cpu id \n"; } } for ($l=0; $l<=$num_of_elements-1; $1++){ $avg_mpstat0[$1]=($cum-metrics4_0[$1]/$tot_lines); $avg_mpstat1[$l]=($cum_metrics4_1[$l]/$tot_1ines); $avg_mpstat2[$1]=($cum-metrics4_2[$1]/$tot-lines); $avg_mpstat3[$1]=($cumbmetrics4_3[$1]/$tot_lines); } foreach $cm (Qavg_mpstat0) { print FILE_OUT $cm, " "; } print FILE_OUT "\n"; 229 foreach 3cm (Qavg_mpstat1) { print FILE-OUT $cm, " "; } print FILE_OUT "\n"; foreach $cm (@avg_mpstat2) { print FILE_OUT 3cm, " "; } print FILE_OUT "\n”; foreach $Cm (@avg_mpstat3) { print FILE_OUT $cm, " "; } print FILE_OUT "\n"; close(FILE_8); Chdir(".."); J .2 Script B: Create Crontab file Script to generate the crontab file that will automatically run the OS calls to measure perflnmnance. #l/usr/local/bin/perl -w # # Syntax: # create_crontab.p1 [> outputfile] # where: # nfex = Number of First EXperiment # tint = Time INTerval in minutes # inda = INitial DAte in mm/dd # inti = INitial TIme in military format HH:MM # # Command scratchpad $command1 = ’/home/nayda/private/Fresh/Testing/torunprism’; $command2 = ’/bin/iostat -cht 20’; $command3 = ’/bin/vmstat 20’; $command4 = ’/bin/mpstat 20’; $command5 = ’/bin/sar -bgpuvw 20’; # Input filename $infile = ’ord_exp’; $noexp = 13; # Number of experiments to process # Splits the initial time and date information Qinti = split(":",$ARGV[3]); Oinda = split("/",$ARGV[2]); Ocurtida = ($inti[1],$inti[0].$inda[1].$inda[O]); # First othertime is for compiler Options 1 and 3 $othertime1 = ($ARGV[1] - 8 + 2 ) * 60 / 20; 230 #Sothertime1 = ($ARGV[1] - 10 + 2 ) * 6O / 20; # Cambiamos el 10 por 8 para agilizar los experimentos. 
# Second othertime is for all other compiler options $othertime2 = ($ARGV[1] - 22 + 2 ) * 6O / 20; # Searches for the number of the first experiment in input file $1ncnt = 1; open(ORDEXP, "$infile") ll die "Could not find Sinfile"; while () { if ($lncnt < $ARGV[O]) { # Advanced in input file until line given by nfex $lncnt += 1; } elsif ($noexp) { # Generates the command lines for the next 13 lines @expdata = split; if ($expdata[2] == 1){ $othertime = $othertime1; } elsif ($expdata[2] == 3) { $othertime = $othertime1; } else { $othertime = $othertime2; } # The first command print "$curtida[0] $curtida[1] $curtida[2] $curtida[3] # $command1$expdata[2]\n"; # Advances time 1 second and generate the other 4 commands @curtida = &get_new_time($curtida[0],$curtida[1],$curtida[2], $curtida[3],1); print "$curtida[0] $curtida[1] $curtida[2] $curtida[3] * $command2 $othertime\n"; print "$curtida[0] $curtida[1] $curtida[2] $curtida[3] * $command3 $othertime\n"; print "$curtida[0] $curtida[1] $curtida[2] $curtida[3] * $command4 $othertime\n"; print "$curtida[0] $curtida[1] $curtida[2] $curtida[3] * $command5 $othertime\n"; # Update the time counters for next experiment Ocurtida 8 kget_new_time($curtida[0],$curtida[1],$curtida[2], $curtida[3],$ARGV[1]); $noexp -= 1; } else { 231 } # Terminates execution last; # A subroutine to compute the time and date sub get_new_time{ $minutes = $-[0]; $hours = $_[1]; $days = $-[2]; $months = $-[3]; $minutes += $-[4]; if ($minutes > 59) { $minutes -= 60; $hours += 1; if ($hours > 23) { $hours -= 24; $days += 1; if (($days > 31) && (($months == 1) ll ($months == 3) II ($months == 5) ll ($months == 7) ll ($months == 8) ll ($months ==10) ll ($months == 12))) { $days -= 31; $months += 1; } elsif (($days > 30) && (($months == 4) || ($months == 6) ll ($months == 9) ll ($months == 11))) { $days -= 30; $months += 1; } elsif (($days > 28) && ($months == 2)) { $days -= 28; $months += 1; } if ($months > 12) { $months -= 12; } } return ($minutes,$hours,$days,$months); 232 J .3 Script C: Convert data to minitab 13 format Script to generate the input file compatible with minitab 13 to analyze the data from each summary file in each directory containing the data from one experimental run. #!/usr/bin/per1 -w # # Creates file with results for Minitab # Syntax: # create_minitab.pl # # Input/Output filenames $infile1 = ’ord_exp’; $infi1e2 = ’metrics_names’; $infi1e3 ’summary_output’; $outfi1e ’experiment-outcome.txt’; # LABELS for METRICS $labelmetric1 = ’SAR’; $1abelmetric2 = ’IOSTAT’; $labelmetric3 = ’VMSTAT’; $1abe1metric4 = ’MPSTAT’; #$noexp = 234; # Number of experiments to process # Open files to process initial data open(ORDEXP, "<3infile1") ll die "Could not find $infile1\n"; open(NAMES, "<$infile2") ll die "Could not find $infile2\n"; open(OUTFILE, ">$outfile") ll die "Could not find $outfile\n"; # Get names of metrics from input file while(){ # Do if eof not reached if($_ =" /‘\s*$/) {next;}; # Remove blank lines chomp; # Remove newline character @space~sep = split; # Split line at blank space # Care with more than one trailing blank Spacesl! 
if ($space_sep[0] eq $1abelmetric1) { $names1 = ; chomp($names1); @namel = Split(/\s+/,$namesl); } elsif ($space_sep[O] eq $labe1metric2) { $names2 s ; chomp($names2); Oname2 = split(/\s+/,$names2); } elsif ($space_sep[O] eq $1abelmetric3) { $namesB - ; 233 chomp($names3); @name3 = split(/\S+/.$nameS3); } elsif ($space_sep[0] eq $labelmetric4) { $names4 = ; chomp($names4); @name4 = split(/\s+/.$names4); $names$ = ; chomp($name85); @nameS = split(/\8+/,$nam985); $name86 = ; chomp($name36); @name6 = split(/\s+/,$nam885); $names7 = ; chomp($names7); @name? = split(/\s+/,$names7); } else { die "Should not read this line\n"; } close(NAMES); # Create a long line with all the metrics @Metric_Names=(@name1, @name2, @name3, @name4, @nameS, @name6, @name7); @0rder_of_exp = ; close(ORDEXP); print OUTFILE "Size\tAlgorithm\tCompilerOption\tUnknowns\tPrismTime \tCountOfMetrics"; foreach (@Metric_Names) { print OUTFILE "\t"; print OUTFILE; } print OUTFILE "\n"; # Searches for the number of the first experiment in input file for ($k=1; $k<=234; $k++){ $directory = "E".$k; # $directory = "Etest"; print ’Directory ’.$directory."\n"; chdir($directory); open(SUMMARY, “<$infile3") ll die "Could not find $infile3\n"; while ((SUMMARY>) { if($_ a“ /‘\s*$/) {next;}; # Remove blank lines chomp; # Remove newline character Ospace_sep = split; # Split line at blank space 234 if ($space_sep[0] eq "Total“) { $Unk = $8pace_sep[3]; # Get number of unknowns } elsif ($space_sep[0] eq "Prism") { $prism_time = $space_sep[4]; } elsif ($space_sep[0] eq "Count") { $count_metrics = $space_sep[3]; } elsif ($space_sep[0] eq "bread/s") { $m_sar = (SUMMARY); chomp($m_sar); @metrics_sar = split(/\s+/.$m-sar); } elsif ($space_sep[0] eq "tin") { $m_iostat = (SUMMARY); chomp($m_iostat); @metrics_iostat = split(/\s+/.$m_iostat); } elsif ($space_sep[0] eq "procs") { $m_vmstat = ; $m_vmstat = ; chomp($m-vmstat); @metrics_vmstat = split(/\S+/,$m_vmstat); } elsif ($space_sep[0] eq "CPU") { $m_mpstat = (SUMMARY); chomp($m_mpstat); @metrics_mpstat0 = split(/\S+/,$m_mpstat); shift(@metrics_mpstat0); $m_mpstat = ; chomp($m_mpstat); @metrics_mpstat1 = split(/\s+/,$m_mpstat); shift(@metrics_mpstat1); $m_mpstat = ; chomp($m_mpstat); @metrics-mpstat2 = split(/\s+/.$m_mpstat); shift(©metrics_mpstat2); $m_mpstat = ; chomp($m_mpstat); @metrics_mpstat3 = split(/\S+/,$m-mp8tat); shift(@metrics_mpstat3); Ometrics_mpstat = (Qmetrics_mpstat0,0metrics_mpstat1, @metrics_mpstat2,0metrics_mpstat3); } chdir(".."); # Get the description of experiment from file ord_exp line k $1ine_ord=$0rder_of_exp[$k-1]; Gord_elem = split(/\s+/.$1ine_ord); shift(Qord_elem); 0tot-metrics = (Oord_elem,$Unk,$prism_time,$count_metrics, Ometrics_sar,Qmetrics_iostat,0metrics-vmstat,Ometrics_mpstat); 235 $first = "yes"; foreach (@tot_metrics) { if ($first eq "yes") { print OUTFILE; $first = "no"; } else { print OUTFILE "\t"; print OUTFILE; } } print OUTFILE "\n"; J .4 Script D: Convert data to SAS format Script to generate the input file compatible with SAS from the file to be used by Minitab 13. #!/usr/bin/perl -w # # Creates file with results for Minitab # Syntax: # create_2filesSAS.p1 # where: # # By Nayda G. 
J.4 Script D: Convert data to SAS format

Script to generate the input files compatible with SAS from the file used by Minitab 13.

#!/usr/bin/perl -w
#
# Creates files with results for SAS
# Syntax:
#    create_2filesSAS.pl
#
# By Nayda G. Santiago
# Created:  07/26/2001
# Modified: 03/25/2002

# Input/Output filenames
$infile1  = 'exp5MinitabNoZeros.txt';
$outfile1 = 'outcome1SASNoZeros.txt';
$outfile2 = 'outcome2SASNoZeros.txt';

# Open files to process initial data
open(INFILE, "<$infile1") || die "Could not find $infile1\n";
open(OUTFILE1, ">$outfile1") || die "Could not find $outfile1\n";
open(OUTFILE2, ">$outfile2") || die "Could not find $outfile2\n";

# Get first line with the names of metrics from input file
$metric_names = <INFILE>;
chomp($metric_names);
@Indnames = split(/\s+/,$metric_names);

# Get the number of columns in the array, i.e., the number of metrics
$numberOfMetrics = $#Indnames+1;
if (($numberOfMetrics % 2) == 0) {      # Even number of metrics
   $Limit = ($numberOfMetrics/2)+2;     # Add 3 since the first 6 columns are common.
                                        # Otherwise the 2nd file will be longer by 6
} else {                                # Odd number of metrics
   $Limit = (($numberOfMetrics-1)/2)+2;
}
@metricsNames1 = @Indnames[0..$Limit];
@metricsNames2 = (@Indnames[0..4], @Indnames[$Limit+1..$#Indnames]);

$first = "yes";      # This variable is used to prevent the first element
                     # from being a blank space.
foreach (@metricsNames1) {
   if ($first eq "yes") {
      print OUTFILE1;
      $first = "no";
   } else {
      print OUTFILE1 " ";     # Uses blank space as separator
      print OUTFILE1;         # Prints each element of @metricsNames1
   }
}
print OUTFILE1 "\n";

$first = "yes";      # This variable is used to prevent the first element
                     # from being a blank space.
foreach (@metricsNames2) {
   if ($first eq "yes") {
      print OUTFILE2;
      $first = "no";
   } else {
      print OUTFILE2 " ";     # Uses blank space as separator
      print OUTFILE2;         # Prints each element of @metricsNames2
   }
}
print OUTFILE2 "\n";

# Get data and change tabs to blank spaces
while (<INFILE>) {                     # Do if eof not reached
   if ($_ =~ /^\s*$/) {next;}          # Remove blank lines
   chomp;                              # Remove newline character
   @data_line = split;                 # Split line at blank space
                                       # Care with more than one trailing blank space!!
   @data1 = @data_line[0..$Limit];
   @data2 = (@data_line[0..5],@data_line[$Limit+1..$#data_line]);

   $first = "yes";   # This variable is used to prevent the first
                     # element from being a blank space.
   foreach (@data1) {
      if ($first eq "yes") {
         print OUTFILE1;
         $first = "no";
      } else {
         print OUTFILE1 " ";    # Uses blank space as separator
         print OUTFILE1;        # Prints each element of @data1
      }
   }
   print OUTFILE1 "\n";

   $first = "yes";   # This variable is used to prevent the first
                     # element from being a blank space.
   foreach (@data2) {
      if ($first eq "yes") {
         print OUTFILE2;
         $first = "no";
      } else {
         print OUTFILE2 " ";    # Uses blank space as separator
         print OUTFILE2;        # Prints each element of @data2
      }
   }
   print OUTFILE2 "\n";
}   # End of while
close(INFILE);
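The two SAS files share the leading experiment-descriptor columns and split the remaining metric columns at $Limit. As an illustrative check, not part of the script itself, the column counts of the original Minitab file and of the two SAS files can be compared as follows; the file names are the ones assigned above.

#!/usr/bin/perl -w
# Illustrative only: report how many space-separated columns appear in the
# first line of the Minitab input file and of the two SAS output files.
foreach $file ('exp5MinitabNoZeros.txt',
               'outcome1SASNoZeros.txt',
               'outcome2SASNoZeros.txt') {
   open(FH, "<$file") || die "Could not find $file\n";
   $firstline = <FH>;
   chomp($firstline);
   @cols = split(/\s+/,$firstline);
   print "$file: ", scalar(@cols), " columns in the first line\n";
   close(FH);
}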