This is to certify that the dissertation entitled

NEURAL NETWORK APPLICATIONS IN NEW PRODUCT DEVELOPMENT RESEARCH

presented by

Robert Jeffrey Thieme

has been accepted towards fulfillment of the requirements for the Ph.D. degree in Marketing.

April 14, 1998

NEURAL NETWORK APPLICATIONS IN NEW PRODUCT DEVELOPMENT RESEARCH

By

Robert Jeffrey Thieme

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Marketing and Supply Chain Management Department

1998

ABSTRACT

NEURAL NETWORK APPLICATIONS IN NEW PRODUCT DEVELOPMENT RESEARCH

By

Robert Jeffrey Thieme

Two types of artificial neural network decision support systems are created, trained, and evaluated to serve as guides for managers making complex new product development (NPD) go/no-go and resource allocation decisions in the NPD process. The first is based on a multilayer feedforward network (MFN) structure and is referred to as the multilayer feedforward network decision support system (MFN-DSS).
In this system, ten independently trained networks are combined to predict overall success or failure, the degree of financial success, and the degree of technical success for new product projects. Each network structure has forty-three variable inputs, five hidden nodes, and one output. Network output corresponds to one of three aspects of NPD success. A piecewise cross-validation procedure is used to determine the appropriate network structure. The second neural network architecture is a probability neural network (PNN) and provides output probabilities of success and failure that approach Bayesian optimal. Data to train and test the networks are based on historical data from 612 NPD projects. After training with data from 400 projects, the MFN-DSS correctly classifies 96.7% of 212 additional projects and the PNN correctly classifies 92.9% of 212 additional projects. These results are compared with k-nearest neighbor, logistic regression, ordinary least squares (OLS) regression, and discriminant analysis methods and found to be superior. Finally, usage of the decision support systems as an aid to managers in new product project selection is discussed and demonstrated using an actual case study.

Copyright by
ROBERT JEFFREY THIEME
1998

ACKNOWLEDGMENTS

I am very appreciative of the support of many people who had confidence in me and generously gave their time and effort. I begin by thanking my parents and wife for their support throughout the whole process. They supported me when I quit a good-paying job with opportunities for advancement in favor of a four- to five-year journey through a doctoral program. I'm not sure any of us fully understood what I was doing, but their confidence in my decisions gave me confidence that I could succeed. Of particular note was a period in the beginning of my second year when I changed majors from Operations Management to Marketing.
This was probably the most difficult decision to make during the program because it was based more on intuition and “gut feel” than on data and quantitative analysis. Even in this tough time when I was unsure about my decision, the support and confidence from my family helped me focus on making it work. I would like to thank Dr. Everett Adam, Dr. Ronald Ebert, and Dr. Lori Franz at the University of Missouri-Columbia for providing the spark that led to my desire to become a professor. I am indebted to them for inspiring me to pursue this wonderful, challenging, and fulfilling profession. I would also like to thank my dissertation co-chairs, Dr. X. Michael Song and Dr. Roger Calantone. In my opinion, choosing a committee and a chair for a dissertation is the most important step in the process. I was advised by various people that choosing co-chairs would probably infuse conflict and could be a mistake. I spent a lot of time on this decision and have no regrets. Dr. Song and Dr. Calantone played different, yet important roles in my program. They both provided guidance, direction, support, and encouragement during each stage of my dissertation and in my job search. I would also like to thank the professors who took their doctoral seminars seriously even when some of my fellow doctoral students chose not to. In retrospect, I could have spent less time working on assigned readings and course projects in many of my doctoral seminars and still made it through the program; but I have no regrets in this regard either. Because of the hard work and dedication of Dr. Stanley Hollander, Dr. Cornelia Droge, Dr. John Hollenbeck, Dr. Ram Narasimhan, Dr. Glenn Omura, Dr. Tom Page, Dr. Lloyd Rinehart, and Dr. Calantone, I feel that I have a well-rounded doctoral education that will pay dividends in the future. It is a shame that some of my fellow students chose to do the minimum expected in these classes, because they have missed out on a great opportunity. Dr.
Droge deserves a special thanks for her willingness to discuss many of the practical issues involved in being a professor. She spends a lot of time working with all incoming doctoral students trying to prepare them for what is ahead. She has always been willing to share personal experiences to make a point, even when the experiences are not necessarily flattering. I learned many things from her that paid off enormously as I progressed through my dissertation and job search. Furthermore, I don’t expect that the payoffs will stop any time soon. I imagine that her words of wisdom will echo in my mind throughout my career - sometimes joyfully and sometimes hauntingly. I would also like to thank many of my fellow doctoral students. Without the guidance and advice from those who were a step or two ahead of me, namely, Jeff Schmidt, Steve Clinton, and Matt Myers, things would have been much more difficult. The insight from their experience was extremely helpful. In addition, Tom Goldsby and Jim Eckert have made the program interesting and enjoyable. While we didn’t always see eye to eye in our seminars and the basketball wasn't fit for prime time, their friendship was much enjoyed and I wish them well in the future. I would also like to thank the staff in the Marketing and Supply Chain Management Department, namely, Cheryl Lundeen, Renee Dixon, Laurie Fitch, Cindy Seagraves, Kathy Mullins, and Marilyn Brookes. Their hard work and dedication contributes to the success of the department and sometimes their effort is overlooked. Finally, I would like to thank my wife, Kelly. She has been extremely supportive throughout the program. She worked long hours teaching first graders to read; took graduate classes in the evenings; made do without new clothes, vacations, bedroom furniture, etc.; used weekends as “catch-up” time instead of fun time; and spent many Friday and Saturday evenings relaxing at home with a video instead of going out.
I cannot express my appreciation for her patience and support. I can't wait to enjoy the fruits of our labor.

TABLE OF CONTENTS

LIST OF TABLES .......................................... ix
LIST OF FIGURES .......................................... x

CHAPTER 1
INTRODUCTION .............................................. 1
  Artificial Neural Networks .............................. 4
  Neurobiological Roots ................................... 7
  Focus and Contribution of This Study .................... 9
  Organization of the Manuscript ......................... 12

CHAPTER 2
NETWORK STRUCTURES AND TRAINING PROCESSES ................ 15
  An Historical Look at Neural Network Research .......... 15
  Neural Network Notation ................................ 19
  Neural Networks Defined ................................ 20
    Processing Units ..................................... 21
    State of Activation .................................. 21
    Output Function ...................................... 22
    Pattern of Connectivity .............................. 22
    Propagation Rule ..................................... 23
    Activation Function .................................. 25
    Learning Algorithm ................................... 26
    Learning Paradigm .................................... 26
  A Learning Process Hierarchy for Neural Networks ....... 27
    Supervised Learning Paradigm ......................... 28
    Unsupervised Learning Paradigm ....................... 29
    Hebbian Learning Algorithm ........................... 30
    Competitive Learning Algorithms ...................... 31
    Error-Correction Learning Algorithms ................. 32
  Discussion ............................................. 34

CHAPTER 3
STATISTICS, GENERALIZABILITY, AND A RESEARCH PROPOSAL .... 37
  Neural Networks and Statistics ......................... 37
    Advantages and Disadvantages of Neural Networks ...... 41
  Generalization Strategies for MFNs ..................... 43
  Proposed Applications in New Product Project Selection . 53
  Potential Contributions of This Study .................. 56

CHAPTER 4
DATA AND ANALYSIS ........................................ 57
  Data ................................................... 57
    Network Inputs ....................................... 57
      Firm Skills and Resources .......................... 58
      Newness to Market and Newness to the Company ....... 59
      Product Market Characteristics ..................... 60
    Network Outputs ...................................... 60
  Analysis ............................................... 61
    Methodology for Creating, Training, and Evaluating an MFN-DSS ... 63
      Stage I: Determine the neural network structure appropriate for good generalization ... 63
      Stage II: Train each network in the MFN-DSS with training data using the Levenberg-Marquardt second-order method ... 64
      Stage III: Evaluate the performance of the MFN-DSS with evaluation data ... 65
    Methodology for Creating, Training, and Evaluating a PNN ... 68

CHAPTER 5
RESULTS AND DISCUSSION ................................... 69
  Results ................................................ 69
  Discussion ............................................. 71
  Case Study ............................................. 74
  Future Directions ...................................... 76
  Limitations ............................................ 77
    Limitations of This Study ............................ 77
    Limitations of Neural Networks in General ............ 78

APPENDICES
  APPENDIX A: Tables and Figures ......................... 81
  APPENDIX B: Neural Network Demonstrations .............. 97
  APPENDIX C: Neural Network Methodology Details ........ 107

LIST OF REFERENCES ...................................... 139

LIST OF TABLES

Table 1: Neural Network Variants of Statistical Methods .... 81
Table 2: Measurement Items ................................. 82
Table 3: Summary of Cross-validation Procedures for MFNs ... 84
Table 4: Summary of Predictive Performance for All Models .. 84
Table 5: Success/Failure Prediction Performance ............ 85
Table 6: Success/Inconclusive/Failure Prediction Performance ... 86
Table 7: MFN-DSS Performance on Case Study Data ............ 86
Table 8: Application of Decision Rules to Case Study Data .. 87

LIST OF FIGURES

Figure 1: Basic Neuron Components .......................... 89
Figure 2: Desired Response for the XOR Logic Function ...... 89
Figure 3: Processing Unit .................................. 90
Figure 4: Example of a Feedforward Artificial Neural Network ... 91
Figure 5: Example of a Recurrent Artificial Neural Network . 91
Figure 6: Common Activation Functions ...................... 92
Figure 7: The Learning Process Hierarchy ................... 92
Figure 8: Function Approximation Under-fitting and Over-fitting ... 93
Figure 9: Training and Validation Errors ................... 93
Figure 10: Probability Neural Network (PNN) Structure ...... 94
Figure 11: Effect of Changes in Sigma in PNNs .............. 95
Figure 12: Kohonen Two-Dimensional SOFM .................... 96
Figure 13: Cascade Correlation Structure ................... 96
CHAPTER 1

INTRODUCTION

In order to be successful in a competitive environment, firms must create and sustain a competitive advantage (Porter 1985). The ability to develop and launch new products successfully has been identified as a major determinant in sustaining a firm's competitive advantage (Clark and Fujimoto 1991; Day and Wensley 1988) and increasing the value of the firm (Day and Fahey 1988). Firms that are successful in developing unique new products that provide superior benefits are more likely to increase market share and profits. Managers charged with project selection decisions in the new product development (NPD) process, such as go/no-go choices and specific resource allocation decisions, are faced with a complicated problem. On one hand, they have to deal with limited resources; yet on the other, they have an overwhelming number of new product ideas. Most marketing and R&D departments generate plenty of ideas that appear to have great potential. Their major problems are identifying which of the ideas are worth pursuing and determining which of the firm’s scarce resources should be allocated to each project. Product champions often push their product ideas and request the firm’s best and brightest R&D people for their team. Most projects appear to be very promising at the time of proposal, or they wouldn't have garnered support and been proposed. Despite this long list of seemingly high potential product ideas for management to choose from and pledged support from product champions, we still see a relatively stagnant success rate near 59% for new products in the marketplace (Griffin 1997; Wind and Mahajan 1997). Why so many failures? Why can’t management “weed out” these failures and concentrate the firm’s resources on the successes? Among the reasons are the following. First, the NPD process is extremely complex and innovation demands participation from many different functional areas.
Each functional area has its own goals, languages, procedures, and thought worlds (Griffin and Hauser 1992). This inevitably causes miscommunications, misunderstandings, and ultimately, missed opportunities. Second, project screening is difficult. Any attempt to predict the future is a risky venture, but with new products this is especially so. The product is not well defined; it is still simply an idea. Managers must decide on the viability of a product idea given incomplete information regarding the nature of the technology, market, and competition. Third, once a portfolio of product ideas is chosen, management must decide how to allocate specific resources to each project. Given incomplete information, managers are expected to make tough choices concerning project selection and resource allocations. Over the past thirty years NPD research has uncovered many robust relationships between various organizational, competitive, market, and technical factors and successful development. Empirical research began with studies in the late 1960's and throughout the 1970's (e.g. Cooper 1979; Cooper 1979; Myers and Marquis 1969; Rothwell 1974; Rothwell et al. 1974). In the SAPPHO studies (Rothwell 1974; Rothwell et al. 1974) and the NewProd project (Cooper 1979; Cooper 1979), successful and unsuccessful NPD projects were compared. In these early studies, factor analysis and linear discriminant analysis were used to determine which factors were most useful in discriminating between successful and unsuccessful projects. After numerous studies throughout the 1970's and 1980's produced similar lists of factors providing discrimination between success and failure in the NPD process, researchers began to develop conceptual models (Brown and Eisenhardt 1995; Griffin and Hauser 1996; Olson, Walker and Ruekert 1995; Zirger and Maidique 1990) of the NPD process. 
From these and other conceptual models, researchers began testing models or portions of models using multivariate statistical techniques such as regression, MANOVA, path analysis, and most recently structural equation modeling (Calantone, Schmidt and Song 1996; Song and Parry 1997). These conceptual models have been useful to both academicians and practitioners. They provide theoretical frameworks with which the NPD process can be studied and they provide guidance to practitioners in both designing NPD structures and managing the process. These models provide visual descriptions of an assortment of factors that have been found to be covariates of successful projects under a variety of conditions. Unfortunately, most of these models are not directly applicable to the problem of predicting whether or not an ongoing project will ultimately be successful. In order to predict the outcome of a project, a technique must be able to recognize patterns in the project that are consistent with patterns in previously studied projects. Recently, a new tool has been developed which solves this type of pattern recognition problem. Artificial neural networks have been created and used in a variety of applications. Their use in the study of the NPD process may provide a new tool that can be used during the process to predict the likelihood of success. This study will apply the powerful tools of artificial neural networks to the study of success and failure in NPD projects. Artificial neural network decision support systems will be created and trained to recognize patterns in the relationships between NPD success and failure and a variety of antecedent factors. The result will be “artificial NPD brains” which will simulate a manager with extensive experience in both successful and unsuccessful NPD projects. Given this experience, the networks will be able to provide managerial guidance and insights.
For example, they could be used by practitioners early in the NPD process as a screen to select projects with the highest probability of success. Training programs for development personnel could be based on knowledge gained from the trained network. Also, they could be used as a diagnostic tool throughout the development process to highlight areas where additional attention and resources are warranted. In marketing, a few recent applications of artificial neural networks have been seen in the journals. Artificial neural networks outperformed traditional discriminant analysis and logistic regression in an industrial market segmentation problem (Fish, Barnes and Aiken 1995). In another study, artificial neural networks were compared to logistic regression models in a number of classification problems (Kumar, Rao and Soni 1995). This study will be the first to incorporate artificial neural network data analysis into the study of the NPD process.

ARTIFICIAL NEURAL NETWORKS

Artificial neural networks, also known as connectionist models or parallel distributed processing models because of their highly connected parallel structure, represent a relatively new paradigm in computing. Traditional computers process information sequentially according to a prespecified set of instructions. Artificial neural networks process information in parallel without prespecified programming. Artificial neural networks have a wide variety of applications. While an exhaustive list of applications is beyond the scope of this study, the examples which follow serve as an indication of the diversity of possible uses of artificial neural networks. One of the first applications was as a noise filter in telephone lines. Echoes developed as telephone connections became increasingly long and transmission times approached a half second. Artificial neural networks provided an adaptive method of canceling the echo in these long transmissions (Fausett 1994).
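The source does not spell out the cancellation algorithm, but the classic adaptive approach is a least-mean-squares (LMS) filter that learns to predict the echo from the far-end signal and subtracts that prediction from the microphone signal. The sketch below is illustrative only; the tap count, step size, and synthetic signals are assumptions, not details from Fausett (1994).

```python
import numpy as np

def lms_echo_cancel(far_end, mic, taps=8, mu=0.01):
    """Adapt a FIR filter so its output mimics the echo in `mic`,
    then subtract it, leaving the residual (near-end) signal."""
    w = np.zeros(taps)
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = far_end[n - taps:n][::-1]   # most recent far-end samples first
        echo_est = w @ x                # current echo estimate
        err = mic[n] - echo_est         # residual after cancellation
        w += mu * err * x               # LMS weight update
        out[n] = err
    return out

# Synthetic example: the "echo" is a delayed, attenuated copy of the far end.
rng = np.random.default_rng(0)
far = rng.standard_normal(5000)
echo = 0.5 * np.concatenate([np.zeros(3), far[:-3]])
residual = lms_echo_cancel(far, echo)
# After adaptation, the residual power is far below the raw echo power.
```

The same adapt-while-running behavior is what made these filters attractive for long telephone circuits: the echo path is never known in advance, so the weights track it on line.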
Image compression is a common application. An artificial neural network can be trained as both an image encoder and decoder. The first part of the network structure encodes the image to reduce its dimensionality. The last part of the network decodes the compressed image to its original (or near original) signal (Cottrell, Munro and Zipser 1987). A variety of optimization problems have been solved using artificial neural networks. Hopfield and Tank (1985) pioneered work in this area by applying artificial neural networks to the traveling salesman problem. Other artificial neural network applications in this area include the job-sequencing problem and an analog to digital signal converter (Tank and Hopfield 1986). Sejnowski and Rosenberg (1987) developed an artificial neural network to convert English text into verbal speech. Their network, NETtalk, has been tested in a variety of ways. In one case, the network was trained using the 1,000 most often used words from Merriam Webster’s Pocket Dictionary and then tested using 20,012 other words from the dictionary. The network correctly pronounced 77% of the words in the test set. Subsequent variations of this network improve the success rate to 90% (Karayiannis and Venetsanopoulos 1993). Artificial neural networks are especially useful in pattern recognition problems. Typically, a network is trained with representative samples from the domain of interest. Given a new set of data that the network hasn’t seen, the trained network identifies patterns learned from the training samples. One popular pattern recognition application is character recognition, such as handwritten character recognition. The United States Post Office uses artificial neural network technology for handwritten zip code recognition (LeCun et al. 1989). Artificial neural networks also have been used in signature analysis to detect forgeries (Mighell, Wilkinson and Goodman 1989).
Similar to pattern recognition applications, artificial neural networks can be used as classifiers. A network is trained with representative samples from the domain of interest and then used to classify new data into categories learned from the training set. One application in this area is the classification of objects from sonar signals (Gorman and Sejnowski 1988; Gorman and Sejnowski 1988). In this case, a network is trained to detect the difference between the sonar signals from different submerged objects such as a metal cylinder and a cylindrically shaped rock. Artificial neural network classification applications also have been used in the medical field for early diagnosis of heart attacks (Harrison, Marshall and Kennedy 1991) and diagnosis of the causes of lower back pain (Bounds, Lloyd and Mathew 1990; Bounds et al. 1988). In weather forecasting, artificial neural networks are trained to predict certain events based on a variety of conditions that precede the event. For example, artificial neural networks have been used to predict solar flares and lightning strikes. A network for forecasting solar flares was created that performed at least as well as a rule-based expert system (Bradshaw, Fozzard and Ceci 1989). In another example, researchers used wind, electric field, and wind divergence as inputs to an artificial neural network to predict lightning strikes at the Kennedy Space Center (Frankel et al. 1991). Artificial neural networks are beginning to be used in business applications. Their first business applications were in financial analysis. Dutta and Shekhar (1988) began by comparing an artificial neural network model to regression in prediction of bond ratings. Utans and Moody (1995) used an artificial neural network to predict corporate bond ratings using ten financial ratios that reflect fundamental operating characteristics of firms.
Artificial neural networks have been used in many other financial forecasting applications, including modeling stock returns, stock markets, gold futures, and foreign exchange markets (see Refenes 1995).

NEUROBIOLOGICAL ROOTS

The structure of artificial neural networks is biologically inspired. That is, they were developed from attempts to model the human brain. These models are based on the neuron (Figure 1). A simple neuron has four parts: (1) dendrites that carry signals into the nucleus; (2) a nucleus that processes incoming information; (3) an axon that carries the signal away from the neuron; and (4) synapses that connect the neuron to other neurons and transmit the signal forward (Hertz, Krogh and Palmer 1991). A neuron combines incoming input signals from other neurons and feeds them into its nucleus. The nucleus evaluates and transforms the input signal into an output signal. During the evaluation, if the input signal does not exceed a certain threshold value, the neuron will not fire. That is, it will not send an output signal to the next neuron. If the threshold value is reached, the neuron fires and the output signal travels through the axon and synapses to eventually become an input signal to other neurons. Essentially, the nucleus of the neuron serves as an on/off switch for its output. The human brain contains approximately ten billion neurons connected through a network of approximately sixty trillion synapses (Shepherd and Koch 1990). A typical neuron interacts with between 1,000 and 10,000 other neurons, but a single neuron could interact with as many as 200,000 neurons (Nelson and Illingworth 1991). Currently, the nature of the working mechanisms in the brain is not known for certain. It is believed that knowledge or intelligence in the brain is represented by the pattern of neuron firings. This pattern of firings is called a neuron’s activation.
A neuron that fires frequently has a high activation; a neuron that seldom fires has a low activation. The human brain can make a decision based on thousands of factors almost immediately. Consider, for example, the process of driving through heavy traffic on a highway. Changing road conditions, environmental conditions (rain, snow, wind, etc.), actions of others (drivers, bikers, pedestrians, etc.), and other factors are inputs that are processed simultaneously to make nearly instantaneous decisions concerning a number of possible actions such as steering, signaling, shifting, braking, and accelerating. Even the fastest traditional computer system processing information serially using a prespecified set of rules and instructions could not provide the reaction time necessary for many of the decisions that are made on a routine basis by the brain. A traditional serial processing computer also would have trouble dealing with incomplete input data. That is, a decision would be difficult in the presence of missing data. On the other hand, the parallel structure of processing in the brain provides a more fault tolerant system. The human brain is constantly making decisions based on incomplete data. The processing power of the brain stems from its highly connected, parallel structure. Artificial neural networks are designed to take advantage of what knowledge there is concerning the abilities and processing mechanisms of the brain. In artificial neural networks information is processed using a parallel as opposed to a serial structure. In addition to providing a powerful platform for computing, the parallel structure of neural networks also results in their robust nature. This is generally referred to as fault tolerance (Knight 1990). Because of their parallel and highly connected structure, biological neural networks tend to perform very well even when parts of the system are destroyed or degraded.
In artificial neural networks this feature manifests in the ability to produce acceptable results even when some inputs are missing.

FOCUS AND CONTRIBUTION OF THIS STUDY

Effective project screening continues to be a priority for both researchers and practitioners (Cooper 1992; Marketing Science Institute 1996). A firm’s NPD performance depends greatly on management’s ability to allocate scarce resources where they will produce the greatest benefit. The ability to screen out likely failures early and concentrate resources on projects identified as having high potential for success is an important asset to any firm. The central issue addressed in this study is developing tools that can predict new product success or failure better than existing methods. This study investigates a new methodology for predicting project success or failure - artificial neural networks. Properly constructed for generalizability, artificial neural networks are particularly useful for modeling underlying patterns in data through a learning process. They are quite useful in pattern recognition problems such as the modeling of the complex NPD process. The artificial neural networks used in this study gain their performance characteristics mainly from two features: their highly connected, parallel structure and their nonlinear hidden layer. The parallelism that is built into the structure of artificial neural networks allows for the consideration of numerous contingencies and dependencies in the relationships between independent and dependent variables. Furthermore, the nonlinearities incorporated into these networks remove the linear constraint normally imposed on many traditional statistical methods such as linear regression and structural equation modeling. Artificial neural network decision support systems are developed for new product development project selection and management.
These systems can be used by new product managers during the development process to provide indications of eventual success or failure of ongoing or proposed new product projects. Networks are created and trained with survey data to recognize patterns in the relationships between NPD performance and a variety of antecedent factors. A primary goal in constructing the systems is to ensure the ability to correctly classify unfamiliar or new projects, not used during training, as successful or unsuccessful by recognizing patterns within the data that are consistent with analogous previous projects. This property is generally referred to as network generalizability. Experienced managers recognize patterns within projects consistent with previously successful and failed projects. Practical knowledge such as this enables them to guide projects to ultimate success or kill projects doomed for failure before valuable resources are wasted. The artificial neural network decision support systems in this study use learning algorithms to approximate this practical experience. In the case of the multilayer feedforward network decision support system (MFN-DSS), each of ten artificial neural networks in the system starts with different initial weight matrices and is trained with data from 400 NPD projects to recognize patterns consistent with success and failure. Once trained, the result is a system of ten “NPD artificial brains.” Using an MFN-DSS is analogous to obtaining the insight, opinions, and advice of ten artificial brains with experience in 400 successful and unsuccessful projects, each offering different points of view based on their different learning approaches. The other type of artificial neural network employed here is the probability neural network (PNN). The PNN uses the training data to construct a probability distribution function for the output (NPD success) based on the responses indicated by the inputs.
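The PNN idea just described can be sketched in a few lines: each training pattern contributes a Gaussian kernel, and the class probabilities for a new project are normalized Parzen-window density estimates. The function name, toy data, and smoothing value `sigma` below are illustrative assumptions, not values from this study.

```python
import numpy as np

def pnn_predict(X_train, y_train, x_new, sigma=0.5):
    """Parzen-window class posteriors in the spirit of a PNN."""
    probs = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        d2 = np.sum((Xc - x_new) ** 2, axis=1)            # squared distances
        probs[c] = np.mean(np.exp(-d2 / (2 * sigma ** 2)))  # kernel average
    total = sum(probs.values())
    return {c: p / total for c, p in probs.items()}       # normalized posteriors

# Toy data: two well-separated "failure" (0) and "success" (1) clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
posterior = pnn_predict(X, y, np.array([0.95, 1.05]))
# The new point sits in the "success" cluster, so class 1 dominates.
```

As the smoothing parameter grows, the kernels overlap more and the posteriors flatten; choosing it well is the main tuning decision in a PNN (compare Figure 11 on the effect of sigma).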
This study has four main purposes: (a) to investigate whether or not decision support systems based on neural network technology can outperform traditional methods of prediction from a statistical point of view; (b) to compare the performance of neural network methods and traditional methods in the prediction of a dichotomous success variable; (c) to compare the performance of neural network methods and traditional methods in predicting continuous success variables (financial success and technical success); and (d) to demonstrate the usefulness of these neural network decision support systems via case studies. Based on the highly connected, non-linear structure of artificial neural networks and their impressive performance in other applications, it is likely they will provide a superior predictive system for the NPD process. Therefore, the following three research questions guide this study:

RQ#1: Based on statistical criteria, can artificial neural network decision support systems outperform traditional methods of prediction?

RQ#2: When predicting a project's success or failure, can artificial neural network decision support systems outperform traditional methods?

RQ#3: When predicting a project's degree of success or failure, can artificial neural network decision support systems outperform traditional methods?

This study is the first to incorporate artificial neural network data analysis into the study of the NPD process and the first to create, train, and validate decision support systems based on artificial neural networks to aid managers in complex decision making problems. Furthermore, there are only a few studies which provide predictive models of NPD success (e.g. Calantone, Schmidt and Song 1996; Song and Parry 1997). These studies of NPD success make use of structural equation models which are most useful for explanation but can also be used for prediction.
The advantages of using neural networks to create predictive models of NPD success are: (1) no assumptions regarding the underlying statistical properties of the data (e.g. multivariate normality) need to be made when using neural networks, (2) neural networks are very robust even when multicollinearity is prevalent in the data, and (3) no prior specifications of the relationships between the factors of the model need to be made with neural networks.

ORGANIZATION OF THE MANUSCRIPT

Chapter 2 begins with an historical look at research in artificial neural networks. This is followed by a list of notations used throughout the manuscript. Next, artificial neural networks are defined and described in more detail. The chapter ends with the development and explanation of a hierarchy of artificial neural networks. This hierarchy is designed to be broad enough to provide an overview of the most popular neural networks, but due to the scope of this study, does not give a detailed account of every possible architecture. Thus, Chapter 2 provides the background and tools needed to understand the concept of artificial neural networks. More detailed discussions regarding the mathematics behind different learning algorithms are found in the Technical Appendices. In Chapter 3, artificial neural network methods are compared to traditional statistical methods. This is followed by a discussion of issues in developing neural networks with good generalization properties. Next, a proposal for applications of artificial neural network techniques to new product development data to construct prediction tools usable in the management of new products is presented. These techniques are based on the multilayer feedforward network (MFN) architecture and the probability neural network (PNN) architecture.
These predictive tools are "artificial brains" that learn from experience from a large-scale, cross-national study of NPD projects and predict success or failure based on characteristics of projects not used during training. The chapter ends with a discussion of the potential contributions of this study. Chapter 4 begins with details on the data used in this study. Next, the methodologies used to create, train, and evaluate the neural networks used in this study are specified. Chapter 5 presents the results of the evaluations of the trained networks. These results are compared to traditional methods. Next, the results are discussed from both academic and practitioner points of view. A case study is then presented to further amplify the results and usefulness of the decision support systems created in this study. This chapter ends with discussions of future research directions and limitations associated with this study. Appendix A contains all Tables and Figures in this study. Appendix B is a selection of artificial neural network demonstrations as applications to data analysis. These demonstrations offer an intuitive indication of the possible uses of artificial neural networks. They also provide a "hands-on" level of knowledge concerning the learning process in artificial neural networks. In Appendix C, the backpropagation algorithm is derived and discussed. The backpropagation algorithm is the basis for the most popular class of learning algorithms. The derivation is followed by a discussion of variations to the basic backpropagation algorithm and other more advanced training methods. In addition to providing an in-depth discussion of the algorithm (backpropagation) that eventually led to the proliferation of artificial neural network research, this Appendix provides an overview of the most popular and useful algorithms for solving pattern recognition problems. Finally, a brief background on the PNN is given.
CHAPTER 2

NETWORK STRUCTURES AND TRAINING PROCESSES

This chapter serves as an artificial neural network tutorial which provides a basic understanding of the structure and learning environment of artificial neural networks. It begins with an historical background of research on neural networks. A listing of the notation that will be used throughout this study follows. Next, a definition of artificial neural networks is presented and discussed. Finally, a learning environment typology incorporating learning paradigms and learning algorithms is presented.

AN HISTORICAL LOOK AT NEURAL NETWORK RESEARCH

While a complete history of artificial neural network research is beyond the scope of this study, a brief discussion including selected milestones and contributions of influential researchers follows. Artificial neural network research began with the work of McCulloch and Pitts in the 1940's with the development of the McCulloch and Pitts (1943) neuron. Their simple neuron structure was able to compute a number of arithmetic and logical functions. A few years later, in 1949, Donald Hebb wrote his influential book The Organization of Behavior (Hebb 1949). This book contained the first explicit statement of the psychological learning rule for synaptic modification. The "Hebb rule," as it is now called, is the basis for many of the learning algorithms in use today. It was stated as follows:

When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased (Hebb 1949, p. 62).

This description of synaptic change has been extended to artificial neural networks and restated as: If two neurons on either side of a connection are activated simultaneously, then the strength of that connection is increased.
Alternatively, if one neuron on either side of a connection is activated while the neuron on the other side is inactive, the strength of that connection is decreased. In the late 1950's and early 1960's, Frank Rosenblatt (1958) made an important contribution to the artificial neural network field by introducing perceptrons and the perceptron learning rule. Perceptrons are input layers connected by paths with adjustable weights to an output layer of neurons. The perceptron learning rule is one of the original learning algorithms that uses an iterative process to train a neural network and lays the foundation for later work on learning algorithms. Widrow and Hoff (1960) developed a learning algorithm that is similar to the perceptron learning rule, but was the first algorithm to incorporate the prediction error of the network as part of the weight adjustment. The Widrow-Hoff learning rule is also known as the delta rule or the least mean square (LMS) rule. Widrow and Hoff used the delta rule to train networks called ADALINES (ADaptive LInear NEurons) and MADALINES (Multiple ADALINEs). ADALINES and MADALINES are networks with multiple input nodes connected by paths with adjustable weights to single or multiple processing nodes. Among other problems, these networks were used in rotation invariant pattern recognition and control problems such as broom balancing and backing up a truck (Fausett 1994). Widrow also established the first neurocomputer hardware company in the 1960's (Hecht-Nielson 1990). In 1969, Minsky and Papert proved mathematically that a perceptron could not learn the exclusive-or (XOR) logical function. The XOR function returns a value of true (1) if exactly one of the input values is true (1) and false (0) otherwise. The possibilities for a network with two inputs (x1, x2) and one output (y) follow:

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

Perceptrons could not learn this logical function because the solution space is not linearly separable.
That is, no single straight line can separate the points for which the desired response is 1 from the points for which the desired response is 0. See Figure 2. Minsky and Papert's book, Perceptrons (Minsky and Papert 1969), was a major factor in the decline of artificial neural network research throughout the 1970's. Before the book, there was a tremendous amount of enthusiasm surrounding artificial neural networks. Wild claims were being made concerning artificial neural networks performing tasks that humans ordinarily perform and that artificial brains soon would be created that would be able to "think" and "reason" in much the same way as humans (see discussion in Hecht-Nielson 1990). When Minsky and Papert (1969) revealed the flaws in perceptrons, many researchers lost interest and diverted into artificial intelligence research. Another contributing factor in the declining interest was the lack of computing power (Hecht-Nielson 1990). Since learning algorithms for artificial neural networks can be relatively time consuming, involving many iterations through the data, many interesting problems were computationally prohibited from being solved with neural networks. This changed later after the development of the backpropagation algorithm (which solved the XOR problem) and faster computers (which eased the computational restrictions) (Bishop 1995). Although little work was done on neural networks in the 1970's, two researchers stand out. Kohonen (1972) pioneered work in self-organizing feature maps. Feature maps have a variety of applications such as speech recognition (Kohonen 1988) and musical composition (Kohonen 1989). Also of note was the work of James Anderson. In 1977, along with others, Anderson developed a network known as the Brain-state-in-a-box (Anderson et al. 1977). This network was trained to make medical diagnoses and learn multiplication tables.
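Returning to the perceptron learning rule discussed above: applied to a linearly separable function such as logical AND it converges, while on XOR the same loop would cycle indefinitely without finding correct weights. A minimal sketch, in which the step output, learning rate, and epoch count are illustrative assumptions:

```python
def perceptron_train(samples, epochs=20, alpha=1.0):
    # Perceptron learning rule: weights change only on misclassified inputs
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), t in samples:
            y = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            if y != t:
                w[0] += alpha * (t - y) * x1
                w[1] += alpha * (t - y) * x2
                b += alpha * (t - y)
    return w, b

# AND is linearly separable, so the rule converges; on the XOR table
# shown earlier it never would
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = perceptron_train(AND)
predict = lambda x: 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
```

After training, the learned line w·x + b = 0 separates the single "on" case of AND from the other three inputs.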
A resurgence of research began in the 1980's with the discovery of the generalized delta rule or backpropagation learning algorithm. Backpropagation (in a number of forms) is probably the most widely used algorithm for training artificial neural networks. It was independently discovered by Parker (Parker 1985) and by LeCun (1986). In fact, once the backpropagation algorithm gained acceptance, it was found that Werbos had first described the method in his 1974 Ph.D. dissertation (Werbos 1974). The backpropagation algorithm is a gradient descent method of minimizing the total squared error of the network's output using first-order derivative information. This generalization of the earlier delta rule allows for the training of multiple layered networks. Multiple layered networks are much more complex than simple perceptrons and can therefore be used to solve more complex problems. Further research was spurred by the encouraging writings of the physicist John Hopfield and funding from the Defense Advanced Research Projects Agency (DARPA) (Hecht-Nielson 1990). The input from Hopfield and his distinguished colleagues excited many researchers and gave credibility to the field after many years of dormancy. The influx of funding from DARPA provided the resources necessary for those projects gaining interest. In summation, artificial neural network research has progressed considerably in the past fifty years, with most of the growth occurring in the past fifteen to twenty years. Significant improvements in both structures and learning algorithms have led to the development of powerful networks that have a diverse range of applications.

NEURAL NETWORK NOTATION

Unfortunately, notation used in artificial neural network research is not always consistent. One of the notational differences resides in the order of subscripts in weight variables; some authors reverse the order of subscripts.
The following notation is adopted from Fausett (1994) and will be used consistently throughout this manuscript. Other notations, specific to different algorithms, are defined within the text.

x_i, y_j   Activations of units X_i, Y_j, respectively: for input units X_i, x_i = input signal; for other units Y_j, y_j = f(y_in_j).
y_in_j     Net input to unit Y_j: y_in_j = b_j + Σ_i x_i w_ij.
z_in_j     Net input to hidden unit Z_j.
w_ij       Weight on connection from unit X_i to unit Y_j.
b_j        Bias on unit Y_j.
W          Weight matrix: W = {w_ij}.
w_.j       Vector of weights: w_.j = (w_1j, w_2j, w_3j, ..., w_nj).
s          Training input vector: s = (s_1, s_2, s_3, ..., s_n).
t          Training (or target) output vector: t = (t_1, t_2, t_3, ..., t_n).
x          Input vector (for the net to classify or respond to): x = (x_1, x_2, x_3, ..., x_n).
Δw_ij      Change in w_ij: Δw_ij = [w_ij(new) - w_ij(old)].
α          Learning rate parameter.

These notations are explained more deeply in succeeding sections.

NEURAL NETWORKS DEFINED

Just as notation differs from author to author, the definition of an artificial neural network can differ from author to author. This study adopts the following broad definition: An artificial neural network is a grouping of simple processing units connected by weighted, directed paths which are adjusted through a learning process. It is important to note where artificial neural networks gain the "artificial" part of their name. As stated earlier, it is not known exactly how the brain works. That is, it isn't known how synapses, dendrites, the nucleus, axons, the firing patterns of neurons, and other brain mechanisms really work together to create intelligence. It is also not known what types of special activation functions exist in neurons. But when current knowledge concerning the brain's processing is modeled through connections, nodes (processing units), activations, layers, etc., a powerful computational system is discovered.
Therefore, it is important to realize that artificial neural networks are inspired by what has been learned from biology, but don't necessarily represent biological reality. This should not deter researchers and practitioners from using artificial neural networks. What they gain from their adopted structure and processing methods can provide many benefits over conventional serial processing. Throughout the remainder of this manuscript the general convention of dropping "artificial" from the title will be followed and artificial neural networks will be referred to simply as neural networks, networks, or nets.

There are eight aspects common to all neural networks (Rumelhart, Hinton and McClelland 1988):

- A set of processing units.
- A state of activation for each unit.
- An output function for each unit.
- A pattern of connectivity among units.
- A propagation rule for propagating patterns of activities through the network connections.
- An activation function that combines the signals entering a unit with the current state of that unit to produce a new level of activation for the unit.
- A learning algorithm whereby connection weights are modified through experience.
- A learning paradigm within which the system operates.

Each of these aspects will be discussed in the following sections.

Processing Units

The basic building block of a neural network is the processing unit (Figure 3), also known as a perceptron, node, or unit. In this manuscript these terms are used interchangeably. While some units (inputs and outputs) do represent specific constructs or variables, others (hidden units) do not have an assignable label or meaning. This aspect of neural networks is different from other modeling methods such as regression and structural equations where all units in the model represent or are assigned to specific constructs. Currently, in neural networks, the focus is on the inputs and outputs for explanation while other units are simply computational.
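A processing unit's computation can be sketched in a few lines: it forms a net input from weighted incoming signals plus a bias, then passes that through an activation function (both are detailed in the sections that follow). The logistic and bipolar sigmoid shown here, and the toy weights, are illustrative choices rather than prescriptions:

```python
import math

def logistic(x):
    # Logistic activation, range (0, +1)
    return 1.0 / (1.0 + math.exp(-x))

def bipolar_sigmoid(x):
    # Bipolar sigmoid, range (-1, +1); algebraically 2*logistic(x) - 1
    return (1.0 - math.exp(-x)) / (1.0 + math.exp(-x))

def unit_output(x, w, b, f=logistic):
    # Net input y_in = b + sum_i x_i * w_i, then activation f(y_in)
    y_in = b + sum(xi * wi for xi, wi in zip(x, w))
    return f(y_in)

out = unit_output([0.5, -0.3, 0.8], [0.2, 0.4, -0.1], b=0.1)
```

Because the default activation is the logistic, the unit's output always lies strictly between 0 and 1 regardless of the incoming signals.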
State of Activation

Next, the incoming signal, y_in_j (including weighted inputs from other nodes and the bias), is evaluated by an activation function, f(y_in_j). The output of the activation function determines the state of activation of the processing unit at a specific time.

Output Function

The output function of a processing unit determines the signal that is to be passed to other units in the network. In most cases, the output function is the identity function and the output of the unit at any time is equal to its activation at that time. In some cases, though, the output function is binary, bipolar, or a nonlinear function similar to the activation functions within nodes.

Pattern of Connectivity

Neural networks are made up of highly connected parallel processing units or nodes. These nodes are generally arranged in layers in the network. Technically, networks must have a minimum of two layers (input and output), but can have any number of additional layers. Defining the number of layers in a network often becomes a source of confusion. In this study, a network with an input layer and an output layer is referred to as a single layer network. At first glance this may appear counterintuitive, but the convention is to only enumerate layers where processing occurs. Since the input layer doesn't perform any processing, it is not counted in the number of layers in the network. Hidden layers are layers of nodes located between the input and output layers. Units can be connected either in a feedforward or feedback system. In a feedforward system (Figure 4), units are only connected to units lying in higher layers. Signals are transferred from the input layer nodes to hidden layer nodes and from hidden layer nodes to output layer nodes. A fully connected feedforward network is a special case of neural networks that is often used. In fully connected feedforward networks each node is connected to every node in the next higher layer.
Figure 4 is a fully connected network, but feedforward networks in general are not required to be fully connected. A shortcut notation for feedforward structures is to specify the number of nodes in each successive layer separated by slashes. Thus, a fully connected, feedforward network with six inputs, nine nodes in the first hidden layer, five nodes in the second hidden layer, and three output nodes would be a three layer network with a 6/9/5/3 structure. The network in Figure 4 is a two layer, 3/2/2 network. In a recurrent system (Figure 5), nodes may connect to any node in any layer. Feedback loops may connect to the same node, nodes in the same layer, or nodes in previous layers. The feedback loops create a special challenge in the training process, but the added complexity allows for the solving of more complex problems. In general, the greater the complexity of the network structure, the greater the complexity of the problems that it can be used to solve. Complexity in the network structure depends on a number of structural properties such as: the number of nodes, the number of processing layers, the type of activation functions in each node, and the type of connectivity between nodes.

Propagation Rule

The propagation rule specifies the manner in which the input signals are combined in a processing unit. While other propagation rules exist, all of the networks in this study will use the inner product rule such that the signal to be evaluated will be the inner product of the vector of activation values and the weight vector. In the processing unit, input signals from other processing units and a bias unit are linearly combined to form a single signal to be processed. Specifically, this signal is most often formed as a weighted sum of the incoming signals and the bias. Each incoming signal is multiplied by the weight associated with the connection between the two units.
The weighted bias and the weighted input signals are then summed to provide the input to a unit:

y_in_j = b_j + Σ_i x_i w_ij.

The biological roots of the bias stem from the threshold value in a biological neuron. Most often it is included in a network as another input signal with a fixed activation value of 1.0 and a variable weight. Thus, if the weight associated with a neuron's bias is negative, it serves as the threshold value that must be exceeded for the neuron to provide a positive output. Alternatively, a threshold node can be incorporated into a neural network in place of a bias. A threshold unit is simply a node with an activation of -1 connected to other nodes by an adjustable weighted connection. Mathematically, the bias and threshold are identical except that their signs are opposite; nevertheless, their impact on a neural network is the same. In this study the bias term is used for sake of consistency. For demonstration purposes, the bias is shown as an input separate from the incoming input signals from other nodes. Networks can be constructed without biases, but the bias plays an important role in defining the decision boundaries of the network. Without a bias term, each separating line (surface) of the decision space would have to pass through the origin. The bias term allows the separating line (surface) to detach itself from the origin and provides an increased level of power to the network (Fausett 1994).

Activation Function

Activation functions can take many forms. Four representative functions are shown in Figure 6. A hard limiter function can take on only two values. Typically the function takes on either bipolar values (-1 and +1) or binary values (0 and +1). The hard limiter function is useful when the outputs to be modeled are discrete. Ramping functions are similar to hard limiters, except they linearly model the input over a specified range. In Figure 6, the ramping activation function mirrors the inputs in the interval (0,+1).
Outside this range, the ramping function takes on values of 0 or +1. Sigmoid functions accomplish the same type of representation as the ramping function using a smooth, nonlinear curve. The two types of sigmoid functions commonly used in neural networks are the logistic function,

f(x) = 1 / (1 + exp(-x)),

and the hyperbolic tangent function,

f(x) = (1 - exp(-x)) / (1 + exp(-x)).

The basic logistic function ranges from 0 to +1 while the hyperbolic tangent function ranges from -1 to +1. A common transformation of the logistic function,

f(x) = 2 / (1 + exp(-x)) - 1,

which ranges from -1 to +1, is also used. The logistic and hyperbolic tangent functions are continuously differentiable, which is important in calculating weight changes during the learning process. This point is discussed in further detail in the section on learning algorithms. Each of these activation functions typically ranges from -1 to +1 or from 0 to +1. The hard limiter function is useful when outputs are binary or bipolar. The ramping and sigmoid functions are useful when the outputs to be modeled are continuous, but no specific rules exist which define the activation function that is most appropriate. Generally, the more complex the relationships to be modeled, the more complex the activation functions need to be. Also, many networks incorporate linear activation functions in output nodes where the output of the network is not constrained to either (-1,+1) or (0,+1). The output of a node with a linear activation function is simply the sum of the weighted inputs and bias. Most neural networks use some form of sigmoid function in the hidden layer(s) and either sigmoid or linear functions in the output layer.

Learning Algorithm

A learning algorithm is a set of rules which govern the weight changes in the learning process. Knowledge or information about relationships between inputs and outputs is stored in the weight matrix of neural networks.
Processing units store short term information concerning the state of each unit at a point in time, but the weights for each connection and the pattern of connections are where long term information is stored (Rumelhart, Hinton and McClelland 1988). The values of weights in a neural network are adapted through a learning process. Haykin (1994) defines learning as follows: Learning is a process by which the free parameters of a neural network are adapted through a continuing process of stimulation by the environment in which the network is embedded. The type of learning is determined by the manner in which the parameter changes take place.

Learning Paradigm

A learning paradigm is the environment that the learning algorithm works within during the learning process. The environment defines specific attributes of the learning process such as when weights are updated, when the process terminates, and how the learning algorithm is applied to the network.

A LEARNING PROCESS HIERARCHY FOR NEURAL NETWORKS

Figure 7 graphically illustrates a learning process hierarchy. Various frameworks have been developed that capture and categorize neural networks (e.g. Haykin 1994; Hinton 1989; Hush and Horne 1993; Knight 1990; Lippmann 1987; Rumelhart, Hinton and McClelland 1988). The focus of this study is on developing neural network applications useful in the study of the NPD process. Therefore, while there may exist exotic and unique learning algorithms and learning paradigms that cannot be categorized within the hierarchy shown in Figure 7, they are beyond the scope of this study. Instead, a broad and comprehensive foundation of neural networks, based on Haykin (1994), is presented. Most of the popular learning rules and learning paradigms have their roots in this hierarchy. Numerous extensions and adaptations to the basic architectures exist and are still being developed. Hundreds
of learning rules have been published in relevant neural network journals in the past few years. Unfortunately, once they are published, little work is done to compare them to each other to determine specific advantages and disadvantages of each method (Prechelt 1996). Many of these algorithms are developed either to solve very specific problems or to provide linkages between artificial neural networks and biological neural networks. Since the focus of this study is to develop neural network applications useful in studying the NPD process, an in-depth analysis of the nuances associated with each of the numerous extensions is not the goal. Instead, well established and commonly used methods and structures will be incorporated into the applications studied here.

Supervised Learning Paradigm

Supervised learning is also known as associative learning. There are two types of associative learning: autoassociative and heteroassociative (Fausett 1994; Haykin 1994). In autoassociative learning the input vector is identical to the output vector in the training process. Therefore, the number of inputs equals the number of outputs. Autoassociative networks will be discussed in more detail in the section which compares neural networks to traditional statistical methods. Heteroassociative learning involves the mapping of an input vector on to a different output vector. In heteroassociative learning the number of inputs and outputs may or may not be equal. In supervised learning, a target vector is available which defines the desired output(s) of the network for a given input vector. A learning algorithm is then used to adapt the weights such that the desired output(s) are reproduced when the input vector is propagated through the network. Weights are adjusted iteratively according to the chosen learning rule as training data propagates through the network.
Each weight change is called an iteration and each pass through the training data is called an epoch. Note that a weight change can occur either after the presentation of a single input vector or after an entire epoch. On-line learning is when an iteration is taken after each vector presentation; batch learning is when the weight updates occur after each epoch (Ripley 1996). Except for very simple demonstration examples, training a network involves a number of epochs. Weights are adjusted until the network converges on some stopping rule specified either in the learning rule or by the user. Generally, training terminates either when the output error of the network reaches a prespecified value or after a certain number of epochs. Supervised learning can be either static or dynamic. In static models the network is trained using a training data set. After training is complete the weights are held constant while other input vectors are propagated through the network to yield network output values for each input vector. In dynamic models, the network may be trained with a training data set, but the weights are not held constant and learning continues as new input vectors are introduced to the network. Static learning is sometimes referred to as off-line learning and dynamic learning is called on-line learning (Haykin 1994), but this can lead to confusion with the on-line and batch terms. In this study, the static and dynamic terms are used and on-line refers to the weight update scheme.

Unsupervised Learning Paradigm

Unsupervised learning is performed in the absence of a desired or target output vector. Only input values are supplied in the unsupervised training process. Without output values to provide the network with an error or deviation from a desired value, the network undergoes self-organization. Through repeated training iterations, input nodes that are similar in activation form clusters in the output nodes.
For this reason, networks that incorporate unsupervised learning are often called self-organizing systems. The network learns to respond to patterns or clusters in the training data without any a priori specification of output classes or categories. There are a variety of applications in NPD for both supervised and unsupervised learning paradigms. For example, unsupervised neural networks could be used to discover clusters of both independent and dependent variables as a data reduction technique similar to exploratory factor analysis. From these clusters, a supervised neural network could be built to discover the relationships between the clusters of independent (input) variables and dependent (output) variables. Independent variables could include factors that are thought to be related to successful development and dependent variables could be various measures of NPD success or failure. Once trained, the supervised network could be used to analyze the relationships between inputs and outputs and to predict outcomes based on inputs from NPD projects to which the network has not yet been exposed.

Hebbian Learning Algorithm

Most learning algorithms have roots that can be traced to the Hebbian rule. Donald Hebb introduced a simple learning rule that has strong neurobiological support (Brown, Kairiss and Keenan 1990; Kelso, Ganong and Brown 1986). The rule is based on Hebb's description of learning presented earlier in the historical section of this chapter. Perhaps the easiest way to understand the Hebb rule is to consider a network with binary nodes. Nodes with an activation of 0 are considered to be off while nodes with an activation of 1 are considered to be on. The Hebb rule states that if two connected units are both on, then the weight associated with the connection between the nodes should be strengthened. Mathematically, the Hebb rule can be stated in terms of the weight update formula:

w_i(new) = w_i(old) + x_i y_i.
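For binary units, the update formula above reduces to a few lines of code; the three-input example below is illustrative:

```python
def hebb_update(w, x, y):
    # Hebb rule: w_i(new) = w_i(old) + x_i * y
    return [wi + xi * y for wi, xi in zip(w, x)]

# Binary units: weights change only when the input unit and the
# output unit are both "on" (activation 1)
w = [0.0, 0.0, 0.0]
w = hebb_update(w, [1, 0, 1], 1)   # connections from active inputs strengthen
w = hebb_update(w, [1, 1, 0], 0)   # output unit off: no learning occurs
```

After the two presentations, only the weights whose input and output were simultaneously active have grown.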
Note that a weight change (network learning) takes place only when both units have activations of 1.

The Hebb rule is not often used in modern neural network applications, but it has served as a launching point for a variety of algorithm variations. These new variants are much more powerful and have overcome many of the shortcomings of the original Hebb rule. Nonetheless, the simplicity of the Hebb rule has led to its common use as a pedagogical tool and it is often used in examples to demonstrate the learning process in neural networks. See Appendix B for a Hebb net demonstration.

Competitive Learning Algorithms

Competitive learning algorithms evolved out of the Hebb rule. The three basic components of a competitive learning algorithm are: (1) start with a set of units that are all the same except for some randomly distributed parameter which makes each of them respond slightly differently to a set of input patterns; (2) limit the "strength" of each unit; and (3) allow the units to compete in some way for the right to respond to (learn from) a given subset of inputs (Rumelhart and Zipser 1985).

In the learning phase of competitive neural networks, output units compete for the opportunity to become the active unit. The active unit, or winner, determines the weights to be adapted in each iteration of the learning process. Once the network is trained, the outputs become feature detectors and recognize patterns in the data. Neural networks trained with a competitive learning algorithm are often used to map patterns in data. Typically, competitive neural networks are used for clustering, vector quantization, dimensionality reduction, and feature extraction (Krose and Smagt 1993). Popular classes of neural networks based on competition include the Kohonen self-organizing feature map (SOFM), learning vector quantization (LVQ), and counterpropagation (Fausett 1994). Appendix B includes a detailed demonstration of a Kohonen SOFM.
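A minimal winner-take-all sketch makes the competitive process concrete. The data, learning rate, and distance measure below are hypothetical choices for illustration: two output units start with slightly different weights, compete for each input, and only the winner's weights move toward that input.

```python
# Sketch: winner-take-all competitive learning (hypothetical example).
# Each output unit holds a weight vector; units compete for each input,
# and only the winning unit's weights are adapted toward that input.

def winner(units, x):
    # Squared Euclidean distance decides the competition.
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in units]
    return dists.index(min(dists))

def compete_step(units, x, alpha=0.5):
    j = winner(units, x)
    units[j] = [wi + alpha * (xi - wi) for wi, xi in zip(units[j], x)]
    return units

# The units start nearly identical except for a small "randomly
# distributed" difference, then learn two clusters of inputs.
units = [[0.1, 0.0], [0.0, 0.1]]
for _ in range(20):
    units = compete_step(units, [1.0, 0.0])
    units = compete_step(units, [0.0, 1.0])
```

After training, each unit has become a feature detector for one of the two input patterns.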
Error-correction Learning Algorithms

Probably the most popular class of learning algorithms is the error-correction algorithm. Error-correcting algorithms are based on a gradient descent learning algorithm originally called the ADALINE learning rule by Widrow and Hoff (1960). They were the first to develop a learning rule that incorporated an error term. It is also known as the delta rule, the Widrow-Hoff rule, or the LMS (least mean square) rule (Hertz, Krogh and Palmer 1991).

The goal in error-correcting algorithms is to minimize a cost function based on an error signal of the network (Haykin 1994). The error signal of the network, e_k, is almost always defined as the difference between the desired output of the network, t_k, and the actual output of the network, y_k. Thus, e_k = t_k - y_k, where k denotes the index of the output units. The cost function is usually the sum of squared errors of the network, E = (1/2) * Σ_k e_k^2. Error-correcting algorithms use gradient descent to minimize the cost function. The change in the weights according to this method is expressed as Δw_kj = α * e_k * x_j, where α is a constant representing the learning rate and x_j is the activation of input unit j. Thus, the weight change is proportional to the product of the error at each output node and the activation at the corresponding input node.

The backpropagation algorithm is an extension of the delta rule and is the most popular and useful learning rule. Many variations of the backpropagation algorithm have been developed which increase the speed with which it converges, increase its resistance to stopping at local minima, or improve various operating characteristics of the algorithm. Some of these variations are discussed in more detail in Appendix C.

Neural networks are created by combining algorithms, structures, and environments to meet the needs of specific problems. In general, this combination is made based upon characteristics of the data.
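The delta rule's effect on the cost function can be sketched directly: one gradient-descent step lowers E = (1/2) * Σ_k e_k^2. The weight matrix, input vector, targets, and learning rate below are hypothetical values chosen for illustration.

```python
# Sketch: one delta-rule (LMS) step lowers the sum-of-squared-error
# cost E = 1/2 * sum_k (t_k - y_k)^2. All values are hypothetical.

def outputs(W, x):
    # One linear output unit per row of the weight matrix W.
    return [sum(wkj * xj for wkj, xj in zip(row, x)) for row in W]

def cost(W, x, t):
    return 0.5 * sum((tk - yk) ** 2 for tk, yk in zip(t, outputs(W, x)))

def delta_step(W, x, t, alpha=0.1):
    # Delta rule: the change in w_kj is alpha * e_k * x_j.
    y = outputs(W, x)
    e = [tk - yk for tk, yk in zip(t, y)]
    return [[wkj + alpha * ek * xj for wkj, xj in zip(row, x)]
            for row, ek in zip(W, e)]

W = [[0.0, 0.0], [0.0, 0.0]]
x, t = [1.0, 2.0], [1.0, -1.0]
before = cost(W, x, t)
W = delta_step(W, x, t)
after = cost(W, x, t)
```

A single step here cuts the cost from 1.0 to 0.25; repeated steps continue the descent toward a minimum.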
When information is not known concerning desired or target outputs of the network, unsupervised learning algorithms are generally chosen; otherwise, error-correcting supervised learning algorithms make use of the known outputs. A common first step in designing a network is to choose an algorithm or family of algorithms based on the existence of target values. Next, the structure and environment are chosen based on trial and error, experimentation, or some form of cross-validation.

For example, a researcher wanting to reduce the dimensionality of a data set may choose to create an unsupervised, self-organizing network that incorporates some form of competitive learning. Once this task is completed, the researcher may want to create a network that classifies samples based on some known criteria. A supervised network could then be built which uses the outputs from the previous network as inputs. The outputs would be the categories of the known criteria. This network could be solved using a variety of error-correcting algorithms. In summary, a researcher using neural networks to solve problems is armed with a toolbox full of tools. One of the tasks the researcher faces is choosing the right tools for the job. These tools are discussed in further detail throughout the rest of the manuscript.

DISCUSSION

There are three basic decisions that need to be made when designing a neural network. The network must adopt a specific: (1) structure; (2) learning paradigm (supervised or unsupervised); and (3) learning algorithm. Taken together, a network that adopts a certain structure, learning paradigm, and learning algorithm is said to belong to a specific class of neural networks or have a specific architecture. For example, multiple-layer feedforward networks (MFNs) trained using supervised backpropagation are a specific class of neural networks and are said to have a specific architecture.
Probability neural networks represent another architecture which is similar in structure to MFNs, but quite different in the way in which the structure is created and the learning paradigm which is adopted.

Structure in neural networks is defined by the number of layers in the model, the number of nodes in each layer, the activation functions at each node, and the pattern of connections between nodes (e.g., feedforward or recurrent). The number of input nodes is simply the number of input variables and the number of output nodes is the number of output variables to be modeled. The number of hidden layers and number of nodes per hidden layer are not as easy to determine a priori. Typically the number of hidden nodes and number of hidden layers is determined through trial and error, experimentation, or some form of validation procedure. The pattern of connections between nodes refers to whether the network is strictly feedforward or contains recurrent connections. The connection pattern may be fully connected or partially connected. In general, the more complicated the underlying relationships to be modeled, the more complicated the network must be to represent those relationships.

The network must adopt either a supervised or unsupervised learning paradigm. If information is available concerning the actual or desired values of outputs, a supervised learning paradigm is likely to be used. If this information is not available, an unsupervised paradigm must be chosen.

A great number of learning rules are available for training neural networks, but they all perform the same basic function of adjusting the weights corresponding to connections between nodes. Selection of an algorithm is dependent upon the demands of the particular problem. Each algorithm has certain characteristics (advantages and disadvantages) that must be matched to the characteristics of the problem.
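A fully connected feedforward structure can be described compactly by its layer sizes. The sketch below uses the 43-input, 5-hidden-node, 1-output structure discussed later in this study; the logistic activations, bias weights, and random initialization are assumptions made for illustration, not a specification of the system built here.

```python
import math
import random

# Sketch: a fully connected multilayer feedforward structure defined
# by its layer sizes (43 inputs, 5 hidden nodes, 1 output). Logistic
# activations and per-node bias weights are assumed for illustration.

def make_weights(sizes, seed=0):
    rng = random.Random(seed)
    # One weight matrix per layer transition; each node's row carries
    # one weight per incoming connection plus a bias weight.
    return [[[rng.uniform(-0.5, 0.5) for _ in range(sizes[i] + 1)]
             for _ in range(sizes[i + 1])]
            for i in range(len(sizes) - 1)]

def forward(weights, x):
    a = list(x)
    for layer in weights:
        # Logistic activation of the weighted sum (bias input fixed at 1).
        a = [1.0 / (1.0 + math.exp(-sum(w * v
                                        for w, v in zip(row, a + [1.0]))))
             for row in layer]
    return a

sizes = [43, 5, 1]
net = make_weights(sizes)
y = forward(net, [0.0] * 43)   # one output value in (0, 1)
```

Changing the `sizes` list is all that is needed to explore alternative hidden-layer configurations during trial-and-error structure selection.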
Finally, there are two important features that distinguish neural networks from other methods of data analysis: parallelism and distributed memory (Rumelhart, Hinton and Williams 1986). In general, the parallel nature of processing in neural networks gives them the ability to deal with noisy data better than traditional methods. For example, networks designed to recognize printed digits from video inputs (optical character recognition) can recognize ill-formed, rotated, or handwritten digits that are similar to the perfectly formed digits that were used in network training.

The distributed nature of memory in artificial neural networks refers to the way knowledge is represented throughout the network. Individual hidden nodes are normally not interpretable as specific causes or factors in relationships between input and output nodes. Instead, the overall pattern of the weights associated with connections between nodes represents the relationships between inputs and outputs. Knowledge is distributed in a parallel sense throughout the network through the pattern of weights which are adapted to fit the data in the training phase (Rumelhart, Hinton and McClelland 1988). These two features of neural networks (parallelism and distributed memory) are the driving forces behind the power and speed of neural networks. They provide neural networks their speed, fault tolerance, pattern recognition abilities, and generalizability.

Chapter 3

STATISTICS, GENERALIZABILITY, AND A RESEARCH PROPOSAL

This chapter begins with a number of comparisons between neural network and traditional statistical methods of discrimination, classification, and prediction. Next, important issues in controlling generalizability in neural networks are discussed. Finally, a proposal for applications of neural networks to NPD data analysis is presented.
NEURAL NETWORKS AND STATISTICS

Neural network research has benefited from multidisciplinary research efforts as contributions have come from cognitive science, neurobiology, engineering, computer science, etc. For the most part, though, the focus has been practical in nature. Networks have been built and algorithms have evolved to solve practical problems mostly in pattern recognition and prediction. The main concern in this area is in correct classification and accurate forecasts. Existing structures and learning algorithms are modified and adapted to best meet the specific circumstances of the current problem. When using neural networks for pattern recognition, the goal is to build a network that learns from data in such a way that it is useful when exposed to new data. No a priori knowledge regarding the relationships between independent and dependent variables is assumed. A network with this ability to properly predict future values or properly classify new data is said to have good generalization.

On the other hand, statisticians generally seek to build a model of the relationships between independent and dependent variables from an a priori knowledge of these relationships. Their focus is on finding the model that best represents the phenomena they are studying. Thus, in general, statisticians take more of a macro approach to problems while neural network research has been motivated from a micro approach.

While there are differences in the purposes and driving forces behind the use of neural networks and traditional statistical methods, these tools have many underlying similarities. Recently, statisticians (Cheng and Titterington 1994; Ripley 1993, 1994) and econometricians (Kuan and White 1994) have taken interest in neural network models. This interest and further research is likely to benefit all three research streams.
Statisticians and econometricians get a new set of tools to add to their toolkit and neural network researchers may make important breakthroughs as a more theoretical approach is taken. For example, neural network users have struggled with issues such as: establishing appropriate network complexity, monitoring and controlling the learning process to inhibit under-learning and over-learning, and determining theoretical limits of networks such as lower bounds on network error from characteristics of the data set. Rigorous theoretical analysis of neural network performance, structure, and learning by statisticians and econometricians may provide answers to these and other important questions.

Recent review articles have highlighted many of the similarities and differences between neural network models and traditional statistical methods (Cheng and Titterington 1994; Kuan and White 1994; Ripley 1993, 1994). Most traditional methods have a neural network counterpart which can be considered either a duplicate or a variant of the statistical technique. While an exhaustive comparison between statistical and neural network methods is beyond the scope of this study, it is informative to understand the statistical roots of neural networks. Familiar statistical methods that are used in classification and discrimination problems are the focus of this comparison. Table 1 lists the statistical model, its neural network variant, and relevant citations.

For example, consider three statistical methods which are commonly used: principal components analysis, linear regression, and logistic regression. Principal components analysis is used as a tool for dimensionality reduction. Orthogonal factors are estimated such that linear composites of the original variables can be used in further analyses. Principal components can be extracted using a variety of neural network methods (Bishop 1995; Ripley 1996).
One method for determining the principal components of a given set of variables is to create a network with an input layer, a single hidden layer, and an output layer, with linear activation functions in both the hidden and output nodes. To extract m components from a set of n variables (m < n), the network consists of n