This is to certify that the dissertation entitled

NEURAL NETWORK APPLICATIONS IN NEW PRODUCT DEVELOPMENT RESEARCH

presented by

Robert Jeffrey Thieme

has been accepted towards fulfillment of the requirements for the Ph.D. degree in Marketing.

April 14, 1998

NEURAL NETWORK APPLICATIONS IN NEW PRODUCT DEVELOPMENT RESEARCH

By

Robert Jeffrey Thieme

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Marketing and Supply Chain Management Department

1998

ABSTRACT

NEURAL NETWORK APPLICATIONS IN NEW PRODUCT DEVELOPMENT RESEARCH

By

Robert Jeffrey Thieme

Two types of artificial neural network decision support systems are created, trained, and evaluated to serve as guides for managers making complex new product development (NPD) go/no-go and resource allocation decisions in the NPD process. The first is based on a multilayer feedforward network (MFN) structure and is referred to as the multilayer feedforward network decision support system (MFN-DSS).
In this system, ten independently trained networks are combined to predict overall success or failure, the degree of financial success, and the degree of technical success for new product projects. Each network structure has forty-three variable inputs, five hidden nodes, and one output. Network output corresponds to one of three aspects of NPD success. A piecewise cross-validation procedure is used to determine the appropriate network structure. The second neural network architecture is a probability neural network (PNN) and provides output probabilities of success and failure that approach Bayesian optimal. Data to train and test the networks are based on historical data from 612 NPD projects. After training with data from 400 projects, the MFN-DSS correctly classifies 96.7% of 212 additional projects and the PNN correctly classifies 92.9% of 212 additional projects. These results are compared with k-nearest neighbor, logistic regression, ordinary least squares (OLS) regression, and discriminant analysis methods and found to be superior. Finally, usage of the decision support systems as an aid to managers in new product project selection is discussed and demonstrated using an actual case study.

Copyright by
ROBERT JEFFREY THIEME
1998

ACKNOWLEDGMENTS

I am very appreciative of the support of many people who had confidence in me and generously gave their time and effort. I begin by thanking my parents and wife for their support throughout the whole process. They supported me when I quit a good-paying job with opportunities for advancement in favor of a four- to five-year journey through a doctoral program. I'm not sure any of us fully understood what I was doing, but their confidence in my decisions gave me confidence that I could succeed. Of particular note was a period in the beginning of my second year when I changed majors from Operations Management to Marketing.
This was probably the most difficult decision to make during the program because it was based more on intuition and “gut feel” than on data and quantitative analysis. Even in this tough time when I was unsure about my decision, the support and confidence from my family helped me focus on making it work. I would like to thank Dr. Everett Adam, Dr. Ronald Ebert, and Dr. Lori Franz at the University of Missouri-Columbia for providing the spark that led to my desire to become a professor. I am indebted to them for inspiring me to pursue this wonderful, challenging, and fulfilling profession. I would also like to thank my dissertation co-chairs, Dr. X. Michael Song and Dr. Roger Calantone. In my opinion, choosing a committee and a chair for a dissertation is the most important step in the process. I was advised by various people that choosing co-chairs would probably infuse conflict and could be a mistake. I spent a lot of time on this decision and have no regrets. Dr. Song and Dr. Calantone played different, yet important roles in my program. They both provided guidance, direction, support, and encouragement during each stage of my dissertation and in my job search. I would also like to thank the professors who took their doctoral seminars seriously even when some of my fellow doctoral students chose not to. In retrospect, I could have spent less time working on assigned readings and course projects in many of my doctoral seminars and still made it through the program; but I have no regrets in this regard either. Because of the hard work and dedication of Dr. Stanley Hollander, Dr. Cornelia Droge, Dr. John Hollenbeck, Dr. Ram Narasimhan, Dr. Glenn Omura, Dr. Tom Page, Dr. Lloyd Rinehart, and Dr. Calantone, I feel that I have a well-rounded doctoral education that will pay dividends in the future. It is a shame that some of my fellow students chose to do the minimum expected in these classes, because they have missed out on a great opportunity. Dr.
Droge deserves a special thanks for her willingness to discuss many of the practical issues involved in being a professor. She spends a lot of time working with all incoming doctoral students trying to prepare them for what is ahead. She has always been willing to share personal experiences to make a point, even when the experiences are not necessarily flattering. I learned many things from her that paid off enormously as I progressed through my dissertation and job search. Furthermore, I don’t expect that the payoffs will stop any time soon. I imagine that her words of wisdom will echo in my mind throughout my career - sometimes joyfully and sometimes hauntingly. I would also like to thank many of my fellow doctoral students. Without the guidance and advice from those who were a step or two ahead of me, namely, Jeff Schmidt, Steve Clinton, and Matt Myers, things would have been much more difficult. The insight from their experience was extremely helpful. In addition, Tom Goldsby and Jim Eckert have made the program interesting and enjoyable. While we didn’t always see eye to eye in our seminars and the basketball wasn't fit for prime time, their friendship was much enjoyed and I wish them well in the future. I would also like to thank the staff in the Marketing and Supply Chain Management Department, namely, Cheryl Lundeen, Renee Dixon, Laurie Fitch, Cindy Seagraves, Kathy Mullins, and Marilyn Brookes. Their hard work and dedication contributes to the success of the department and sometimes their effort is overlooked. Finally, I would like to thank my wife, Kelly. She has been extremely supportive throughout the program. She worked long hours teaching first graders to read; took graduate classes in the evenings; made do without new clothes, vacations, bedroom furniture, etc.; used weekends as “catch-up” time instead of fun time; and spent many Friday and Saturday evenings relaxing at home with a video instead of going out.
I cannot express my appreciation for her patience and support. I can't wait to enjoy the fruits of our labor.

TABLE OF CONTENTS

LIST OF TABLES .......................................... ix
LIST OF FIGURES .......................................... x

CHAPTER 1
INTRODUCTION .............................................. 1
  Artificial Neural Networks .............................. 4
  Neurobiological Roots ................................... 7
  Focus and Contribution of This Study .................... 9
  Organization of the Manuscript ......................... 12

CHAPTER 2
NETWORK STRUCTURES AND TRAINING PROCESSES ................ 15
  An Historical Look at Neural Network Research .......... 15
  Neural Network Notation ................................ 19
  Neural Networks Defined ................................ 20
    Processing Units ..................................... 21
    State of Activation .................................. 21
    Output Function ...................................... 22
    Pattern of Connectivity .............................. 22
    Propagation Rule ..................................... 23
    Activation Function .................................. 25
    Learning Algorithm ................................... 26
    Learning Paradigm .................................... 26
  A Learning Process Hierarchy for Neural Networks ....... 27
    Supervised Learning Paradigm ......................... 28
    Unsupervised Learning Paradigm ....................... 29
    Hebbian Learning Algorithm ........................... 30
    Competitive Learning Algorithms ...................... 31
    Error-Correction Learning Algorithms ................. 32
  Discussion ............................................. 34

CHAPTER 3
STATISTICS, GENERALIZABILITY, AND A RESEARCH PROPOSAL .... 37
  Neural Networks and Statistics ......................... 37
    Advantages and Disadvantages of Neural Networks ...... 41
  Generalization Strategies for MFNs ..................... 43
  Proposed Applications in New Product Project Selection . 53
  Potential Contributions of This Study .................. 56

CHAPTER 4
DATA AND ANALYSIS ........................................ 57
  Data ................................................... 57
    Network Inputs ....................................... 57
      Firm Skills and Resources .......................... 58
      Newness to Market and Newness to the Company ....... 59
      Product Market Characteristics ..................... 60
    Network Outputs ...................................... 60
  Analysis ............................................... 61
    Methodology for Creating, Training, and Evaluating an MFN-DSS ... 63
      Stage I: Determine the neural network structure appropriate for good generalization ... 63
      Stage II: Train each network in the MFN-DSS with training data using the Levenberg-Marquardt second-order method ... 64
      Stage III: Evaluate the performance of the MFN-DSS with evaluation data ... 65
    Methodology for Creating, Training, and Evaluating a PNN ... 68

CHAPTER 5
RESULTS AND DISCUSSION ................................... 69
  Results ................................................ 69
  Discussion ............................................. 71
  Case Study ............................................. 74
  Future Directions ...................................... 76
  Limitations ............................................ 77
    Limitations of This Study ............................ 77
    Limitations of Neural Networks in General ............ 78

APPENDICES
  APPENDIX A: Tables and Figures ......................... 81
  APPENDIX B: Neural Network Demonstrations .............. 97
  APPENDIX C: Neural Network Methodology Details ........ 107

LIST OF REFERENCES ...................................... 139

LIST OF TABLES

Table 1: Neural Network Variants of Statistical Methods .... 81
Table 2: Measurement Items ................................. 82
Table 3: Summary of Cross-validation Procedures for MFNs ... 84
Table 4: Summary of Predictive Performance for All Models .. 84
Table 5: Success/Failure Prediction Performance ............ 85
Table 6: Success/Inconclusive/Failure Prediction Performance ... 86
Table 7: MFN-DSS Performance on Case Study Data ............ 86
Table 8: Application of Decision Rules to Case Study Data .. 87

LIST OF FIGURES

Figure 1: Basic Neuron Components .......................... 89
Figure 2: Desired Response for the XOR Logic Function ...... 89
Figure 3: Processing Unit .................................. 90
Figure 4: Example of a Feedforward Artificial Neural Network ... 91
Figure 5: Example of a Recurrent Artificial Neural Network . 91
Figure 6: Common Activation Functions ...................... 92
Figure 7: The Learning Process Hierarchy ................... 92
Figure 8: Function Approximation Under-fitting and Over-fitting ... 93
Figure 9: Training and Validation Errors ................... 93
Figure 10: Probability Neural Network (PNN) Structure ...... 94
Figure 11: Effect of Changes in Sigma in PNNs .............. 95
Figure 12: Kohonen Two-Dimensional SOFM .................... 96
Figure 13: Cascade Correlation Structure ................... 96
CHAPTER 1

INTRODUCTION

In order to be successful in a competitive environment, firms must create and sustain a competitive advantage (Porter 1985). The ability to develop and launch new products successfully has been identified as a major determinant in sustaining a firm's competitive advantage (Clark and Fujimoto 1991; Day and Wensley 1988) and increasing the value of the firm (Day and Fahey 1988). Firms that are successful in developing unique new products that provide superior benefits are more likely to increase market share and profits. Managers charged with project selection decisions in the new product development (NPD) process, such as go/no-go choices and specific resource allocation decisions, are faced with a complicated problem. On one hand, they have to deal with limited resources; yet on the other, they have an overwhelming number of new product ideas. Most marketing and R&D departments generate plenty of ideas that appear to have great potential. Their major problems are identifying which of the ideas are worth pursuing and determining which of the firm’s scarce resources should be allocated to each project. Product champions often push their product ideas and request the firm’s best and brightest R&D people for their team. Most projects appear to be very promising at the time of proposal, or they wouldn't have garnered support and been proposed. Despite this long list of seemingly high potential product ideas for management to choose from and pledged support from product champions, we still see a relatively stagnant success rate near 59% for new products in the marketplace (Griffin 1997; Wind and Mahajan 1997). Why so many failures? Why can’t management “weed out” these failures and concentrate the firm’s resources on the successes? Among the reasons are the following. First, the NPD process is extremely complex and innovation demands participation from many different functional areas.
Each functional area has its own goals, languages, procedures, and thought worlds (Griffin and Hauser 1992). This inevitably causes miscommunications, misunderstandings, and ultimately, missed opportunities. Second, project screening is difficult. Any attempt to predict the future is a risky venture, but with new products this is especially so. The product is not well defined; it is still simply an idea. Managers must decide on the viability of a product idea given incomplete information regarding the nature of the technology, market, and competition. Third, once a portfolio of product ideas is chosen, management must decide how to allocate specific resources to each project. Given incomplete information, managers are expected to make tough choices concerning project selection and resource allocations. Over the past thirty years NPD research has uncovered many robust relationships between various organizational, competitive, market, and technical factors and successful development. Empirical research began with studies in the late 1960's and throughout the 1970's (e.g. Cooper 1979; Cooper 1979; Myers and Marquis 1969; Rothwell 1974; Rothwell et al. 1974). In the SAPPHO studies (Rothwell 1974; Rothwell et al. 1974) and the NewProd project (Cooper 1979; Cooper 1979), successful and unsuccessful NPD projects were compared. In these early studies, factor analysis and linear discriminant analysis were used to determine which factors were most useful in discriminating between successful and unsuccessful projects. After numerous studies throughout the 1970's and 1980's produced similar lists of factors providing discrimination between success and failure in the NPD process, researchers began to develop conceptual models (Brown and Eisenhardt 1995; Griffin and Hauser 1996; Olson, Walker and Ruekert 1995; Zirger and Maidique 1990) of the NPD process. 
From these and other conceptual models, researchers began testing models or portions of models using multivariate statistical techniques such as regression, MANOVA, path analysis, and most recently structural equation modeling (Calantone, Schmidt and Song 1996; Song and Parry 1997). These conceptual models have been useful to both academicians and practitioners. They provide theoretical frameworks with which the NPD process can be studied and they provide guidance to practitioners in both designing NPD structures and managing the process. These models provide visual descriptions of an assortment of factors that have been found to be covariates of successful projects under a variety of conditions. Unfortunately, most of these models are not directly applicable to the problem of predicting whether or not an ongoing project will ultimately be successful. In order to predict the outcome of a project, a technique must be able to recognize patterns in the project that are consistent with patterns in previously studied projects. Recently, a new tool has been developed which solves this type of pattern recognition problem. Artificial neural networks have been created and used in a variety of applications. Their use in the study of the NPD process may provide a new tool that can be used during the process to predict the likelihood of success. This study will apply the powerful tools of artificial neural networks to the study of success and failure in NPD projects. Artificial neural network decision support systems will be created and trained to recognize patterns in the relationships between NPD success and failure and a variety of antecedent factors. The result will be “artificial NPD brains” which will simulate a manager with extensive experience in both successful and unsuccessful NPD projects. Given this experience, the networks will be able to provide managerial guidance and insights.
For example, they could be used by practitioners early in the NPD process as a screen to select projects with the highest probability of success. Training programs for development personnel could be based on knowledge gained from the trained network. Also, they could be used as a diagnostic tool throughout the development process to highlight areas where additional attention and resources are warranted. In marketing, a few recent applications of artificial neural networks have been seen in the journals. Artificial neural networks outperformed traditional discriminant analysis and logistic regression in an industrial market segmentation problem (Fish, Barnes and Aiken 1995). In another study, artificial neural networks were compared to logistic regression models in a number of classification problems (Kumar, Rao and Soni 1995). This study will be the first to incorporate artificial neural network data analysis into the study of the NPD process.

ARTIFICIAL NEURAL NETWORKS

Artificial neural networks, also known as connectionist models or parallel distributed processing models because of their highly connected parallel structure, represent a relatively new paradigm in computing. Traditional computers process information sequentially according to a prespecified set of instructions. Artificial neural networks process information in parallel without prespecified programming. Artificial neural networks have a wide variety of applications. While an exhaustive list of applications is beyond the scope of this study, the examples which follow serve as an indication of the diversity of possible uses of artificial neural networks. One of the first applications was as a noise filter in telephone lines. Echoes developed as telephone connections became increasingly long and transmission times approached a half second. Artificial neural networks provided an adaptive method of canceling the echo in these long transmissions (Fausett 1994).
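The source does not spell out the cancellation algorithm, but the classic adaptive approach is a least-mean-squares (LMS) filter that learns to predict the echo from the far-end signal and subtracts that prediction from the microphone signal. The sketch below is illustrative only; the tap count, step size, and synthetic signals are assumptions, not details from Fausett (1994).

```python
import numpy as np

def lms_echo_cancel(far_end, mic, taps=8, mu=0.01):
    """Adapt a FIR filter so its output mimics the echo in `mic`,
    then subtract it, leaving the residual (near-end) signal."""
    w = np.zeros(taps)
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = far_end[n - taps:n][::-1]   # most recent far-end samples first
        echo_est = w @ x                # current echo estimate
        err = mic[n] - echo_est         # residual after cancellation
        w += mu * err * x               # LMS weight update
        out[n] = err
    return out

# Synthetic example: the "echo" is a delayed, attenuated copy of the far end.
rng = np.random.default_rng(0)
far = rng.standard_normal(5000)
echo = 0.5 * np.concatenate([np.zeros(3), far[:-3]])
residual = lms_echo_cancel(far, echo)
# After adaptation, the residual power is far below the raw echo power.
```

The same adapt-while-running behavior is what made these filters attractive for long telephone circuits: the echo path is never known in advance, so the weights track it on line.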
Image compression is a common application. An artificial neural network can be trained as both an image encoder and decoder. The first part of the network structure encodes the image to reduce its dimensionality. The last part of the network decodes the compressed image to its original (or near original) signal (Cottrell, Munro and Zipser 1987). A variety of optimization problems have been solved using artificial neural networks. Hopfield and Tank (1985) pioneered work in this area by applying artificial neural networks to the traveling salesman problem. Other artificial neural network applications in this area include the job-sequencing problem and an analog to digital signal converter (Tank and Hopfield 1986). Sejnowski and Rosenberg (1987) developed an artificial neural network to convert English text into verbal speech. Their network, NETtalk, has been tested in a variety of ways. In one case, the network was trained using the 1,000 most often used words from Merriam Webster’s Pocket Dictionary and then tested using 20,012 other words from the dictionary. The network correctly pronounced 77% of the words in the test set. Subsequent variations of this network improve the success rate to 90% (Karayiannis and Venetsanopoulos 1993). Artificial neural networks are especially useful in pattern recognition problems. Typically, a network is trained with representative samples from the domain of interest. Given a new set of data that the network hasn’t seen, the trained network identifies patterns learned from the training samples. One popular pattern recognition application is character recognition, such as handwritten character recognition. The United States Post Office uses artificial neural network technology for handwritten zip code recognition (LeCun et al. 1989). Artificial neural networks also have been used in signature analysis to detect forgeries (Mighell, Wilkinson and Goodman 1989).
Similar to pattern recognition applications, artificial neural networks can be used as classifiers. A network is trained with representative samples from the domain of interest and then used to classify new data into categories learned from the training set. One application in this area is the classification of objects from sonar signals (Gorman and Sejnowski 1988; Gorman and Sejnowski 1988). In this case, a network is trained to detect the difference between the sonar signals from different submerged objects such as a metal cylinder and a cylindrically shaped rock. Artificial neural network classification applications also have been used in the medical field for early diagnosis of heart attacks (Harrison, Marshall and Kennedy 1991) and diagnosis of the causes of lower back pain (Bounds, Lloyd and Mathew 1990; Bounds et al. 1988). In weather forecasting, artificial neural networks are trained to predict certain events based on a variety of conditions that precede the event. For example, artificial neural networks have been used to predict solar flares and lightning strikes. A network for forecasting solar flares was created that performed at least as well as a rule-based expert system (Bradshaw, Fozzard and Ceci 1989). In another example, researchers used wind, electric field, and wind divergence as inputs to an artificial neural network to predict lightning strikes at the Kennedy Space Center (Frankel et al. 1991). Artificial neural networks are beginning to be used in business applications. Their first business applications were in financial analysis. Dutta and Shekhar (1988) began by comparing an artificial neural network model to regression in prediction of bond ratings. Utans and Moody (1995) used an artificial neural network to predict corporate bond ratings using ten financial ratios that reflect fundamental operating characteristics of firms.
Artificial neural networks have been used in many other financial forecasting applications, including modeling stock returns, stock markets, gold futures, and foreign exchange markets (see Refenes 1995).

NEUROBIOLOGICAL ROOTS

The structure of artificial neural networks is biologically inspired. That is, they were developed from attempts to model the human brain. These models are based on the neuron (Figure 1). A simple neuron has four parts: (1) dendrites that carry signals into the nucleus; (2) a nucleus that processes incoming information; (3) an axon that carries the signal away from the neuron; and (4) synapses that connect the neuron to other neurons and transmit the signal forward (Hertz, Krogh and Palmer 1991). A neuron combines incoming input signals from other neurons and feeds them into its nucleus. The nucleus evaluates and transforms the input signal into an output signal. During the evaluation, if the input signal does not exceed a certain threshold value, the neuron will not fire. That is, it will not send an output signal to the next neuron. If the threshold value is reached, the neuron fires and the output signal travels through the axon and synapses to eventually become an input signal to other neurons. Essentially, the nucleus of the neuron serves as an on/off switch for its output. The human brain contains approximately ten billion neurons connected through a network of approximately sixty trillion synapses (Shepherd and Koch 1990). A typical neuron interacts with between 1,000 and 10,000 other neurons, but a single neuron could interact with as many as 200,000 neurons (Nelson and Illingworth 1991). Currently, the nature of the working mechanisms in the brain is not known for certain. It is believed that knowledge or intelligence in the brain is represented by the pattern of neuron firings. This pattern of firings is called a neuron’s activation.
A neuron that fires frequently has a high activation; a neuron that seldom fires has a low activation. The human brain can make a decision based on thousands of factors almost immediately. Consider, for example, the process of driving through heavy traffic on a highway. Changing road conditions, environmental conditions (rain, snow, wind, etc.), actions of others (drivers, bikers, pedestrians, etc.), and other factors are inputs that are processed simultaneously to make nearly instantaneous decisions concerning a number of possible actions such as steering, signaling, shifting, braking, and accelerating. Even the fastest traditional computer system processing information serially using a prespecified set of rules and instructions could not provide the reaction time necessary for many of the decisions that are made on a routine basis by the brain. A traditional serial processing computer also would have trouble dealing with incomplete input data. That is, a decision would be difficult in the presence of missing data. On the other hand, the parallel structure of processing in the brain provides a more fault tolerant system. The human brain is constantly making decisions based on incomplete data. The processing power of the brain stems from its highly connected, parallel structure. Artificial neural networks are designed to take advantage of what knowledge there is concerning the abilities and processing mechanisms of the brain. In artificial neural networks information is processed using a parallel as opposed to a serial structure. In addition to providing a powerful platform for computing, the parallel structure of neural networks also results in their robust nature. This is generally referred to as fault tolerance (Knight 1990). Because of their parallel and highly connected structure, biological neural networks tend to perform very well even when parts of the system are destroyed or degraded.
In artificial neural networks this feature manifests in the ability to produce acceptable results even when some inputs are missing.

FOCUS AND CONTRIBUTION OF THIS STUDY

Effective project screening continues to be a priority for both researchers and practitioners (Cooper 1992; Marketing Science Institute 1996). A firm’s NPD performance depends greatly on management’s ability to allocate scarce resources where they will produce the greatest benefit. The ability to screen out likely failures early and concentrate resources on projects identified as having high potential for success is an important asset to any firm. The central issue addressed in this study is developing tools that can predict new product success or failure better than existing methods. This study investigates a new methodology for predicting project success or failure - artificial neural networks. Properly constructed for generalizability, artificial neural networks are particularly useful for modeling underlying patterns in data through a learning process. They are quite useful in pattern recognition problems such as the modeling of the complex NPD process. The artificial neural networks used in this study gain their performance characteristics mainly from two features: their highly connected, parallel structure and their nonlinear hidden layer. The parallelism that is built into the structure of artificial neural networks allows for the consideration of numerous contingencies and dependencies in the relationships between independent and dependent variables. Furthermore, the nonlinearities incorporated into these networks remove the linear constraint normally imposed on many traditional statistical methods such as linear regression and structural equation modeling. Artificial neural network decision support systems are developed for new product development project selection and management.
These systems can be used by new product managers during the development process to provide indications of eventual success or failure of ongoing or proposed new product projects. Networks are created and trained with survey data to recognize patterns in the relationships between NPD performance and a variety of antecedent factors. A primary goal in constructing the systems is to ensure the ability to correctly classify unfamiliar or new projects, not used during training, as successful or unsuccessful by recognizing patterns within the data that are consistent with analogous previous projects. This property is generally referred to as network generalizability. Experienced managers recognize patterns within projects consistent with previously successful and failed projects. Practical knowledge such as this enables them to guide projects to ultimate success or kill projects doomed for failure before valuable resources are wasted. The artificial neural network decision support systems in this study use learning algorithms to approximate this practical experience. In the case of the multilayer feedforward network decision support system (MFN-DSS), each of ten artificial neural networks in the system starts with different initial weight matrices and is trained with data from 400 NPD projects to recognize patterns consistent with success and failure. Once trained, the result is a system of ten “NPD artificial brains.” Using an MFN-DSS is analogous to obtaining the insight, opinions, and advice of ten artificial brains with experience in 400 successful and unsuccessful projects, each offering different points of view based on their different learning approaches. The other type of artificial neural network employed here is the probability neural network (PNN). The PNN uses the training data to construct a probability distribution function for the output (NPD success) based on the responses indicated by the inputs.
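The PNN idea just described can be sketched in a few lines: each training pattern contributes a Gaussian kernel, and the class probabilities for a new project are normalized Parzen-window density estimates. The function name, toy data, and smoothing value `sigma` below are illustrative assumptions, not values from this study.

```python
import numpy as np

def pnn_predict(X_train, y_train, x_new, sigma=0.5):
    """Parzen-window class posteriors in the spirit of a PNN."""
    probs = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        d2 = np.sum((Xc - x_new) ** 2, axis=1)            # squared distances
        probs[c] = np.mean(np.exp(-d2 / (2 * sigma ** 2)))  # kernel average
    total = sum(probs.values())
    return {c: p / total for c, p in probs.items()}       # normalized posteriors

# Toy data: two well-separated "failure" (0) and "success" (1) clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
posterior = pnn_predict(X, y, np.array([0.95, 1.05]))
# The new point sits in the "success" cluster, so class 1 dominates.
```

As the smoothing parameter grows, the kernels overlap more and the posteriors flatten; choosing it well is the main tuning decision in a PNN (compare Figure 11 on the effect of sigma).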
This study has four main purposes: (a) to investigate whether or not decision support systems based on neural network technology can outperform traditional methods of prediction from a statistical point of view; (b) to compare the performance of neural network methods and traditional methods in the prediction of a dichotomous success variable; (c) to compare the performance of neural network methods and traditional methods in predicting continuous success variables (financial success and technical success); and (d) to demonstrate the usefulness of these neural network decision support systems via case studies. Based on the highly connected, non-linear structure of artificial neural networks and their impressive performance in other applications, it is likely they will provide a superior predictive system for the NPD process. Therefore, the following three research questions guide this study:

RQ#1: Based on statistical criteria, can artificial neural network decision support systems outperform traditional methods of prediction?

RQ#2: When predicting a project's success or failure, can artificial neural network decision support systems outperform traditional methods?

RQ#3: When predicting a project's degree of success or failure, can artificial neural network decision support systems outperform traditional methods?

This study is the first to incorporate artificial neural network data analysis into the study of the NPD process and the first to create, train, and validate decision support systems based on artificial neural networks to aid managers in complex decision making problems. Furthermore, there are only a few studies which provide predictive models of NPD success (e.g. Calantone, Schmidt and Song 1996; Song and Parry 1997). These studies of NPD success make use of structural equation models which are most useful for explanation but can also be used for prediction.
The advantages of using neural networks to create predictive models of NPD success are: (1) no assumptions regarding the underlying statistical properties of the data (e.g. multivariate normality) need to be made when using neural networks, (2) neural networks are very robust even when multicollinearity is prevalent in the data, and (3) no prior specifications of the relationships between the factors of the model need to be made with neural networks.

ORGANIZATION OF THE MANUSCRIPT

Chapter 2 begins with an historical look at research in artificial neural networks. This is followed by a list of notations used throughout the manuscript. Next, artificial neural networks are defined and described in more detail. The chapter ends with the development and explanation of a hierarchy of artificial neural networks. This hierarchy is designed to be broad enough to provide an overview of the most popular neural networks, but due to the scope of this study, does not give a detailed account of every possible architecture. Thus, Chapter 2 provides the background and tools needed to understand the concept of artificial neural networks. More detailed discussions regarding the mathematics behind different learning algorithms are found in the Technical Appendices. In Chapter 3, artificial neural network methods are compared to traditional statistical methods. This is followed by a discussion of issues in developing neural networks with good generalization properties. Next, a proposal for applications of artificial neural network techniques to new product development data to construct prediction tools usable in the management of new products is presented. These techniques are based on the multilayer feedforward network (MFN) architecture and the probability neural network (PNN) architecture.
These predictive tools are "artificial brains" that learn from experience from a large-scale, cross-national study of NPD projects and predict success or failure based on characteristics of projects not used during training. The chapter ends with a discussion of the potential contributions of this study. Chapter 4 begins with details on the data used in this study. Next, the methodologies used to create, train, and evaluate the neural networks used in this study are specified. Chapter 5 presents the results of the evaluations of the trained networks. These results are compared to traditional methods. Next, the results are discussed from both academic and practitioner points of view. A case study is then presented to further amplify the results and usefulness of the decision support systems created in this study. This chapter ends with discussions of future research directions and limitations associated with this study. Appendix A contains all Tables and Figures in this study. Appendix B is a selection of artificial neural network demonstrations as applications to data analysis. These demonstrations offer an intuitive indication of the possible uses of artificial neural networks. They also provide a "hands-on" level of knowledge concerning the learning process in artificial neural networks. In Appendix C, the backpropagation algorithm is derived and discussed. The backpropagation algorithm is the basis for the most popular class of learning algorithms. The derivation is followed by a discussion of variations to the basic backpropagation algorithm and other more advanced training methods. In addition to providing an in-depth discussion of the algorithm (backpropagation) that eventually led to the proliferation of artificial neural network research, this Appendix provides an overview of the most popular and useful algorithms for solving pattern recognition problems. Finally, a brief background on the PNN is given.
CHAPTER 2

NETWORK STRUCTURES AND TRAINING PROCESSES

This chapter serves as an artificial neural network tutorial which provides a basic understanding of the structure and learning environment of artificial neural networks. It begins with an historical background of research on neural networks. A listing of the notation that will be used throughout this study follows. Next, a definition of artificial neural networks is presented and discussed. Finally, a learning environment typology incorporating learning paradigms and learning algorithms is presented.

AN HISTORICAL LOOK AT NEURAL NETWORK RESEARCH

While a complete history of artificial neural network research is beyond the scope of this study, a brief discussion including selected milestones and contributions of influential researchers follows. Artificial neural network research began with the work of McCulloch and Pitts in the 1940's with the development of the McCulloch and Pitts (1943) neuron. Their simple neuron structure was able to compute a number of arithmetic and logical functions. A few years later, in 1949, Donald Hebb wrote his influential book The Organization of Behavior (Hebb 1949). This book contained the first explicit statement of the psychological learning rule for synaptic modification. The "Hebb rule," as it is now called, is the basis for many of the learning algorithms in use today. It was stated as follows:

When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased (Hebb 1949, p. 62).

This description of synaptic change has been extended to artificial neural networks and restated as: If two neurons on either side of a connection are activated simultaneously, then the strength of that connection is increased.
Alternatively, if one neuron on either side of a connection is activated while the neuron on the other side is inactive, the strength of that connection is decreased. In the late 1950's and early 1960's, Frank Rosenblatt (1958) made an important contribution to the artificial neural network field by introducing perceptrons and the perceptron learning rule. Perceptrons are input layers connected by paths with adjustable weights to an output layer of neurons. The perceptron learning rule is one of the original learning algorithms that uses an iterative process to train a neural network and lays the foundation for later work on learning algorithms. Widrow and Hoff (1960) developed a learning algorithm that is similar to the perceptron learning rule, but was the first algorithm to incorporate the prediction error of the network as part of the weight adjustment. The Widrow-Hoff learning rule is also known as the delta rule or the least mean square (LMS) rule. Widrow and Hoff used the delta rule to train networks called ADALINES (ADaptive LInear NEurons) and MADALINES (Multiple ADALINEs). ADALINES and MADALINES are networks with multiple input nodes connected by paths with adjustable weights to single or multiple processing nodes. Among other problems, these networks were used in rotation invariant pattern recognition and control problems such as broom balancing and backing up a truck (Fausett 1994). Widrow also established the first neurocomputer hardware company in the 1960's (Hecht-Nielson 1990). In 1969, Minsky and Papert proved mathematically that a perceptron could not learn the exclusive-or (XOR) logical function. The XOR function returns a value of true (1) if exactly one of the input values is true (1) and false (0) otherwise. The possibilities for a network with two inputs (x1, x2) and one output (y) follow:

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

Perceptrons could not learn this logical function because the solution space is not linearly separable.
That is, no single straight line can separate the points for which the desired response is 1 from the points for which the desired response is 0. See Figure 2. Minsky and Papert's book, Perceptrons (Minsky and Papert 1969), was a major factor in the decline of artificial neural network research throughout the 1970's. Before the book, there was a tremendous amount of enthusiasm surrounding artificial neural networks. Wild claims were being made concerning artificial neural networks performing tasks that humans ordinarily perform and that artificial brains soon would be created that would be able to "think" and "reason" in much the same way as humans (see discussion in Hecht-Nielson 1990). When Minsky and Papert (1969) revealed the flaws in perceptrons, many researchers lost interest and diverted into artificial intelligence research. Another contributing factor in the declining interest was the lack of computing power (Hecht-Nielson 1990). Since learning algorithms for artificial neural networks can be relatively time consuming, involving many iterations through the data, many interesting problems were computationally prohibited from being solved with neural networks. This changed later after the development of the backpropagation algorithm (which solved the XOR problem) and faster computers (which eased the computational restrictions) (Bishop 1995). Although little work was done on neural networks in the 1970's, two researchers stand out. Kohonen (1972) pioneered work in self-organizing feature maps. Feature maps have a variety of applications such as speech recognition (Kohonen 1988) and musical composition (Kohonen 1989). Also of note was the work of James Anderson. In 1977, along with others, Anderson developed a network known as the Brain-state-in-a-box (Anderson et al. 1977). This network was trained to make medical diagnoses and learn multiplication tables.
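Returning to the perceptron learning rule discussed above: applied to a linearly separable function such as logical AND it converges, while on XOR the same loop would cycle indefinitely without finding correct weights. A minimal sketch, in which the step output, learning rate, and epoch count are illustrative assumptions:

```python
def perceptron_train(samples, epochs=20, alpha=1.0):
    # Perceptron learning rule: weights change only on misclassified inputs
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), t in samples:
            y = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            if y != t:
                w[0] += alpha * (t - y) * x1
                w[1] += alpha * (t - y) * x2
                b += alpha * (t - y)
    return w, b

# AND is linearly separable, so the rule converges; on the XOR table
# shown earlier it never would
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = perceptron_train(AND)
predict = lambda x: 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
```

After training, the learned line w·x + b = 0 separates the single "on" case of AND from the other three inputs.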
A resurgence of research began in the 1980's with the discovery of the generalized delta rule or backpropagation learning algorithm. Backpropagation (in a number of forms) is probably the most widely used algorithm for training artificial neural networks. It was independently discovered by Parker (Parker 1985) and by LeCun (1986). In fact, once the backpropagation algorithm gained acceptance, it was found that Werbos had first described the method in his 1974 Ph.D. dissertation (Werbos 1974). The backpropagation algorithm is a gradient descent method of minimizing the total squared error of the network's output using first-order derivative information. This generalization of the earlier delta rule allows for the training of multiple layered networks. Multiple layered networks are much more complex than simple perceptrons and can therefore be used to solve more complex problems. Further research was spurred by the encouraging writings of the physicist John Hopfield and funding from the Defense Advanced Research Projects Agency (DARPA) (Hecht-Nielson 1990). The input from Hopfield and his distinguished colleagues excited many researchers and gave credibility to the field after many years of dormancy. The influx of funding from DARPA provided the resources necessary for those projects gaining interest. In summation, artificial neural network research has progressed considerably in the past fifty years, with most of the growth occurring in the past fifteen to twenty years. Significant improvements in both structures and learning algorithms have led to the development of powerful networks that have a diverse range of applications.

NEURAL NETWORK NOTATION

Unfortunately, notation used in artificial neural network research is not always consistent. One of the notational differences resides in the order of subscripts in weight variables; some authors reverse the order of subscripts.
The following notation is adopted from Fausett (1994) and will be used consistently throughout this manuscript. Other notations, specific to different algorithms, are defined within the text.

x_i, y_j   Activations of units X_i, Y_j, respectively: for input units X_i, x_i = input signal; for other units Y_j, y_j = f(y_in_j).
y_in_j     Net input to unit Y_j: y_in_j = b_j + Σ_i x_i w_ij.
z_in_j     Net input to hidden unit Z_j.
w_ij       Weight on connection from unit X_i to unit Y_j.
b_j        Bias on unit Y_j.
W          Weight matrix: W = {w_ij}.
w_.j       Vector of weights: w_.j = (w_1j, w_2j, w_3j, ..., w_nj).
s          Training input vector: s = (s_1, s_2, s_3, ..., s_n).
t          Training (or target) output vector: t = (t_1, t_2, t_3, ..., t_n).
x          Input vector (for the net to classify or respond to): x = (x_1, x_2, x_3, ..., x_n).
Δw_ij      Change in w_ij: Δw_ij = [w_ij(new) - w_ij(old)].
α          Learning rate parameter.

These notations are explained more deeply in succeeding sections.

NEURAL NETWORKS DEFINED

Just as notation differs from author to author, the definition of an artificial neural network can differ from author to author. This study adopts the following broad definition: An artificial neural network is a grouping of simple processing units connected by weighted, directed paths which are adjusted through a learning process. It is important to note where artificial neural networks gain the "artificial" part of their name. As stated earlier, it is not known exactly how the brain works. That is, it isn't known how synapses, dendrites, the nucleus, axons, the firing patterns of neurons, and other brain mechanisms really work together to create intelligence. It is also not known what types of special activation functions exist in neurons. But when current knowledge concerning the brain's processing is modeled through connections, nodes (processing units), activations, layers, etc., a powerful computational system is discovered.
Therefore, it is important to realize that artificial neural networks are inspired by what has been learned from biology, but don't necessarily represent biological reality. This should not deter researchers and practitioners from using artificial neural networks. What they gain from their adopted structure and processing methods can provide many benefits over conventional serial processing. Throughout the remainder of this manuscript the general convention of dropping "artificial" from the title will be followed and artificial neural networks will be referred to simply as neural networks, networks, or nets.

There are eight aspects common to all neural networks (Rumelhart, Hinton and McClelland 1988):

- A set of processing units.
- A state of activation for each unit.
- An output function for each unit.
- A pattern of connectivity among units.
- A propagation rule for propagating patterns of activities through the network connections.
- An activation function that combines the signals entering a unit with the current state of that unit to produce a new level of activation for the unit.
- A learning algorithm whereby connection weights are modified through experience.
- A learning paradigm within which the system operates.

Each of these aspects will be discussed in the following sections.

Processing Units

The basic building block of a neural network is the processing unit (Figure 3), also known as a perceptron, node, or unit. In this manuscript these terms are used interchangeably. While some units (inputs and outputs) do represent specific constructs or variables, others (hidden units) do not have an assignable label or meaning. This aspect of neural networks is different from other modeling methods such as regression and structural equations where all units in the model represent or are assigned to specific constructs. Currently, in neural networks, the focus is on the inputs and outputs for explanation while other units are simply computational.
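A processing unit's computation can be sketched in a few lines: it forms a net input from weighted incoming signals plus a bias, then passes that through an activation function (both are detailed in the sections that follow). The logistic and bipolar sigmoid shown here, and the toy weights, are illustrative choices rather than prescriptions:

```python
import math

def logistic(x):
    # Logistic activation, range (0, +1)
    return 1.0 / (1.0 + math.exp(-x))

def bipolar_sigmoid(x):
    # Bipolar sigmoid, range (-1, +1); algebraically 2*logistic(x) - 1
    return (1.0 - math.exp(-x)) / (1.0 + math.exp(-x))

def unit_output(x, w, b, f=logistic):
    # Net input y_in = b + sum_i x_i * w_i, then activation f(y_in)
    y_in = b + sum(xi * wi for xi, wi in zip(x, w))
    return f(y_in)

out = unit_output([0.5, -0.3, 0.8], [0.2, 0.4, -0.1], b=0.1)
```

Because the default activation is the logistic, the unit's output always lies strictly between 0 and 1 regardless of the incoming signals.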
State of Activation

Next, the incoming signal, y_in_j (including weighted inputs from other nodes and the bias), is evaluated by an activation function, f(y_in_j). The output of the activation function determines the state of activation of the processing unit at a specific time.

Output Function

The output function of a processing unit determines the signal that is to be passed to other units in the network. In most cases, the output function is the identity function and the output of the unit at any time is equal to its activation at that time. In some cases, though, the output function is binary, bipolar, or a nonlinear function similar to the activation functions within nodes.

Pattern of Connectivity

Neural networks are made up of highly connected parallel processing units or nodes. These nodes are generally arranged in layers in the network. Technically, networks must have a minimum of two layers (input and output), but can have any number of additional layers. Defining the number of layers in a network often becomes a source of confusion. In this study, a network with an input layer and an output layer is referred to as a single layer network. At first glance this may appear counterintuitive, but the convention is to only enumerate layers where processing occurs. Since the input layer doesn't perform any processing, it is not counted in the number of layers in the network. Hidden layers are layers of nodes located between the input and output layers. Units can be connected either in a feedforward or feedback system. In a feedforward system (Figure 4), units are only connected to units lying in higher layers. Signals are transferred from the input layer nodes to hidden layer nodes and from hidden layer nodes to output layer nodes. A fully connected feedforward network is a special case of neural networks that is often used. In fully connected feedforward networks each node is connected to every node in the next higher layer.
Figure 4 is a fully connected network, but feedforward networks in general are not required to be fully connected. A shortcut notation for feedforward structures is to specify the number of nodes in each successive layer separated by slashes. Thus, a fully connected, feedforward network with six inputs, nine nodes in the first hidden layer, five nodes in the second hidden layer, and three output nodes would be a three layer network with a 6/9/5/3 structure. The network in Figure 4 is a two layer, 3/2/2 network. In a recurrent system (Figure 5), nodes may connect to any node in any layer. Feedback loops may connect to the same node, nodes in the same layer, or nodes in previous layers. The feedback loops create a special challenge in the training process, but the added complexity allows for the solving of more complex problems. In general, the greater the complexity of the network structure, the greater the complexity of the problems that it can be used to solve. Complexity in the network structure depends on a number of structural properties such as: the number of nodes, the number of processing layers, the type of activation functions in each node, and the type of connectivity between nodes.

Propagation Rule

The propagation rule specifies the manner in which the input signals are combined in a processing unit. While other propagation rules exist, all of the networks in this study will use the inner product rule such that the signal to be evaluated will be the inner product of the vector of activation values and the weight vector. In the processing unit, input signals from other processing units and a bias unit are linearly combined to form a single signal to be processed. Specifically, this signal is most often formed as a weighted sum of the incoming signals and the bias. Each incoming signal is multiplied by the weight associated with the connection between the two units.
The weighted bias and the weighted input signals are then summed to provide the input to a unit:

y_in_j = b_j + Σ_i x_i w_ij.

The biological roots of the bias stem from the threshold value in a biological neuron. Most often it is included in a network as another input signal with a fixed activation value of 1.0 and a variable weight. Thus, if the weight associated with a neuron's bias is negative, it serves as the threshold value that must be exceeded for the neuron to provide a positive output. Alternatively, a threshold node can be incorporated into a neural network in place of a bias. A threshold unit is simply a node with an activation of -1 connected to other nodes by an adjustable weighted connection. Mathematically, the bias and threshold are identical except that their signs are opposite; nevertheless, their impact on a neural network is the same. In this study the bias term is used for sake of consistency. For demonstration purposes, the bias is shown as an input separate from the incoming input signals from other nodes. Networks can be constructed without biases, but the bias plays an important role in defining the decision boundaries of the network. Without a bias term, each separating line (surface) of the decision space would have to pass through the origin. The bias term allows the separating line (surface) to detach itself from the origin and provides an increased level of power to the network (Fausett 1994).

Activation Function

Activation functions can take many forms. Four representative functions are shown in Figure 6. A hard limiter function can take on only two values. Typically the function takes on either bipolar values (-1 and +1) or binary values (0 and +1). The hard limiter function is useful when the outputs to be modeled are discrete. Ramping functions are similar to hard limiters, except they linearly model the input over a specified range. In Figure 6, the ramping activation function mirrors the inputs in the interval (0,+1).
Outside this range, the ramping function takes on values of 0 or +1. Sigmoid functions accomplish the same type of representation as the ramping function using a smooth, nonlinear curve. The two types of sigmoid functions commonly used in neural networks are the logistic function,

f(x) = 1 / (1 + exp(-x)),

and the hyperbolic tangent function,

f(x) = (1 - exp(-x)) / (1 + exp(-x)).

The basic logistic function ranges from 0 to +1 while the hyperbolic tangent function ranges from -1 to +1. A common transformation of the logistic function,

f(x) = 2 / (1 + exp(-x)) - 1,

which ranges from -1 to +1, is also used. The logistic and hyperbolic tangent functions are continuously differentiable, which is important in calculating weight changes during the learning process. This point is discussed in further detail in the section on learning algorithms. Each of these activation functions typically ranges from -1 to +1 or from 0 to +1. The hard limiter function is useful when outputs are binary or bipolar. The ramping and sigmoid functions are useful when the outputs to be modeled are continuous, but no specific rules exist which define the activation function that is most appropriate. Generally, the more complex the relationships to be modeled, the more complex the activation functions need to be. Also, many networks incorporate linear activation functions in output nodes where the output of the network is not constrained to either (-1,+1) or (0,+1). The output of a node with a linear activation function is simply the sum of the weighted inputs and bias. Most neural networks use some form of sigmoid function in the hidden layer(s) and either sigmoid or linear functions in the output layer.

Learning Algorithm

A learning algorithm is a set of rules which govern the weight changes in the learning process. Knowledge or information about relationships between inputs and outputs is stored in the weight matrix of neural networks.
Processing units store short term information concerning the state of each unit at a point in time, but the weights for each connection and the pattern of connections are where long term information is stored (Rumelhart, Hinton and McClelland 1988). The values of weights in a neural network are adapted through a learning process. Haykin (1994) defines learning as follows: Learning is a process by which the free parameters of a neural network are adapted through a continuing process of stimulation by the environment in which the network is embedded. The type of learning is determined by the manner in which the parameter changes take place.

Learning Paradigm

A learning paradigm is the environment that the learning algorithm works within during the learning process. The environment defines specific attributes of the learning process such as when weights are updated, when the process terminates, and how the learning algorithm is applied to the network.

A LEARNING PROCESS HIERARCHY FOR NEURAL NETWORKS

Figure 7 graphically illustrates a learning process hierarchy. Various frameworks have been developed that capture and categorize neural networks (e.g. Haykin 1994; Hinton 1989; Hush and Horne 1993; Knight 1990; Lippmann 1987; Rumelhart, Hinton and McClelland 1988). The focus of this study is on developing neural network applications useful in the study of the NPD process. Therefore, while there may exist exotic and unique learning algorithms and learning paradigms that cannot be categorized within the hierarchy shown in Figure 7, they are beyond the scope of this study. Instead, a broad and comprehensive foundation of neural networks, based on Haykin (1994), is presented. Most of the popular learning rules and learning paradigms have their roots in this hierarchy. Numerous extensions and adaptations to the basic architectures exist and are still being developed. Hundreds
of learning rules have been published in relevant neural network journals in the past few years. Unfortunately, once they are published, little work is done to compare them to each other to determine specific advantages and disadvantages of each method (Prechelt 1996). Many of these algorithms are developed either to solve very specific problems or to provide linkages between artificial neural networks and biological neural networks. Since the focus of this study is to develop neural network applications useful in studying the NPD process, an in-depth analysis of the nuances associated with each of the numerous extensions is not the goal. Instead, well established and commonly used methods and structures will be incorporated into the applications studied here.

Supervised Learning Paradigm

Supervised learning is also known as associative learning. There are two types of associative learning: autoassociative and heteroassociative (Fausett 1994; Haykin 1994). In autoassociative learning the input vector is identical to the output vector in the training process. Therefore, the number of inputs equals the number of outputs. Autoassociative networks will be discussed in more detail in the section which compares neural networks to traditional statistical methods. Heteroassociative learning involves the mapping of an input vector on to a different output vector. In heteroassociative learning the number of inputs and outputs may or may not be equal. In supervised learning, a target vector is available which defines the desired output(s) of the network for a given input vector. A learning algorithm is then used to adapt the weights such that the desired output(s) are reproduced when the input vector is propagated through the network. Weights are adjusted iteratively according to the chosen learning rule as training data propagates through the network.
Each weight change is called an iteration and each pass through the training data is called an epoch. Note that a weight change can occur either after the presentation of a single input vector or after an entire epoch. On-line learning is when an iteration is taken after each vector presentation; batch learning is when the weight updates occur after each epoch (Ripley 1996). Except for very simple demonstration examples, training a network involves a number of epochs. Weights are adjusted until the network converges on some stopping rule specified either in the learning rule or by the user. Generally, training terminates either when the output error of the network reaches a prespecified value or after a certain number of epochs. Supervised learning can be either static or dynamic. In static models the network is trained using a training data set. After training is complete the weights are held constant while other input vectors are propagated through the network to yield network output values for each input vector. In dynamic models, the network may be trained with a training data set, but the weights are not held constant and learning continues as new input vectors are introduced to the network. Static learning is sometimes referred to as off-line learning and dynamic learning is called on-line learning (Haykin 1994), but this can lead to confusion with the on-line and batch terms. In this study, the static and dynamic terms are used and on-line refers to the weight update scheme.

Unsupervised Learning Paradigm

Unsupervised learning is performed in the absence of a desired or target output vector. Only input values are supplied in the unsupervised training process. Without output values to provide the network with an error or deviation from a desired value, the network undergoes self-organization. Through repeated training iterations, input nodes that are similar in activation form clusters in the output nodes.
For this reason, networks that incorporate unsupervised learning are often called self-organizing systems. The network learns to respond to patterns or clusters in the training data without any a priori specification of output classes or categories. There are a variety of applications in NPD for both supervised and unsupervised learning paradigms. For example, unsupervised neural networks could be used to discover clusters of both independent and dependent variables as a data reduction technique similar to exploratory factor analysis. From these clusters, a supervised neural network could be built to discover the relationships between the clusters of independent (input) variables and dependent (output) variables. Independent variables could include factors that are thought to be related to successful development and dependent variables could be various measures of NPD success or failure. Once trained, the supervised network could be used to analyze the relationships between inputs and outputs and to predict outcomes based on inputs from NPD projects to which the network has not yet been exposed.

Hebbian Learning Algorithm

Most learning algorithms have roots that can be traced to the Hebbian rule. Donald Hebb introduced a simple learning rule that has strong neurobiological support (Brown, Kairiss and Keenan 1990; Kelso, Ganong and Brown 1986). The rule is based on Hebb's description of learning presented earlier in the historical section of this chapter. Perhaps the easiest way to understand the Hebb rule is to consider a network with binary nodes. Nodes with an activation of 0 are considered to be off while nodes with an activation of 1 are considered to be on. The Hebb rule states that if two connected units are both on, then the weight associated with the connection between the nodes should be strengthened. Mathematically, the Hebb rule can be stated in terms of the weight update formula:

w_i(new) = w_i(old) + x_i y_i.
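For binary units, the update formula above reduces to a few lines of code; the three-input example below is illustrative:

```python
def hebb_update(w, x, y):
    # Hebb rule: w_i(new) = w_i(old) + x_i * y
    return [wi + xi * y for wi, xi in zip(w, x)]

# Binary units: weights change only when the input unit and the
# output unit are both "on" (activation 1)
w = [0.0, 0.0, 0.0]
w = hebb_update(w, [1, 0, 1], 1)   # connections from active inputs strengthen
w = hebb_update(w, [1, 1, 0], 0)   # output unit off: no learning occurs
```

After the two presentations, only the weights whose input and output were simultaneously active have grown.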
Note that a weight change (network learning) takes place only when both units have activations of 1.

The Hebb rule is not often used in modern neural network applications, but it has served as a launching point for a variety of algorithm variations. These new variants are much more powerful and have overcome many of the shortcomings of the original Hebb rule. Nonetheless, the simplicity of the Hebb rule has led to its common use as a pedagogical tool and it is often used in examples to demonstrate the learning process in neural networks. See Appendix B for a Hebb net demonstration.

Competitive Learning Algorithms

Competitive learning algorithms evolved out of the Hebb rule. The three basic components of a competitive learning algorithm are: (1) start with a set of units that are all the same except for some randomly distributed parameter which makes each of them respond slightly differently to a set of input patterns; (2) limit the "strength" of each unit; and (3) allow the units to compete in some way for the right to respond to (learn from) a given subset of inputs (Rumelhart and Zipser 1985).

In the learning phase of competitive neural networks, output units compete for the opportunity to become the active unit. The active unit, or winner, determines the weights to be adapted in each iteration of the learning process. Once the network is trained, the outputs become feature detectors and recognize patterns in the data. Neural networks trained with a competitive learning algorithm are often used to map patterns in data. Typically, competitive neural networks are used for clustering, vector quantization, dimensionality reduction, and feature extraction (Krose and Smagt 1993). Popular classes of neural networks based on competition include the Kohonen self-organizing feature map (SOFM), learning vector quantization (LVQ), and counterpropagation (Fausett 1994). Appendix B includes a detailed demonstration of a Kohonen SOFM.
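A minimal winner-take-all sketch makes the competitive process concrete. The data, learning rate, and distance measure below are hypothetical choices for illustration: two output units start with slightly different weights, compete for each input, and only the winner's weights move toward that input.

```python
# Sketch: winner-take-all competitive learning (hypothetical example).
# Each output unit holds a weight vector; units compete for each input,
# and only the winning unit's weights are adapted toward that input.

def winner(units, x):
    # Squared Euclidean distance decides the competition.
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in units]
    return dists.index(min(dists))

def compete_step(units, x, alpha=0.5):
    j = winner(units, x)
    units[j] = [wi + alpha * (xi - wi) for wi, xi in zip(units[j], x)]
    return units

# The units start nearly identical except for a small "randomly
# distributed" difference, then learn two clusters of inputs.
units = [[0.1, 0.0], [0.0, 0.1]]
for _ in range(20):
    units = compete_step(units, [1.0, 0.0])
    units = compete_step(units, [0.0, 1.0])
```

After training, each unit has become a feature detector for one of the two input patterns.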
Error-correction Learning Algorithms

Probably the most popular class of learning algorithms is the error-correction algorithm. Error-correcting algorithms are based on a gradient descent learning algorithm originally called the ADALINE learning rule by Widrow and Hoff (1960). They were the first to develop a learning rule that incorporated an error term. It is also known as the delta rule, the Widrow-Hoff rule, or the LMS (least mean square) rule (Hertz, Krogh and Palmer 1991).

The goal in error-correcting algorithms is to minimize a cost function based on an error signal of the network (Haykin 1994). The error signal of the network, e_k, is almost always defined as the difference between the desired output of the network, t_k, and the actual output of the network, y_k. Thus, e_k = t_k - y_k, where k denotes the index of the output units. The cost function is usually the sum of squared errors of the network, E = (1/2) * Σ_k e_k^2. Error-correcting algorithms use gradient descent to minimize the cost function. The change in the weights according to this method is expressed as Δw_kj = α * e_k * x_j, where α is a constant representing the learning rate and x_j is the activation of input unit j. Thus, the weight change is proportional to the product of the error at each output node and the activation at the corresponding input node.

The backpropagation algorithm is an extension of the delta rule and is the most popular and useful learning rule. Many variations of the backpropagation algorithm have been developed which increase the speed with which it converges, increase its resistance to stopping at local minima, or improve various operating characteristics of the algorithm. Some of these variations are discussed in more detail in Appendix C.

Neural networks are created by combining algorithms, structures, and environments to meet the needs of specific problems. In general, this combination is made based upon characteristics of the data.
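The delta rule's effect on the cost function can be sketched directly: one gradient-descent step lowers E = (1/2) * Σ_k e_k^2. The weight matrix, input vector, targets, and learning rate below are hypothetical values chosen for illustration.

```python
# Sketch: one delta-rule (LMS) step lowers the sum-of-squared-error
# cost E = 1/2 * sum_k (t_k - y_k)^2. All values are hypothetical.

def outputs(W, x):
    # One linear output unit per row of the weight matrix W.
    return [sum(wkj * xj for wkj, xj in zip(row, x)) for row in W]

def cost(W, x, t):
    return 0.5 * sum((tk - yk) ** 2 for tk, yk in zip(t, outputs(W, x)))

def delta_step(W, x, t, alpha=0.1):
    # Delta rule: the change in w_kj is alpha * e_k * x_j.
    y = outputs(W, x)
    e = [tk - yk for tk, yk in zip(t, y)]
    return [[wkj + alpha * ek * xj for wkj, xj in zip(row, x)]
            for row, ek in zip(W, e)]

W = [[0.0, 0.0], [0.0, 0.0]]
x, t = [1.0, 2.0], [1.0, -1.0]
before = cost(W, x, t)
W = delta_step(W, x, t)
after = cost(W, x, t)
```

A single step here cuts the cost from 1.0 to 0.25; repeated steps continue the descent toward a minimum.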
When information is not known concerning desired or target outputs of the network, unsupervised learning algorithms are generally chosen; otherwise, error-correcting supervised learning algorithms make use of the known outputs. A common first step in designing a network is to choose an algorithm or family of algorithms based on the existence of target values. Next, the structure and environment are chosen based on trial and error, experimentation, or some form of cross-validation.

For example, a researcher wanting to reduce the dimensionality of a data set may choose to create an unsupervised, self-organizing network that incorporates some form of competitive learning. Once this task is completed, the researcher may want to create a network that classifies samples based on some known criteria. A supervised network could then be built which uses the outputs from the previous network as inputs. The outputs would be the categories of the known criteria. This network could be solved using a variety of error-correcting algorithms. In summary, a researcher using neural networks to solve problems is armed with a toolbox full of tools. One of the tasks the researcher faces is choosing the right tools for the job. These tools are discussed in further detail throughout the rest of the manuscript.

DISCUSSION

There are three basic decisions that need to be made when designing a neural network. The network must adopt a specific: (1) structure; (2) learning paradigm (supervised or unsupervised); and (3) learning algorithm. Taken together, a network that adopts a certain structure, learning paradigm, and learning algorithm is said to belong to a specific class of neural networks or have a specific architecture. For example, multiple-layer feedforward networks (MFNs) trained using supervised backpropagation are a specific class of neural networks and are said to have a specific architecture.
Probability neural networks represent another architecture which is similar in structure to MFNs, but quite different in the way in which the structure is created and the learning paradigm which is adopted.

Structure in neural networks is defined by the number of layers in the model, the number of nodes in each layer, the activation functions at each node, and the pattern of connections between nodes (e.g., feedforward or recurrent). The number of input nodes is simply the number of input variables and the number of output nodes is the number of output variables to be modeled. The number of hidden layers and number of nodes per hidden layer are not as easy to determine a priori. Typically the number of hidden nodes and number of hidden layers is determined through trial and error, experimentation, or some form of validation procedure. The pattern of connections between nodes refers to whether the network is strictly feedforward or contains recurrent connections. The connection pattern may be fully connected or partially connected. In general, the more complicated the underlying relationships to be modeled, the more complicated the network must be to represent those relationships.

The network must adopt either a supervised or unsupervised learning paradigm. If information is available concerning the actual or desired values of outputs, a supervised learning paradigm is likely to be used. If this information is not available, an unsupervised paradigm must be chosen.

A great number of learning rules are available for training neural networks, but they all perform the same basic function of adjusting the weights corresponding to connections between nodes. Selection of an algorithm is dependent upon the demands of the particular problem. Each algorithm has certain characteristics (advantages and disadvantages) that must be matched to the characteristics of the problem.
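A fully connected feedforward structure can be described compactly by its layer sizes. The sketch below uses the 43-input, 5-hidden-node, 1-output structure discussed later in this study; the logistic activations, bias weights, and random initialization are assumptions made for illustration, not a specification of the system built here.

```python
import math
import random

# Sketch: a fully connected multilayer feedforward structure defined
# by its layer sizes (43 inputs, 5 hidden nodes, 1 output). Logistic
# activations and per-node bias weights are assumed for illustration.

def make_weights(sizes, seed=0):
    rng = random.Random(seed)
    # One weight matrix per layer transition; each node's row carries
    # one weight per incoming connection plus a bias weight.
    return [[[rng.uniform(-0.5, 0.5) for _ in range(sizes[i] + 1)]
             for _ in range(sizes[i + 1])]
            for i in range(len(sizes) - 1)]

def forward(weights, x):
    a = list(x)
    for layer in weights:
        # Logistic activation of the weighted sum (bias input fixed at 1).
        a = [1.0 / (1.0 + math.exp(-sum(w * v
                                        for w, v in zip(row, a + [1.0]))))
             for row in layer]
    return a

sizes = [43, 5, 1]
net = make_weights(sizes)
y = forward(net, [0.0] * 43)   # one output value in (0, 1)
```

Changing the `sizes` list is all that is needed to explore alternative hidden-layer configurations during trial-and-error structure selection.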
Finally, there are two important features that distinguish neural networks from other methods of data analysis: parallelism and distributed memory (Rumelhart, Hinton and Williams 1986). In general, the parallel nature of processing in neural networks gives them the ability to deal with noisy data better than traditional methods. For example, networks designed to recognize printed digits from video inputs (optical character recognition) can recognize ill-formed, rotated, or handwritten digits that are similar to the perfectly formed digits that were used in network training.

The distributed nature of memory in artificial neural networks refers to the way knowledge is represented throughout the network. Individual hidden nodes are normally not interpretable as specific causes or factors in relationships between input and output nodes. Instead, the overall pattern of the weights associated with connections between nodes represents the relationships between inputs and outputs. Knowledge is distributed in a parallel sense throughout the network through the pattern of weights which are adapted to fit the data in the training phase (Rumelhart, Hinton and McClelland 1988). These two features of neural networks (parallelism and distributed memory) are the driving forces behind the power and speed of neural networks. They provide neural networks their speed, fault tolerance, pattern recognition abilities, and generalizability.

Chapter 3

STATISTICS, GENERALIZABILITY, AND A RESEARCH PROPOSAL

This chapter begins with a number of comparisons between neural network and traditional statistical methods of discrimination, classification, and prediction. Next, important issues in controlling generalizability in neural networks are discussed. Finally, a proposal for applications of neural networks to NPD data analysis is presented.
NEURAL NETWORKS AND STATISTICS

Neural network research has benefited from multidisciplinary research efforts as contributions have come from cognitive science, neurobiology, engineering, computer science, etc. For the most part, though, the focus has been practical in nature. Networks have been built and algorithms have evolved to solve practical problems mostly in pattern recognition and prediction. The main concern in this area is in correct classification and accurate forecasts. Existing structures and learning algorithms are modified and adapted to best meet the specific circumstances of the current problem. When using neural networks for pattern recognition, the goal is to build a network that learns from data in such a way that it is useful when exposed to new data. No a priori knowledge regarding the relationships between independent and dependent variables is assumed. A network with this ability to properly predict future values or properly classify new data is said to have good generalization.

On the other hand, statisticians generally seek to build a model of the relationships between independent and dependent variables from an a priori knowledge of these relationships. Their focus is on finding the model that best represents the phenomena they are studying. Thus, in general, statisticians take more of a macro approach to problems while neural network research has been motivated from a micro approach.

While there are differences in the purposes and driving forces behind the use of neural networks and traditional statistical methods, these tools have many underlying similarities. Recently, statisticians (Cheng and Titterington 1994; Ripley 1993, 1994) and econometricians (Kuan and White 1994) have taken interest in neural network models. This interest and further research is likely to benefit all three research streams.
Statisticians and econometricians get a new set of tools to add to their toolkit and neural network researchers may make important breakthroughs as a more theoretical approach is taken. For example, neural network users have struggled with issues such as: establishing appropriate network complexity, monitoring and controlling the learning process to inhibit under-learning and over-learning, and determining theoretical limits of networks such as lower bounds on network error from characteristics of the data set. Rigorous theoretical analysis of neural network performance, structure, and learning by statisticians and econometricians may provide answers to these and other important questions.

Recent review articles have highlighted many of the similarities and differences between neural network models and traditional statistical methods (Cheng and Titterington 1994; Kuan and White 1994; Ripley 1993, 1994). Most traditional methods have a neural network counterpart which can be considered either a duplicate or a variant of the statistical technique. While an exhaustive comparison between statistical and neural network methods is beyond the scope of this study, it is informative to understand the statistical roots of neural networks. Familiar statistical methods that are used in classification and discrimination problems are the focus of this comparison. Table 1 lists the statistical model, its neural network variant, and relevant citations.

For example, consider three statistical methods which are commonly used: principal components analysis, linear regression, and logistic regression. Principal components analysis is used as a tool for dimensionality reduction. Orthogonal factors are estimated such that linear composites of the original variables can be used in further analyses. Principal components can be extracted using a variety of neural network methods (Bishop 1995; Ripley 1996).
One method for determining the principal components of a given set of variables is to create a network with an input layer, a single hidden layer, and an output layer, with linear activation functions in both the hidden and output nodes. To extract m components from a set of n variables (m < n), the network consists of n