A CONCEPTUAL FRAMEWORK FOR RESILIENCE ENGINEERING IN CONSTRUCTION SAFETY

By

Don Wallace Schafer II

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Construction Management - Doctor of Philosophy

2014

ABSTRACT

A CONCEPTUAL FRAMEWORK FOR RESILIENCE ENGINEERING IN CONSTRUCTION SAFETY

By Don Wallace Schafer II

Poor safety performance is a chronic problem that plagues the U.S. construction industry. Some researchers contend that accidents are the result of disruptions. Disruptions are those happenings that interrupt, or disturb, the normal course of work. The overarching goal of this study is to explore schemes and methods to understand, harness, and foresee disturbances that arise from demands placed on the construction operations of project-based organizations that deliver the built environment. This study examines the emerging paradigm of Resilience Engineering (RE) as a means to avoid, mitigate, and recover from disruptions, considering these as the bellwether of untoward happenings on construction projects that endanger worker safety. The ways that RE differs from traditional thinking about dealing with disruptions, the current principles and practices of RE, the elements of RE useful to construction projects, and the simulation of RE were the critical elements addressed in this work.

RE is an emerging discipline that has been described as a paradigm shift about safety. The RE approach recognizes the need for a progressive safety plan that is systems-based, sociotechnical in its outlook, views work as a complex activity, and is concurrently reactive and proactive in its vigilance to prevent accidents. In the RE outlook, safety is a core value of the organization. RE looks for ways to balance the tensions among ongoing production and economic pressures and safety, recognizing the need to pull back or stifle production when safety is threatened. A basic premise of RE maintains that adjustments are needed in systems given that performance conditions are always underspecified. To be considered resilient, a system must possess the abilities of anticipation, monitoring, response, and learning.

This work includes an extensive literature review, development of a framework that describes how RE might be deployed in a construction setting, and a hybrid computer simulation employing agent-based and discrete event modeling that demonstrates the RE principle of the Efficiency-Thoroughness Trade-Off (ETTO). The literature review revealed that RE has a rich history that is solidly built upon and expands previous approaches to system disruption in general and safety management in particular. The framework developed drew upon the literature review and focused on translating RE premises and principles to the construction industry. The main thesis of the work posits that the key features needed for construction companies to act resiliently are for executives to consider resilience as a quality of the system, to consider RE as a definitive positional strategy, to develop a “just” culture to support RE implementation, and to view construction systems functionally as opposed to structurally. Additionally, the framework offers guidance and instruction with regard to the essential abilities of anticipation, monitoring, response, and learning. Finally, the hybrid computer simulation proved to be a worthy exemplar of the possibility for agents with resilient behaviors to populate and act in a simulated discrete-event production setting beset with disruptions.
Given that RE is an emerging paradigm, hybrid computer simulation may provide a useful tool for researchers to scaffold as this concept is carried forward. Copyright by DON WALLACE SCHAFER II 2014 TABLE OF CONTENTS LIST OF TABLES .......................................................................................................... vii LIST OF FIGURES ....................................................................................................................... ix Chapter 1: Introduction ................................................................................................................... 1 1.1 Background ................................................................................................................ 1 1.2 Defining Characteristics of the Construction Industry ...................................................... 4 1.2.1 Construction is Complex ........................................................................................... 4 1.2.2. Construction is Uncertain and Underspecified ............................................................. 4 1.2.3 Construction is Quick and Demanding ........................................................................ 6 1.3 The Construction Safety Problem .................................................................................. 7 1.4 A Brief Introduction to Resilience Engineering ............................................................. 10 1.5 Goals, Research Question, Objectives, and Method ....................................................... 11 1.5.1 Goal...................................................................................................................... 11 1.5.2 Research Questions ................................................................................................ 12 1.5.3 Objectives ............................................................................................................. 12 1.6 Chapter Summary ..................................................................................................... 13 Chapter 2: Literature Review........................................................................................................ 14 2.1 Introduction.............................................................................................................. 14 2.2 Disruption in General Industry and Construction Operations .......................................... 15 2.3 The Need for New Approaches to Safety – Background of RE ....................................... 21 2.3.1 The Evolution of Safety Thought ............................................................................. 22 2.3.1.1 Popular Accident Models ..................................................................................... 23 2.3.1.1.1 Sequential Accident Models .............................................................................. 24 2.3.1.1.2 Epidemiological Accident Models ...................................................................... 24 2.3.1.1.3.1 Normal Accident Theory ................................................................................ 29 2.3.1.1.3.2 High Reliability Organizations (HROs) ............................................................ 32 2.4 The Five Ages of Safety............................................................................................ 37 2.4.1 The First Age of Safety - The Technology Age ......................................................... 
38 2.4.2 The Second Age of Safety - The Human Factors Age................................................. 40 2.4.2.1 Human Error ...................................................................................................... 42 2.4.3 Third Age of Safety – Management Systems and Culture ........................................... 47 2.4.3.1 Culture and Climate ............................................................................................ 48 2.4.3.1 Safety Culture .................................................................................................... 50 2.4.4 Fourth Age of Safety – The Integration Age ............................................................. 51 2.4.5 Fifth Age of Safety – The Adaptive Age ................................................................... 52 2.5 Current Understanding of Resilience Engineering ......................................................... 53 2.5.1 Background and Definitions of Resilience Engineering .............................................. 58 2.5.2 The Four Premises of Resilience Engineering ........................................................... 64 2.5.3 The Four Abilities of a Resilient Organization ........................................................... 65 2.5.3.1 Learning ............................................................................................................ 66 v 2.5.3.2 Monitoring ......................................................................................................... 69 2.5.3.3 Anticipation ....................................................................................................... 72 2.5.3.4 Responding ........................................................................................................ 75 2.6 Managing Performance Variability –“Making Ends Meet” ............................................ 78 2.6.1 Explanation of Performance Variability – The Efficiency Thoroughness Tradeoff (ETTO) .................................................................................................................................... 81 2.7 Functional Resonance Analysis Method (FRAM) ......................................................... 84 2.8 The Resilience Analysis Grid (RAG) .......................................................................... 90 2.9 Other Understandings of Resilience Engineering.......................................................... 91 2.9.1 Stress-Strain Analogy for Resilience Engineering....................................................... 91 2.9.2 Madni and Jackson’s Resilience Engineering Framework .......................................... 94 Chapter 3: Methods ....................................................................................................................... 96 3.1 Introduction.............................................................................................................. 96 3.1.1 Conceptual Frameworks ......................................................................................... 98 3.1.2 Method ............................................................................................................... 100 3.2 Objective 1 ............................................................................................................. 102 3.3 Objective 2 ............................................................................................................. 
103 3.4 Objective 3 ............................................................................................................. 104 3.4.1 Discrete Event Modeling ....................................................................................... 107 3.4.2 Agent Based Modeling .......................................................................................... 108 3.4.3 Multi-Scale Modeling ........................................................................................... 109 3.4.4 Anylogic Software ................................................................................................ 110 Chapter 4: Resilience Engineering Conceptual Framework for Construction Safety ................ 113 4.1 Purpose and Features of the Framework ..................................................................... 113 4.1.2 Linking Disruptions, Resilience Engineering, Safety, and Production ......................... 113 4.1.2.1 Disruptions can/may cause accidents .................................................................... 114 4.1.2.2 RE is proposed as a Formalized Approach to Understand Disturbances .................... 115 4.1.2.3 Two of the four premises of RE deal directly with the relationship between production and safety .................................................................................................................... 115 4.2 The Elements of a Project-Based RE Construction Project ........................................... 116 4.2.1 Perspectives ......................................................................................................... 116 4.2.1.1 Resilience is a Quality of the System ................................................................... 117 4.2.1.2 Resilience is a Strategy ...................................................................................... 118 4.2.1.3 Culture for Resilience Engineering...................................................................... 119 4.2.1.4 View Systems Functionally ................................................................................ 121 4.2.1.5 The Four Abilities ............................................................................................. 124 4.2.1.5.1 Responding ................................................................................................... 124 4.2.1.5.2 Anticipating................................................................................................... 130 4.2.1.5.2.1 Patterns in Anticipation ................................................................................. 130 4.2.1.5.2.2 When to Anticipation .................................................................................... 132 4.2.1.5.2.3 General comments on Anticipation ................................................................. 134 4.2.1.5.3 Monitoring .................................................................................................... 137 4.2.1.5.4 Learning ....................................................................................................... 139 vi 4.3 Developing a Graphical Model for RE in Construction: The Software Development Process and The Sociocognitive Framework for Engineering Work Systems ................................... 141 4.3.1 The Software Development Process ....................................................................... 141 4.3.3 The Resilience Engineering Model for Projects ....................................................... 
142 Chapter 5: Resilience Engineering Conceptual Model Simulation ............................................ 147 5.1: Introduction .......................................................................................................... 147 5.1 Background of Conceptual Modeling........................................................................ 149 5.1.1 Developing the Conceptual Model ........................................................................ 150 5.1.2 Understanding the problem situation ..................................................................... 151 5.1.3 Determining the Modeling and General Project Objectives ...................................... 152 5.1.4 Identifying the Model Outputs.............................................................................. 153 5.1.5 Identify the Model Inputs .................................................................................... 153 5.1.6 Determining the Model Content ........................................................................... 154 5.1.7 Determining the Model Level of Detail ................................................................. 155 5.2 A Conceptual Framework to Describe Disruptions in the Construction Process............. 157 5.2.1 Step 1: Understand the Problem Situation .............................................................. 157 5.2.2 Step 2: Determine the Objectives .......................................................................... 159 5.2.3 Steps 3 and 4: Identifying the Model Outputs and Inputs ........................................ 160 5.2.4 Step 5: Determining the model content .................................................................. 161 5.2.5 Coding the Computer Model ................................................................................ 163 5.2.5.1 Experiment 1 Simulation................................................................................... 163 5.2.5.2 Experiment 2 Simulation................................................................................... 166 5.2.5.3 Experiment 3 Simulation................................................................................... 168 5.2.5.4 Experiment 4 Simulation................................................................................... 170 5.2.6 Understanding of the Experiments ........................................................................ 170 5.3 Verification and Validation ...................................................................................... 174 Chapter 6: Conclusions .............................................................................................................. 179 6.1 Review of the Research Goals, Questions, and Objectives............................................ 179 6.2 Contributions to Knowledge of this Research ............................................................. 180 6.3 Limitations of this Research ..................................................................................... 182 6.4 Future Agenda ........................................................................................................ 182 BIBLIOGRAPHY………………………………………………………………………………184 vii LIST OF TABLES Table 2.1: Probing Questions About the Ability to Learn ........................................................... 69 Table 2.2: Probing Questions about the Ability to Monitor ........................................................ 
72 Table 2.3: Probing Questions about the Ability to Anticipate ..................................................... 75 Table 2.4: Probing Questions About the Ability to Respond ...................................................... 77 Table 4.1: Construction project guidance analysis item - Response…………………………...125 Table 4.2: Construction project guidance analysis item - Anticipate ………………………….134 Table 5.1: Template for consideration of level of detail by component type . ........................... 156 Table 5.2: Model Content………………………………………………………………………161 Table 5.3: Model Assumptions and Simplifications................................................................... 163 Table 5.4: Data Requirements .................................................................................................... 163 Table 5.5: Compiled Average Throughput of the system for each experiment. ........................ 171 Table 5.6: Compiled performance statistics of the model. ......................................................... 173 viii LIST OF FIGURES Figure 1.1: The Hierarchy of Socio-Technical Systems in Organizational Risk Management ..... 7 Figure 1.2: Plateaus in Overall Safety Reached after Interventions ............................................. 10 Figure 2.1: Disturbances registered at a construction site ............................................................ 17 Figure 2.2: A Generic Epidemiological Model............................................................................. 26 Figure 2.3: Reasons Swiss cheese Model ..................................................................................... 28 Figure 2.4: HRO Model that incorporates Collective Mindfullness Process................................ 37 Figure 2.5: Function/Activity representation and aspects............................................................. 87 Figure 2.6: Stress-strain state-space.............................................................................................. 92 Figure 3.1: Research Method ...................................................................................................... 101 Figure 4.1: Time Periods............................................................................................................. 133 Figure 4-2: Resilience Engineering at the Project Workface...................................................... 143 Figure 5.1: Robinson’s Conceptual Model in the Simulation Project Life-Cycle. ................... 151 Figure 5.2 : Listing the Project Aim and Objectives …………………………………………159 Figure 5.3: The model outputs ……………………………………………………………….160 Figure 5.4 : Identifying the model inputs …………………………………………………….160 Figure 5.5: Experiment 1 ............................................................................................................ 164 Figure 5.6: Sample Arrival schedule for Work Entities. ............................................................ 164 Figure 5.7: Resource Schedule ................................................................................................... 166 Figure 5.8: Animation for each Experiment ............................................................................... 166 Figure 5.9: Experiment 2, Type A Disruptions .......................................................................... 167 Figure 5.10: Experiment 2, Type B Disruptions ...................................................................... 1678 ix Figure 5.11: Experiment 3 Java coding for Trade B. 
.................................................................. 169 Figure 5.12: Experiment 4 Java coding for Trade B. .................................................................. 170 Figure 5-13: Graph showing the behavior of the production system subject to disruptions and increasing production pressure. .......................................................................................... 172 x Chapter 1: Introduction 1.1 Background It is not uncommon to learn that a construction project has been delayed or postponed. The reasons proffered vary widely, for instance, needed funding may be withheld or suddenly unavailable, or perhaps the prime contractor or a key subcontractor went bankrupt. Another scenario might involve a shortage of labor or a labor strike that shuts down work temporarily. A common thread among projects is that they are subject to the vagaries of disruptions that interfere with commonly desired project outcomes such as customer satisfaction, on-time completion, profitability, and safety. Disruptions are those happenings that interrupt, or disturb, the normal course of work. Disruptions may assume many different forms and combinations that can bedevil the best planning, scheduling, and risk-aversion efforts. They range from the mundane, such as a key employee calling in sick, to the catastrophic, such as a crane collapse that shuts down the project for several months. In some cases, several seemingly unrelated disruptions may combine in previously unforeseen and unimaginable ways. Disruptions are rarely thought of as fortuitous events and frequently are the precursor to some degree of failure, such as physical injury or financial loss. In the U.S. construction industry there does not appear to be a definitive answer to the question “Why do disruptions occur?” It is the goal of this work to answer that question by exploring schemes and methods to understand, harness, and foresee disruptions that arise from demands placed on the construction operations of project-based organizations that deliver the built environment. The underlying assumption of this goal is that a disruption free construction project is desirable. 1 However, as with many other undertakings that seek to understand complex socio-technical problems, achieving the goal as stated is a tall order. Disruptions may occur in many different aspects of a project as mentioned above. It is prudent therefore to break down the analysis into more focused and manageable portions. For instance, disruptions could be examined with respect to finance, the supply chain, labor, or any combination or multitude of disruptions that have been observed on construction projects. The scope of this work involves understanding production workflow related disturbances. Geographically, this discussion is limited to the United States of America (U.S.A.), although it is envisioned to apply to other countries with similar means and methods of construction. This work posits that the emerging paradigm of Resilience Engineering (RE) is a way to begin to better understand how to handle the observed phenomena of disruptions on construction projects and to begin to answer the question of “Why do disruptions occur?” This will get us closer to the paradigm shift needed to make the next improvements in construction safety. 
In a nutshell, “Resilience Engineering looks for ways to enhance the ability of organisations to create processes that are robust yet flexible, to monitor and revise risk models, and to use resources proactively in the face of disruptions or ongoing production and economic pressures” (Nemeth et al. 2009). RE is based on four major premises (Hollnagel et al. 2011). First, that performance conditions are always underspecified. Second, that adverse events can be understood as the result of unexpected combinations of performance variability. Third, that safety management must be both proactive and reactive. Finally, that safety can neither be isolated from the core business process (e.g., production operations), nor vice versa. A resilient system must possess the intrinsic abilities to respond, anticipate, monitor, and learn at all levels of the organization (Hollnagel et al. 2011).

Many operations and industries are turning to RE to better handle disruptions (and the failures they may bring) as a means to proactively avoid them, to deal with and mitigate disruptions in real time (i.e., as they are occurring), and to recover from them. RE is an idea that is gaining traction in process-centric and high-risk industries such as oil and gas operations, the nuclear industry, air traffic management, and health care, to name but a few. In this work, the concept of RE is utilized to devise a conceptual framework of how RE could be applied to construction operations to combat disruptions. Then the idea that RE might be an avenue to understand, harness, and foresee disturbances is explored via a conceptual modeling approach. The simulation modeling methods of agent-based and discrete event modeling are employed to perform experiments that include disturbances in a production setting populated by agents. Given that a basic tenet of RE is concerned with the relationship between safety and production, the simulation utilizes discrete event modeling in a production setting. The agents simulate the behavior of a trade crew or installers.

The ideas presented in this brief introduction are expanded upon in the following chapters. The remainder of this opening chapter will briefly characterize the nature of the United States construction industry, discuss the chronic construction safety problem, and list and briefly elucidate the goals, research questions, and objectives of the dissertation. Finally, the remaining chapters are summarized. To begin to understand disruptions in the construction industry it is informative to first briefly examine the nature of the construction industry.

1.2 Defining Characteristics of the Construction Industry

Oglesby, Parker, and Howell (1989) described construction projects as complex, uncertain, and quick. On closer examination, a fourth descriptor emerges – that working conditions are underspecified.

1.2.1 Construction is Complex

Construction is complex. Bertelsen (2003) notes that “The general view of the construction process is that it is an ordered, linear phenomenon, which can be organized, planned and managed top down”; however, upon closer inspection he observes that “construction is indeed a complex, nonlinear and dynamic phenomenon, which often exists on the edge of chaos.” A complex system is one which has, within itself, a capacity to respond to its environment in more than one way, and to select among the options in some way (Miller and Page 2007).
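To make this definition concrete, the following minimal Java sketch (Java being the language used for the simulation experiments described in Chapter 5) shows an agent that can respond to the same situation in more than one way and selects among its options. The class, method, and parameter names are hypothetical illustrations for this discussion only and are not taken from the dissertation's AnyLogic model.

```java
import java.util.Random;

/**
 * Illustrative sketch only: a minimal agent that can respond to its
 * environment in more than one way and select among the options, in the
 * sense of Miller and Page (2007). Names and thresholds are hypothetical.
 */
public class CrewAgent {

    enum Response { CONTINUE_AS_PLANNED, SLOW_DOWN_AND_VERIFY, STOP_AND_REPLAN }

    private final Random rng = new Random();

    /** Select a response based on locally observed pressure and disruption. */
    Response respond(double productionPressure, boolean disruptionPresent) {
        if (!disruptionPresent) {
            return Response.CONTINUE_AS_PLANNED;
        }
        // Under high production pressure the agent may trade thoroughness for
        // efficiency and press on despite the disruption (an ETTO-like choice).
        if (productionPressure > 0.8 && rng.nextDouble() < 0.5) {
            return Response.CONTINUE_AS_PLANNED;
        }
        // Otherwise the agent selects between two more thorough options.
        return rng.nextBoolean() ? Response.SLOW_DOWN_AND_VERIFY : Response.STOP_AND_REPLAN;
    }

    public static void main(String[] args) {
        CrewAgent agent = new CrewAgent();
        // Two identical situations may yield different responses, which is the
        // source of performance variability at the workface.
        System.out.println(agent.respond(0.9, true));
        System.out.println(agent.respond(0.9, true));
    }
}
```

When many such agents interact on a shared site, the aggregate behavior of the system cannot be read off from any single agent, which is the sense in which the whole becomes more than the sum of its parts.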
Simon (1997) defined a complex system as one made up of a large number of individual parts that have many interactions. A key concept of complexity is that the system emerges into something that is greater than the sum of its parts.

1.2.2 Construction is Uncertain and Underspecified

Construction projects, like most other human undertakings, have uncertain futures. Uncertainty is defined by the Construction Industry Institute (CII) (1989) as “The gap between the information required to estimate an outcome and the information already possessed by the decision maker.” Uncertainty resides in all human activity but seems to be magnified in the construction industry. Wild (2005) posits that uncertainty arises from asymmetric information, limited information processing capacity, and bounded rationality. In other words, construction managers either cannot gather or do not share all of the information necessary to reduce uncertainty, and this hinders the evaluation of all of the possible outcomes of decisions made for a complex construction project. Even if it were possible to determine all of the possible outcomes resulting from decisions, few firms have the necessary resources (e.g., time and money) to consider all of the possibilities given the available information. Given this state of limited information, firms make the best decisions possible in an opportunistic manner that suits their self-interest. The resulting outcome is a series of individually optimized tasks or phases that add up to a suboptimized project.

Closely allied with uncertainty is the notion that working conditions are underspecified. No amount of pre-planning effort can achieve a complete description of the conditions that the frontline worker will encounter. There are multiple scenarios and situational developments to account for that are nearly impossible to describe beforehand. This is compounded by the interactions of the various trades (plumbing, electrical, carpentry) and their corresponding supply chains and organizational structures that are simultaneously working together on site and may have competing goals for resources. It is the job of these frontline workers to “make ends meet” given the unique conditions encountered on-site and to complete the project according to plans and project specifications, along with the demands outlined above. Underspecification is directly related to the performance variability of workers that is discussed in Chapter 2. In a nutshell, performance conditions are usually underspecified. Individuals and organizations must therefore adjust what they do to match current demands and resources. Because resources and time are finite, such adjustments will inevitably be approximate (Nemeth et al. 2009).

1.2.3 Construction is Quick and Demanding

The mantra of the modern business world is “better, faster, cheaper.” Broadly this might be termed the ‘demands’ placed on the construction industry to rush to completion. However, there are several dimensions to the demands placed on the industry. Rasmussen (1997) captured and abstracted the demands placed on the typical organization in a 1997 paper titled “Risk Management in a Dynamic Society: A Modelling Problem” and in a follow-up report to the Swedish Rescue Services Agency (SRSA) (Rasmussen and Svedung 2000). He succinctly captures the problems facing modern risk analysis by arguing that present models of accident causation are inadequate, narrowly defined, and do not reflect the changing social landscape.
He argues that risk management must be modeled by cross-disciplinary studies, done in a control theoretic manner, and considered on a socio-technical systems basis. Figure 1.1 describes this outlook. Rasmussen argues that there should be “vertical alignment” across the levels. That is, information about what is happening at the workface should be communicated up through the column while decisions should propagate down through the column. The vertical interdependencies are essential to the functioning of the system and should be primary considerations when attempting to reduce risk. However, Rasmussen points out that much of the research is conducted horizontally by the various disciplines. For instance, research is conducted on the management level in isolation from the work level. The environmental stressors listed, such as the fast pace of technological change and changing competency levels, occur at different levels and on various time scales.

Figure 1.1: The Hierarchy of Socio-Technical Systems in Organizational Risk Management (Rasmussen 1997)

Next, the chronic construction safety problem is reviewed, as well as the phenomenon of “plateauing” with respect to improvements in safety.

1.3 The Construction Safety Problem

Although safety is a widely touted priority in the construction industry, in the period from 1992 to 2005 an average of about 1,147 worker deaths per year were reported in the U.S. (CPWR 2008). Construction safety appears to be a chronic problem in the industry. Based on statistics alone, safety appears to be improving somewhat in the construction industry. The most recent version of the “Construction Chart Book” (CPWR 2008) reports that rates of overall work-related fatalities in construction decreased by 22% in the period from 1992 to 2005, and nonfatal injuries and illnesses with days away from work (DFW) dropped by 55% in the same period. Falls and electrocutions, the leading causes of death in the industry, have declined over the past 15 years. The Chart Book attributes these improvements to focused efforts on prevention. Despite the improvements, 16,068 construction fatalities were recorded in the same period, an average of about 1,147 worker deaths per year. Based on 2005 statistics from the Chart Book, only the agriculture, mining, and transportation industries had higher annual fatality rates (per 100,000 workers), and only the transportation industry had a higher rate of nonfatal injuries and illnesses with days away from work (DFW) (per 10,000 workers). These numbers are disproportionately high compared to other industries given that construction workers account for only approximately 7% of the workforce. Other statistics reveal that the incidence of elevated blood lead levels is disproportionately high in construction workers compared to other workforce sectors and that about 41% of construction workers over age 55 were diagnosed with hypertension in 2005 (CPWR 2008). It appears there are many areas for improvement in construction worker safety, health, and well-being. Another discouraging aspect of the construction safety problem is that it is costly. In 2002 dollars, the total (direct and indirect) costs of fatalities and nonfatal injuries were $13 billion.

Construction researchers frequently speak about reaching a “plateau” with regard to construction accidents. Groeneweg (1998) contends that plateauing (his term is “stabilizing”) is a natural occurrence in an effort to reduce accidents.
In a typical scenario, a company recognizes that the number of accidents is a problem and takes measures to decrease that number. The efforts (e.g., audits, inspections, safety meetings) yield positive results and accidents are reduced – for a while. Stabilization often takes place even with more effort being expended. To break the pattern, more measures and countermeasures are implemented; it is as if the former actions have lost their impact on safety. Groeneweg (1998) observes that, after safety improvements are put into place, “Routinenisation and normalizations of programs and initiatives, the aging of systems, and the sometimes intrinsically hazardous environment seem to push the number of failures back up again.”

Taking a historical overview of industrial safety, it is seen that plateaus are not uncommon. From 1937 to 1956 the fatality rate for U.S. workers in all industries decreased from 43 to 23 deaths per 100,000 workers (Groeneweg 1998). This decrease was attributed to the intervention of engineering controls applied to industrial hardware, ergonomics science, and the increased use of personal protective equipment (PPE). Another factor was a decrease in fatigue due to the federally mandated lowering of working hours. In the 1960s and 1970s the number of industrial accidents stabilized. Efforts were expended on employee behavior in the form of motivational programs and improving the quality of associates. The 1980s saw a 21% decline in the accident death rate. The current focus is on improving the fatality rate via sociotechnical organizational change while engineering controls and behavior modifications are ongoing. Figure 1.2 summarizes these trends.

Figure 1.2: Plateaus in Overall Safety Reached after Interventions (adapted from Groeneweg 1998)

This thesis posits that a way to deal with the disruptions that surround the safety problem in the construction industry is found in the emerging discipline of RE, as introduced below.

1.4 A Brief Introduction to Resilience Engineering

Resilience Engineering (RE) is a different way to look at safety and is unique compared to other approaches in its perspective (Hollnagel et al. 2008). In RE, failure and success are viewed as opposite sides of the same coin; both depend on the normal performance variability of the production process of the system. Workers are not viewed as “cogs in a machine” but as humans whose performance naturally varies. Mostly this variance maintains system stability, but on rare occasions it does not. The goal is to dampen the variability that precedes failure and to amplify the variability that produces success (Hollnagel et al. 2008). A resilient system aims to adjust its functioning prior to or following changes and disturbances so that it can continue functioning after a disruption or a major mishap and in the presence of continuous stress (Hollnagel et al. 2006).

The approach of Resilience Engineering is to look for ways to maintain control of a system and harness (but not constrain) performance variability. Additionally, it strives to apply the skills of foresight and imagination to proactively promote needed change and to constantly monitor and adjust the safety model as project conditions vary or may vary. RE is an emerging discipline that has been described as a paradigm shift about safety in the “Kuhnian” sense (Woods and Hollnagel 2006). RE builds upon previous approaches such as those found in the Normal Accident Theory (NAT) and High Reliability Organization (HRO) approach.
The RE approach recognizes the need for a progressive safety plan that is systems-based, sociotechnical in its outlook, views work as a complex activity, and is concurrently reactive and proactive in its vigilance to prevent accidents. In the RE outlook safety is a core value of the organization. In a complex, uncertain, and quick world of business, RE looks for ways to productively balance the tensions among ongoing production and economic pressures and safety, recognizing the need to pull back or stifle production when safety is threatened. The goal, research questions, objectives and method of this work are described below. 1.5 Goals, Research Question, Objectives, and Method 1.5.1 Goal The goal of this study is to explore schemes and methods to understand, harness, and foresee disturbances that arise from demands placed on the construction operations of project-based organizations that deliver the built environment. 11 1.5.2 Research Questions The research questions are: 1. How does RE differ from traditional ways of thinking about how to deal with disruptions? 2. What are the current principles and practices of RE? 3. What elements of RE may help a construction project avoid, survive, and recover from disruptions? 4. How can we begin to simulate disruptions in construction operations and use RE principles? The focus, or scope, of this work is on the disruptions that may create the conditions conducive for an accident. However, the Research Questions could be a springboard for future researchers to examine other scopes of interest that arise in construction operations such as disturbances from financial or logistical difficulties. The first two questions are considered from the literature review. Questions three and four comprise the Conceptual Framework presented. 1.5.3 Objectives The Objectives of this research are: 1. Abstract the concept and underlying theories of RE and explore RE deployment in nonconstruction industries for use in formulating Objectives 2 and 3. 2. To present a RE conceptual framework for construction safety. 12 3. To explore RE implementation in construction production settings that experience disruptions in a formalized way using hybrid computational methods. 1.6 Chapter Summary The research presented is contained in six chapters. This chapter began with the observation that disruptions are endemic to the construction industry and examined the general characteristics of the construction industry and the specific problem of safety. It was hypothesized that the paradigm of RE is an approach to break the plateau that safety efforts seem to be mired in. The Goal and Objectives of this research were discussed as was the proposed method. Chapter 2 presents a background on the relationship between safety and production as well as the evolution of RE. Chapter 3 outlines the methods used in the study. Chapter 4 introduces a new conceptual model, developed by the author, for understanding disturbances in the paradigm of RE. The chapter discusses building the computational model. The chapter concludes by discussing the steps taken to experiment with the computational model. Chapter 5 presents the Conceptual Model and results and discussion of the simulation. Finally, Chapter 6 contains conclusions of the research and its contributions to knowledge. The chapter also suggests possible areas for future research in RE. The Appendix provides the model code and raw data. 13 Chapter 2: Literature Review 2.1 Introduction This literature review has three primary purposes. 
First, to examine the knowledge and research associated with disruptions in industry. Second, to provide the reader with the background necessary to understand how RE evolved from safety thought and approaches throughout the industrial age and how RE diverges from and expands upon popular safety approaches. Third, to present the current state of RE to the reader as well as other understandings of RE.

The review first explores the current state of research on construction disruptions. Given that research on disruptions is sparse, this information is brief and straightforward. The second and third purposes of this literature review are to fully explore the emerging paradigm of RE. This is guided by first focusing a skeptic’s eye on the four premises of RE as presented in Chapter 1. The first three premises are primarily examined by looking at popular models of safety and the evolution, or “ages,” of safety thought in order to examine why the new paradigm of RE may be useful. In short, the first three premises challenge the notions that human variability at the workface is a threat to safety, that malfunctions always occur in a linear fashion and always have a root cause, and finally that the calculation of failure probabilities through techniques such as fault tree analysis, while useful, only partially explains the mechanism of accidents. The background essential to understanding RE is also presented in the context of the first three premises, namely the “Normal Accident Theory” and that of “High Reliability Organizations.” The so-called “softer” factors of culture and climate as they relate to the understanding of RE are also introduced. The concepts and applications of RE are then presented as they are currently understood in the literature. Finally, alternate understandings of and approaches to the RE paradigm are presented.

The review begins by looking at how disruptions are described in the literature.

2.2 Disruption in General Industry and Construction Operations

The purpose of this section is to discuss disruptions in general as well as how they are understood in construction operations. Disruptions are those happenings that interrupt, or disturb, the normal course of work. Disruptions are a fact of life on many construction projects. Additionally, they may contribute to the performance variability of a system. In general, “Resilience represents the ability of a system to adapt or absorb disturbances, disruptions, and changes and especially those that fall outside the textbook operational envelope” (Woods et al. 2007). The Association for the Advancement of Cost Engineering (AACE) defines disruption as “an action or event which hinders a party from proceeding with the work or some portion of the work as planned or as scheduled” (AACE 2004). Ibbs et al. (2007), in the context of discussing owner or contractor change upon a project, note that some studies define disruption as “…the occurrence of events that are acknowledged to negatively impact on labor productivity.” Kuivanen (1996) proposes defining a disturbance in a general way as an “unplanned or undesirable state or function of the system.” Disturbances are stochastic in nature and are difficult to foresee and therefore difficult to plan for (Lindau and Lumsden 1995). They can range from the mildly annoying, having just a slight impact on the project, for instance a crew waiting on a late ready-mix concrete delivery, to more impactful events, such as demolishing and replacing a misplaced concrete wall or a crane collapse.
Traditionally, disturbances have been primarily evaluated from either a legal or contractual point of view but have rarely been examined in the context of the production site or with respect to the assessment of construction processes (Gehbauer et al. 2007). The topic of disturbance management receives little attention in construction research. Gehbauer et al. (2007) define a construction project disturbance as “unexpected occurrences causing an interruption or at least a delay in the execution of tasks; they cause a significant discrepancy between the target and actual data” and note that “target data usually refer to time or cost-related operations.” They observed a construction project for twenty days for disturbances in order to build a disturbance database. The database is categorized into four broad areas: project description, data regarding the observed disturbance, disturbance elimination, and disturbance effects. The results of the observations are shown in Figure 2.1. Execution errors (42%) were observed most frequently, followed by information errors (27%), delivery problems (12%), and then planning errors (8%). The term “disturbance factor” is used to describe a reason for a target-actual discrepancy. They define “construction operations” as the project under consideration and “construction processes” as any sub-section of the project, such as masonry work. Operational disturbances may be external (i.e., those related to natural, legislative, or economic events) or internal, such as those occurring in procurement, sales, and construction site work. “Disturbances are further subdivided into personnel-, material- and area-related disturbances.” Here, area-related refers to the sub-process under examination, such as masonry or formwork. The authors also introduce primary and induced disturbance factors and describe them thusly: “Primary disturbance factors are deviations caused by independent actions within the same area. A primary disturbance arises in construction site work when for example the forms for the pouring of a concrete wall burst because the locks were forgotten during the assembly of the forms. In contrast, an example of an induced disturbance would be the opening of form locks too soon because of an error made when calculating the concrete pressure. Induced disturbances are deviations originating in another area.” Safety is not mentioned in the Gehbauer et al. (2007) paper.

Figure 2.1: Disturbances registered at a construction site (Gehbauer et al. 2007)

Jackson (2010), speaking of industry in general, states that “Accidents are the result of disruptions, and the resilience of the system to disruptions will depend on the nature of the disruption. The goal of resilience is to avoid, survive, and recover from disruptions.” Madni and Jackson (2009), speaking in the context of RE, define disruptions as “conditions or events that interrupt or impede normal operations by creating discontinuity, confusion, disorder or displacement.” They can take the form of operational contingencies, natural disasters, terrorism and political instability, and financial meltdown.

Disruptions can also be described as either Type A or Type B (Jackson 2010, Madni and Jackson 2009). Type A disruptions are external to the system, exemplified by earthquakes, floods, tornadoes, and so forth. Type A disruptions can be caused by the influence of one system upon another.
For instance, an aircraft flying too close to another may exert an aerodynamic wake and damage the structural integrity of a nearby jet. Type B disruptions are systemic in nature and are “a disruption of function, capability or capacity.” The spaceflights of Apollo 13 and the shuttle Columbia could be broadly classified as Type B failures. In the first case, Apollo 13 was resilient and survived; in the second case, Columbia was brittle (i.e., the opposite of resilient) and met with disaster. Type B disruptions arise in socio-technological systems, and their sources include humans, automated systems, and combinations of these (Madni and Jackson 2009, Jackson 2009). These sources are collectively termed “agents” and are classified as human agents, automation agents, and multi-agent combinations. Type B disruptions can also be termed predictable or unpredictable. A predictable disruption is one that has been previously encountered and is usually accounted for in design and operations. “Unpredictable disturbances can occur, either because a phenomenon was unknown to modern science, or because it was unanticipated/unknown to the systems designers” (Madni and Jackson 2009). A special case of a Type B disruption is called a “Disruption of Unreliability.” A condition of this classification is that the component’s reliability has been verified by historical or test data against the “mean time between failure” (MTBF) standards established for the component. For instance, a formwork lock is considered such a disruption if it fails prematurely, before its verified useful life. Otherwise, the component failure is categorized as “management failure to assure proper verification” (Jackson 2009). Finally, disruptions can be caused by latent conditions, as described in Reason’s Swiss Cheese Model (Madni and Jackson 2009, Jackson 2009).

Lindau and Lumsden (1995) examined disturbances in manufacturing with the aim to “…classify safety actions used in manufacturing companies and to evaluate their efficiency in preventing the propagation of disturbances from a holistic perspective.” In their study they classify a disturbance “as an event which affects a planned resource movement in such a way that a deviation from plan occurs.” They classified the actions to prevent the propagation of disturbances as either formal or informal. Formal approaches are actions considered in the planning stage to absorb the effects of a disturbance. They include safety stock, safety capacity, safety lead time, overplanning, expediting, and subcontracting. Safety in this sense means preventing interruptions to the work flow and does not concern the worker. Informal actions are used when the formal actions are not effective in absorbing the disruption. Informal actions include subcontracting, expediting, partial delivery, short-term replanning, and reservation breaking (i.e., ordering materials sooner than needed). To analyze the effects of common disturbances on overall system performance, the authors observed the common disturbances of material shortage, absenteeism, machine breakdown, tool shortage, and technical documentation shortage. They noted the formal and informal actions taken to absorb the disturbances to the system and developed a mean absorption ability, relative absorption ability, and general absorption ability to gauge the effectiveness of the actions while broadly considering the cost and upset to other parts of the system.
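As a purely illustrative sketch of the kind of bookkeeping this implies, the following Java fragment records disturbances together with the formal or informal action used to absorb each one. The class, field, and category names are hypothetical, and the "absorption ability" computed here is a simplified stand-in (the fraction of disturbances absorbed), not the measure published by Lindau and Lumsden (1995).

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch only: logging disturbances and the formal or informal
 * actions used to absorb them, in the spirit of the study described above.
 * The absorption measure is a simplified stand-in, not the authors' formula.
 */
public class DisturbanceLog {

    enum Disturbance { MATERIAL_SHORTAGE, ABSENTEEISM, MACHINE_BREAKDOWN, TOOL_SHORTAGE, DOCUMENTATION_SHORTAGE }
    enum ActionType { FORMAL, INFORMAL, NONE }

    static class Record {
        final Disturbance disturbance;
        final ActionType actionTaken;
        final boolean absorbed; // did the action prevent the disturbance from propagating?

        Record(Disturbance disturbance, ActionType actionTaken, boolean absorbed) {
            this.disturbance = disturbance;
            this.actionTaken = actionTaken;
            this.absorbed = absorbed;
        }
    }

    private final List<Record> records = new ArrayList<>();

    void log(Disturbance d, ActionType a, boolean absorbed) {
        records.add(new Record(d, a, absorbed));
    }

    /** Fraction of logged disturbances that were absorbed (illustrative measure only). */
    double absorptionAbility() {
        if (records.isEmpty()) return 0.0;
        long absorbed = records.stream().filter(r -> r.absorbed).count();
        return (double) absorbed / records.size();
    }

    public static void main(String[] args) {
        DisturbanceLog log = new DisturbanceLog();
        log.log(Disturbance.MATERIAL_SHORTAGE, ActionType.FORMAL, true);   // safety stock covered it
        log.log(Disturbance.ABSENTEEISM, ActionType.INFORMAL, true);       // short-term replanning
        log.log(Disturbance.MACHINE_BREAKDOWN, ActionType.NONE, false);    // propagated downstream
        System.out.printf("Absorption ability: %.2f%n", log.absorptionAbility());
    }
}
```

A log of this kind is also the raw material for the RE ability of learning that is discussed later in this chapter.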
The authors concluded that there is no way of preventing the propagation of disturbances that is simultaneously cost efficient and not unsettling to the system. Barroso and Wilson (1999), writing in the manufacturing area, define disturbance as “any event which has not been planned for or which is undesirable and that reduces or has the potential for reducing overall system performance in terms of either production or safety requirements and goals.” They report that a common finding among several authors who have analyzed the process of industrial accident causation is that accidents tend to occur more often in abnormal, conflicting system states than in normal ones, and that the number of accidents occurring while the operator is involved in troubleshooting or disturbance control tasks varies between one-third and two-thirds of the total number of accidents analyzed. In terms of production, they report that between 80% and 94% of the disturbances registered have an effect on production, stressing that 27% of the disturbances resulted in material damage. Barroso and Wilson (1999) summarize the categories for “Consequences of Disturbances” from several researchers; the categories include: no effect, presence of risk factors, hazardous situation, minor and serious accident, catastrophe, fatal, nonfatal lost time, and nonfatal non-lost time. Toulouse (2002) writes about operator intervention in automated batch production systems and defines a disturbance as follows: “A disturbance corresponds to a variation in the state or function of the system that requires operator intervention to avoid production shutdowns, material damage, defects in quality, or to return the automated production system to its operating state following an unanticipated shutdown or the appearance of defects in the product.” Toulouse’s study mainly explored how operators came into contact, through manual intervention, with a source of energy required to run the system after a disturbance. He also reports that research has shown that 50% of disturbances reduce the operators’ safety, and that one accident occurs for approximately 2% of them.

Even though disruptions are not well understood and the literature on the concept is sparse (Toulouse 2002, Barroso and Wilson 1999), they are phenomena that must be anticipated and avoided. RE is proposed as a way to formally understand, harness, and foresee disturbances that arise from demands placed on the construction operations of project-based organizations that deliver the built environment. It is discussed in the following sections.

2.3 The Need for New Approaches to Safety – Background of RE

RE protagonists argue that while historical approaches to safety have been fruitful and improved safety over time, the approaches of the past that model systems as linear and simple do not contribute to the understanding of today’s complex industrial world. The focus has been, as Hale and Hovden (1998) point out, on the “negative,” that is, on the unreliable, weak, and problem-generating areas of a socio-technical system, and on fixing the most recent breakdown or flaw. This adjustment is often made just after a high-profile disaster. Many times this is done by erecting some form of barrier, either physical or rule-based, and by placing blame on technical, human, or organizational features of the system.
However, following Perrow’s ideas that accidents are “normal” and are the “unanticipated interaction of multiple failures” (Perrow 1984), RE seeks to identify how work is performed under “normal” conditions and resource pressures (especially time), and how system variability is handled on a daily basis. The identification of possible interactions of multiple failures of the system, and how to eliminate or mitigate them, is also considered. The following sections examine different approaches to understanding safety from the advent of the industrial age to the present. Researchers have made several attempts to delineate and trace the genealogy of safety thought. There are conflicting opinions about the evolution and trajectory of safety science. In short, there does not seem to be a consensus with regard to how safety is understood and best practiced in either general industry or construction. However, two lines of thought are useful to explain how RE came to be and why it came to be. First, the “Five Ages of Safety” as identified by Borys et al. (2009) provides a direct path to the current need for and understanding of RE. Additionally, it is useful to examine prominent and popular models of safety to help understand the background of RE. Along the way, the concepts behind High Reliability Organizations (HROs) and Normal Accident Theory (NAT) are examined, as are the concepts of climate and culture as they relate to RE.

2.3.1 The Evolution of Safety Thought

The literature review explores the background and justification of RE by examining both the “Ages” of safety as well as popular safety models. It is informative to briefly look at the different ways researchers have divided safety into various epochs or categories for analysis. Hale and Hovden (1998) delineate safety management efforts in terms of three ages: a legislative age, a human factors age, and a management age. Groeneweg (1998) divided safety evolution into periods of engineering, employee, and organizational control, as illustrated in Figure 1.2. Hollnagel (2004) posits that the focus on each area might stem from the “…strong and natural tendency to look for explanations or causes of the systems that fail most frequently or which in some ways are conspicuous.” So, for instance, engineering failures were most prevalent as materials and engineering sophistication struggled to keep pace with changes in the early part of the 20th century; correspondingly, accidents seemed most reasonable to explain in terms of human errors around the 1950s, and organizational struggles seemed a likely target in the 1980s. Finally, Hollnagel (2004) categorizes accident models as sequential, epidemiological, and systemic. These approaches have overlapping theories and models and exhibit the absence of a consistent worldview of how safety is achieved.

Borys et al. (2009) built upon Hale and Hovden’s (1998) three ‘ages of safety’ to better reflect current understanding of modern complex systems. They discuss the fourth age of safety as introduced by Glendon et al. and then introduce a fifth age of safety, which they call the adaptive age and which is informed by Resilience Engineering. It is useful to examine these five ages to learn not only the precepts upon which Resilience Engineering is built but also to unearth how prior attempts at risk management fail to explain and manage some of today’s complex safety problems.
Categorizing the evolution of safety thought chronologically is not ideal; however, reviewing other researchers' attempts to corral safety thought is the best available option for the purposes of this thesis. These attempts are reviewed along with Hollnagel's take on the evolution of accident models, classified as sequential, epidemiological, and systemic, because they highlight the "negative" outlook, represent, even today, popular outlooks on safety among professionals, and allow the logical introduction of the systems approaches of NAT and HROs. Popular accident models are first examined, followed by the "Five Ages of Safety." 2.3.1.1 Popular Accident Models Hovden et al. (2010) write, "Most accident models and theories applied in the field of occupational accidents are still based on the ideas in Heinrich's Domino Model, Gibson's and Haddon's epidemiological models of energy-barriers, and are using a closed system safety mindset with mechanistic metaphors to describe the conditions, barriers and linear chains of an accident process." Because of this popularity they are described below. In general, "Sequential and epidemiological accident models are inadequate to capture the dynamics and nonlinear interactions between system components in complex sociotechnical systems" (Qureshi et al. 2009). Researchers developed systemic socio-technical models to move beyond the linear and epidemiological models in order to better understand modern accidents and to better identify risk. Popular systems models include Normal Accident Theory (NAT) and High Reliability Organizations (HROs). These are described below given that they are crucial to the understanding of RE. Additionally, the Functional Resonance Analysis Method (FRAM) is discussed. The FRAM is described because it "proposes a methodology to identify and assess performance variability. Based on a functional modeling, the FRAM shares Resilience Engineering assumptions about the complex socio-technical systems underspecification and recognizes in it the need for local adjustments" (Macchi and Hollnagel 2011). 2.3.1.1.1 Sequential Accident Models A sequential accident model describes an accident as the result of a linear sequence of events (Hollnagel 2004). Qureshi et al. (2007) point out that the assumption in these types of models is that the cause-effect relationships between consecutive events are linear and deterministic and that a single initiating factor can be found that triggered the offending event. Therefore, if that factor can be removed then the accident can be avoided. However, Qureshi et al. (2007) state that "The reality is that accidents always have more than one contributing factor." Sequential models are useful for describing component failures or human error in simple systems but are an oversimplification in many cases (Hollnagel 2004, Qureshi et al. 2007). "The underlying assumption, as illustrated by the domino model, is that an accident is the result of a sequence of events and that causes, once they have been found, can be eliminated or encapsulated, thereby effectively preventing future accidents" (Hollnagel 2004). An advantage of sequential models is that they are easy to communicate graphically and are easily understood as compared to multicausal reasoning (Hollnagel 2004).
2.3.1.1.2 Epidemiological Accident Models The 1979 investigation into the nuclear core meltdown and radiation release occurring at the Three Mile Island Nuclear Generating Station in Pennsylvania prompted accident researchers to explore more sophisticated accident models. Sequential models, while useful for simple accident scenarios, could not explain this complex accident. Hollnagel (2004) explains that epidemiological accident models were used to explain the accident because sequential models were not powerful enough to explain what happened. The epidemiological approach to industrial accident analysis is borrowed from the medical field. Lingard and Rowlinson (2005) note that Gordon and Suchman pioneered the use of the epidemiological approach for industrial accident prevention in 1949 and 1961, respectively. It was observed that the occurrence of occupational injuries bears a resemblance to the study of infectious and non-infectious diseases (i.e., the science of epidemiology) and that the techniques used in the medical field might be transferred to the study of occupational health. Lingard and Rowlinson (2005) describe accident causation as derived from the combination of at least three sources: the host, agent, and environment. The host is the person to whom the injury or illness occurred. Hosts may have characteristics that promote certain types of injury or illness such as physiological features (e.g., strength, age, and gender), levels of training and competence, and motivation or behavior issues. The agent is the deliverer of the injury or illness and can be physical, chemical, or biological in nature. The environment is the physical, biological, and socio-political aspects of the work environment, although socio-political aspects are rarely considered in construction site accidents. A generic epidemiological model modified for the construction case is shown in Figure 2.2. Figure 2.2: A Generic Epidemiological Model: the Host (e.g., age, strength, gender) maintains barriers and defenses against the Agent (tools, equipment, chemicals, building components), and an unsupportive Environment (physical: site layout, noise levels, housekeeping, temperature, ventilation; biological: sanitary conditions, insect bites) may weaken those defenses (adapted from Hollnagel (2004) and Lingard and Rowlinson (2005)). Epidemiological models differ from sequential models in four main areas (Hollnagel 2004). First, the neutral term "performance deviation" replaced the idea of human error when discussing an unsafe act. Performance deviation can include a component or a human and thus took the focus off of blaming human error entirely for an accident. Second, discussing environmental conditions leaves the door open to discuss whether multiple causes could have contributed to the accident. Lingard and Rowlinson (2005) describe epidemiological models as "…consistent with the concept of multi-causality." Multi-causality simply states that "Contributing causes combine together in a random manner resulting in an accident." Third, barriers that could prevent unintended consequences, and thus the accident, are incorporated into the model. Finally, the concept of latent conditions was introduced into the model. Latent conditions are those that are present in the system prior to the accident sequence; they may not trigger accidents but may become apparent in the course of performance deviation.
They have been described as "resident pathogens" that may lie within the system for years until they combine with other triggering factors to create an accident opportunity. "Latent conditions have two kinds of adverse effect: they can translate into error provoking conditions within the local workplace (for example, time pressure, understaffing, inadequate equipment, fatigue, and inexperience) and they can create long lasting holes or weaknesses in the defenses (untrustworthy alarms and indicators, unworkable procedures, design and construction deficiencies, etc.)" (Reason 1990). Hollnagel (2004) classifies James Reason's "Swiss Cheese Model" (see Figure 2.3) as an epidemiological model. Reason characterizes defenses, barriers, and safeguards as layers of Swiss cheese. He describes his model (2000) as each "slice" being a barrier that could be engineered (e.g., alarms or physical barriers), rely on humans (e.g., pilot, surgeon), or depend on procedures and administrative controls. For a total defense each layer would be intact and prevent the occurrence of an accident. However, holes arise in the barriers ("slices") that allow for the possibility of an unwanted outcome. "The presence of holes in any one "slice" does not normally cause a bad outcome. Usually, this can happen only when the holes in many layers momentarily line up to permit a trajectory of accident opportunity—bringing hazards into damaging contact with victims." The holes arise because of active failures and latent conditions (as discussed previously). Active failures are committed by those in contact with the systems and take the form of "…slips, lapses, fumbles, mistakes, and procedural violations." The move from latent failures to active failures can be thought of as moving from the executive (i.e., "blunt") end of the spectrum to the field (i.e., "sharp") level of operations. It is interesting to note that although Hollnagel considers the Swiss Cheese Model to be an epidemiological model, Reason considers it to be a systems model; he also considers the Domino and other types of sequential models to be systems models (Reason 2008). Figure 2.3: Reason's Swiss Cheese Model (Reason 2000) 2.3.1.1.3 Systems Models Systems models differ from the structural decomposition of linear and epidemiological factors by focusing on the "characteristic performance on the level of the system as a whole" (Hollnagel 2004). Accidents are viewed as emergent phenomena, which is in line with Perrow's view that accidents are "normal" or to be expected. Systems models trace their roots to many different disciplines, including complexity theory (as evidenced by the emergent nature of systems), control theory, chaos theory, systems theory, cognitive science and its branches, and numerous other disciplines that may affect the system (Hollnagel 2004, Qureshi et al. 2008, Leveson 2004). Systems are sometimes termed socio-technical systems. This term was coined from studies at the Tavistock Institute in London and is concerned with the interaction of people and technology with respect to work design (Trist and Bamforth 1951). Because of the multitude of systems approaches and the corresponding disciplines that they draw upon, systems approaches are presented here as they are relevant to the development of Resilience Engineering.
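Before turning to those systems models, the barrier-alignment idea at the heart of the Swiss Cheese Model can be made concrete with a small numerical illustration. The sketch below is not taken from Reason's or Hollnagel's work; it is a minimal Monte Carlo exercise in Python, with assumed and purely illustrative hole probabilities, that estimates how often independent "holes" in several defensive layers line up to let a hazard reach a victim.

```python
import random

def accident_probability(layer_hole_probs, trials=100_000, seed=42):
    """Estimate how often a hazard trajectory passes every defensive layer.

    layer_hole_probs: assumed probability that each barrier ("slice") has a
    hole at the moment the hazard arrives; the values are illustrative only.
    """
    rng = random.Random(seed)
    breaches = 0
    for _ in range(trials):
        # The hazard becomes an accident only if every layer happens to
        # have a hole at the same moment (the holes "line up").
        if all(rng.random() < p for p in layer_hole_probs):
            breaches += 1
    return breaches / trials

# Three hypothetical barriers: an engineered guard, a human check, and a procedure.
layers = [0.05, 0.10, 0.20]
print(f"Estimated accident probability: {accident_probability(layers):.5f}")
# With independent layers the estimate sits near 0.05 * 0.10 * 0.20 = 0.001,
# illustrating why a hole in any single layer rarely produces a bad outcome.
```

In this toy picture, a latent condition can be read as anything that quietly raises one of the layer probabilities long before the triggering event arrives.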
Below, the systems models of Normal Accident Theory (NAT), as devised by Perrow, and High Reliability Organizations (HROs), as developed by researchers at Berkeley and refined by researchers at the University of Michigan, are examined as they relate to RE. The Functional Resonance Analysis Method (FRAM), which could be classified as a systems approach, as developed by Hollnagel, is discussed in the section on the background of RE. All three are important to the understanding of Resilience Engineering. 2.3.1.1.3.1 Normal Accident Theory In 1984 sociologist Charles Perrow published a book titled "Normal Accidents: Living with High-Risk Technologies" in which he examined several high-risk technologies along with the corresponding industries and enterprises that house the technology. He examined systems, which he viewed as "organizations, and the organization of organizations," and the technology these organizations used. These included nuclear power plants, petrochemical plants, air and sea travel, and genetic engineering, among others. These risk-laden undertakings had a common denominator in that each could cause injury and death to an untold number of workers and innocent bystanders as well as to future generations. In fact, the text was inspired by a major accident in 1979 at the Three Mile Island Nuclear Generating Station (TMI) near Harrisburg, Pennsylvania that released radioactivity due to a partial core meltdown. By analyzing this and other catastrophes related to complex technologies and organizations, Perrow came to the conclusion that further disasters were inevitable. This seemingly inevitable and repeating collision course with disaster and catastrophe was labeled as "normal" by Perrow (Perrow 1984) – thus the "Normal Accident Theory" (NAT) was born. The idea behind the NAT is elegantly simple. NAT focuses on the elements of design, equipment, procedures, operators, supplies and materials, and environment, abbreviated as DEPOSE. Perrow abstracted the common elements of several catastrophes that were based in high-technology industries (Perrow 1984). In a nutshell his argument is as follows: something, such as a plane, a factory, or a university, has a lot of interacting components (commonly parts, procedures, and operators), and two or more of them fail in a way unforeseen by the designers of the "thing." Perrow (1984) calls this the "interactive complexity" of the system. In addition, the interaction of the two parts is not obvious while the accident is occurring, and perhaps not till years afterward, if ever. If the system has a lot of "slack" between the interacting components, time to react to the accident, and other resources, the accident may not spread or become dangerous and the system will not destabilize. In "tightly coupled" systems the interactions are swift and have major impacts on one another. Perrow defines an accident as involving some damage to people, objects, or to both. In Normal Accident Theory (NAT) the degree of disturbance is crucial to how we define an accident. In Perrow's thinking, there are degrees of disturbance to a system, and these can help to define what we really mean when one states that an accident occurred. In some respects, Perrow posits, a system is what we make it, and the system definition and boundaries are defined by one's own self-interest and task. To be consistent, Perrow proposes a scheme that can be used across different system boundaries and in different industries.
In this scheme the system is divided into four levels consisting of parts, unit or a collection of parts, a subsystem or an array of units, and finally, a collection of subsystems which together is termed the project. It is then ranked according to the level on which the disruption occurs. Perrow’s (1984) take on humans in the system is worth noting. He considers humans as “mere parts” while admitting that this characterization sounds “heartless.” However, he notes that the focus of his work in high-risk industries is on the systems level and is to ultimately protect humans, “it is the character of the systems that cause that damage.” He is concerned with stopping the catastrophes that have the potential to kill hundreds or thousands, not the individual. Perrow divides interactions into two types, linear and complex. Linear interactions are the most common form of interactiveness in our daily lives and in business, are simple, and comprehensible to people involved in the system. They are formally defined as the “interactions of one component in the DEPOSE system …with one or more components that precede or follow it immediately in the sequence of production.” Complex interactions “are those of unfamiliar sequences, or unplanned and unexpected sequences, and either not visible or not immediately comprehensible” (Perrow 1984). Perrow (1984) contends that complex systems can be universities, research and development firms, and some government bureaucracies; not only those kinds of high-risk undertakings such as nuclear plants. Perrow posits that complexity exists because in most systems designers do not know how to make a production system linear, and thus create “expected sequences.” He also adds that complexity is not intrinsically undesirable as it is welcome in some bureaucracies. 31 Perrow also distinguishes between “loose” and “tight” coupling in systems. Coupling is a word adapted from the engineering field that describes how two things are attached. “Tight coupling” means that “there is no slack or buffer or give between two items” (Perrow 1984). In the “loose coupling” condition slack or buffer exists between two items. Perrow (1984) discusses four main characteristics of coupling in systems. First, tightly coupled systems have more time-dependent processes than those that are loosely coupled. One reason for this is that the production process may not allow for waiting. A second characteristic of coupling is that tightly coupled systems are invariant. This means that there is only one way to make the product that A must precede B (Perrow 1984). A third characteristic is that in tightly coupled systems the “overall design of the process allows only one way to reach the production goal” (Perrow 1984). This is in addition to the invariance described above. Finally, the fourth characteristic is that tightly coupled systems have little slack in terms of time, resources, and equipment. In many instances there are no substitutes available for resources or equipment in a tightly coupled operation. Tightly coupled systems respond quickly to perturbations and the results may be disastrous (Perrow 1984). The reader is referred to Perrow (1984) for further analysis and insight on systems coupling. 2.3.1.1.3.2 High Reliability Organizations (HROs) Some researchers disputed the underlying assumptions of the NAT. There has been and there is a healthy debate in the literature about the merits of the NAT as well as its shortcomings. 
The reader is directed to Sagan (1993) for a full discussion of the NAT versus HROs in the nuclear industry. The basic argument of opponents to the NAT is that there are some industries that are tightly coupled and interactively complex but have excellent safety records. One important branch of this opposing viewpoint is termed "High Reliability Organizations" (HROs). As described by Dr. Karlene Roberts (1993), in 1984 a team of interdisciplinary researchers at the University of California, Berkeley began studying the Federal Aviation Administration's Air Traffic Control System, the U.S. Navy's nuclear powered aircraft carriers, and Pacific Gas and Electric Company's nuclear power plant at Diablo Canyon. Each of these three organizations operates complex and potentially hazardous technologies that have the potential to unleash catastrophe in the event of operational error(s). These organizations, and other complex and tightly coupled ones like them, "somehow seem to avoid the unavoidable" (Boin and Schulman 2008). These three organizations were chosen because of their outstanding safety records in the face of daily danger (i.e., they are reliable). In addition to traditional research methods such as collecting archival data, the team spent considerable time with the organizations in the field and in workshops to understand their inner workings. Unlike traditional academic pursuits, the team strove to keep preconceived notions about the organizations to a minimum as a research strategy for theory building. In doing so they felt that this uninformed approach helped to strengthen the trust between the research team and the operators, given that the systems are difficult to understand by outsiders and that the different operators did not necessarily have a "big picture" of their own organization. In time these and similar organizations became known as High Reliability Organizations (HROs) based on the research and theories developed by the team. HROs can be defined as "organizations which have fewer than normal accidents." Boin and Schulman (2008) define HROs as "those organizations that had successfully avoided such failure while providing operational capabilities under a full range of environmental conditions." Boin and Schulman (2008) elegantly summarize the essence of what sets HROs apart from non-HRO organizations: "What makes HROs special is that they do not treat reliability as a probabilistic property that can be traded at the margins for other organizational values such as efficiency or market competitiveness. An HRO has identified a specific set of events that must be deterministically precluded; they must simply never happen. They must be prevented not by technological design alone, but by organizational strategy and management." The underlying theories of HROs are termed High Reliability Theory (HRT). From the initial three studies and others over time the Berkeley group found two main themes. First, they found that HROs react quickly to any threat to safety. Boin and Schulman (2008) report that HROs are extremely sensitive to safety threats, that an HRO "immediately 'reorders' and reorganizes to deal with that threat," and that "Safety is the chief value against which all decisions, practices, incentives, and ideas are assessed — and remains so under all circumstances." Across the board HROs acted in similar ways to value safety above everything else.
In general HROs had the following features (Boin and Schulman 2008):
• High technical competence throughout the organization
• A constant, widespread search for improvement across many dimensions of reliability
• A careful analysis of core events that must be precluded from happening
• An analyzed set of "precursor" conditions that would lead to a precluded event, as well as a clear demarcation between these and conditions that lie outside prior analysis
• An elaborate and evolving set of procedures and practices, closely linked to ongoing analysis, which are directed toward avoiding precursor conditions
• A formal structure of roles, responsibilities, and reporting relationships that can be transformed under conditions of emergency or stress into a decentralized, team-based approach to problem solving
• A "culture of reliability" that distributes and instills the values of care and caution, respect for procedures, attentiveness, and individual responsibility for the promotion of safety among members throughout the organization.
Roberts (1993) identified four factors that contribute to risk mitigation in HROs. They are:
1. Command by exception or negation: this refers to upper management "pushing" authority to closely monitored subordinates. Decision making is done by (it "migrates" to) the person(s) with the most expertise and can migrate in any direction (i.e., up, down, or laterally).
2. Redundancy in people and technology
3. Procedures and rules as a means to prevent errors
4. The ability of management to "see the big picture" to capture the various migrating decisions and integrate them within the organization.
Researchers at the University of Michigan, primarily Karl Weick and Kathleen Sutcliffe, built on Weick's notions of "sensemaking" in organizations to further clarify and define HROs and the concept of "mindfulness." In organizations (Roberts 2003), "sensemaking" seeks to answer the questions "How does something come to be an event for organizational members?" and "What does an event mean?" It is defined as "…the ongoing, retrospective development of plausible images that rationalize what people are doing" (Weick et al. 2005). "The basic idea of sensemaking is that reality is an ongoing accomplishment that emerges from efforts to create order and make retrospective sense of what occurs…Sensemaking emphasizes that people try to make things rationally accountable to themselves and others" (Weick 1996). Roberts (1993) clarifies and defines sensemaking as "…the importance of various people in the organization correctly perceiving the events before them and artfully tying them together to produce a "big picture" that includes processes through which error is avoided. A representation of the knowledge available might be a Venn diagram or a hologram in which no one has the whole story but different individuals have important parts of the story that then are tied together to represent the whole." Weick and his research team built on the work of the Berkeley HRT researchers, their previous work on sensemaking, the concept of "collective mindfulness," and extensive field studies to develop a popular model of HROs. Weick et al.
(1999) hold that the distinctive nature of HROs lies in how "…diverse but stable cognitive processes interrelate in the service of the discovery and correction of errors." Traditional organizational theory and accident prevention focus on decision-making, while HROs are "…more about inquiry and interpretation grounded in capabilities for action," along with "…a persistent mindset that admits the possibility that any "familiar" event is known imperfectly and is capable of novelty. This ongoing wariness is expressed in active, continuous revisiting and revision of assumptions, rather than in hesitant action" (Weick et al. 1999). Five different cognitive processes are isolated that HROs focus on to achieve the state of "collective mindfulness." These are preoccupation with failure, reluctance to simplify interpretations, sensitivity to operations, commitment to resilience, and underspecification of structures. The first three items deal with anticipation of potential dangers while the latter two are concerned with containment and mitigation of hazards after an incident. As illustrated in Figure 2.4, the coalescence of the five processes results in a collective organizational mindfulness that recognizes and manages unexpected events and "produces" a reliable organization. The reader is directed to Weick et al. (1999) for details of the elements described in Figure 2.4. Figure 2.4: HRO Model that incorporates the Collective Mindfulness Process (adapted from Weick et al. 1999) The preceding sections briefly described popular models and approaches that serve as points of convergence and divergence for the RE paradigm. However, they do not entirely show the bright path that led researchers to believe that there is a superior way, namely via the RE paradigm, to represent today's complex, non-linear, and sociotechnical organizations and corresponding accidents. Borys et al.'s (2009) description of "The Five Ages of Safety" illuminates that path and serves as further background of RE. 2.4 The Five Ages of Safety Hale and Hovden (1998) classified the three ages of industrial safety. Borys et al. (2009) build on work by Hale and Hovden (1998) to classify industrial safety into five "ages" as numbered below. The first three are attributed to Hale and Hovden; the remaining two are established by Borys et al. (2009). This chronology leads directly to the need for the RE approach and allows for the background discussion of the topics of culture and climate, and human factors and human error, that are utilized in RE. The Five Ages are:
1. The Technology Age: lasting from the nineteenth century until after World War II
2. The Human Factors Age (starting around 1919)
3. The Safety Management Age (starting around the late 1980s)
4. The Integration Age
5. The Adaptive Age
2.4.1 The First Age of Safety - The Technology Age The first age of safety that Hale and Hovden (1998) identify coincides roughly with the occurrence of the industrial revolution, from the nineteenth century until just after WW II. In this era the emphasis was on "…the technical measures to guard machinery, stop explosions and prevent structures collapsing," and the attitude of factory inspectors at that time (the late 1800s) was that other causes of accidents could not be "reasonably prevented," meaning that things like personal behavior and management influence with regard to culture and individual shortcomings (e.g., accident proneness) were beyond the influence of safety inspectors.
As the industrial revolution progressed, the worker moved from a position such as craftsman (or blacksmith, or other guild-type occupation), where they controlled their own pace of work and were responsible for their own safety, to a management-controlled (usually by the shop foreman) manufacturing scenario where the emphasis was on little else than producing low unit-cost items that justified the high fixed-cost capital equipment of the day. Aldrich (1997) quotes a 1910 New York State Compensation Commission report that stated "Previous to the introduction of machinery into modern industry industrial accidents were relatively few and unimportant." Aldrich notes that while in the early twentieth century industrial technology reduced the overall number of workers needed in a particular trade, and thus reduced the exposure of workers to unsafe conditions, safety eroded within industries because of the increased pace and unfamiliarity brought on by the technology as well as increasingly complex organizational structures formed by the giant corporations that did not know how to exercise sufficient control to ensure worker safety. The emphasis, Aldrich notes, was almost exclusively on production. Aldrich (1997) reports that production responsibilities of the nineteenth century were strictly placed on the shop foreman. Correspondingly, safety, or the lack of safety, was also left to the discretion of the foreman. This resulted in an uncertain and dangerous situation for workers and led to "…high labor turnover, poor morale, and …occasional…mass strikes." As the twentieth century dawned and companies became economically impacted by the public policy pressures of regulation, poor public image, and higher costs due to accidents, companies turned more to systematic management techniques and the emergence of safety departments. Aldrich (1997) alludes to the safety culture and climate of the late nineteenth and early twentieth century as harsh and mainly relying on the whims of the shop foreman. Furthermore, he states that "Dangerous practices were part of the craftsman's code; they were traditional, and they reflected his manliness." He further notes that some safety devices were not used simply because they were new. Employees were basically on their own with regard to what work procedures to follow and what personal protective equipment to use. However, as manufacturing schemes became more sophisticated and complex, and activities became tightly coupled, it was difficult for employees to coordinate with one another with respect to safety. Aldrich (1997) posits that in the First Age safety initiatives came from management and not from labor. A prime example of this top-down realization that safety was the responsibility of management was the "Safety First" movement started at United States Steel in 1906 (Aldrich 1997). This was an active attempt to get the working man involved in safety and demarcated the beginning of formal safety programs in industry. By the early 1920s the Safety First movement was gaining ground in many industries but lagged in others due to a lack of top management "buy-in" (Aldrich 1997). To garner executive support, safety proponents of the time sought to establish a "…correlation between safety and production." Herbert Hoover chaired a committee on waste in industry that studied this link; its findings were published under the title "Waste in Industry" in 1921.
The book discussed construction waste in a chapter titled "The Building Industry." The study found the chief sources of waste in the building industry to be irregular employment, inefficient management, and wasteful labor regulations. Accidents are listed as a secondary cause of waste and are estimated to "…involve losses up to 10% of the labor cost in addition to the human loss of lives and energy," with the average loss at about 2.25% of labor costs. Accident costs per year totaled around an estimated economic loss of $120,000,000 (in 1920s dollars), taking into account work stoppages, labor replacement, and extended loss of the crew. The report attributes the cause of accidents mainly to carelessness of the workman and lack of ordinary safeguards. The report estimated that overall 12,000,000 labor days could be saved by implementing safety measures. However, the savings in dollars and labor days was based on expert opinion and not scientific studies. Nonetheless, it was a longstanding belief of safety engineers in the early twentieth century that production and safety were inextricably linked and that injuries were symptoms of inefficiency (Aldrich 1997). 2.4.2 The Second Age of Safety - The Human Factors Age Research that defined the second age of safety was conducted around the period between WW I and WW II. Studies revolved around the human component of work, guided by the theory of accident proneness and accident prevention by better "…personnel selection, training and motivation…" (Hale and Hovden 1998). Hale and Hovden (1998) note that the technical and human-based studies operated on separate paths until the 1960s and 1970s when advances in probabilistic risk assessment and ergonomics came into vogue and the two paths, namely the technical and the human factors, merged to create a deeper understanding of the accident phenomenon. The International Ergonomics Association (IEA) defines human factors (n.b., the terms "human factors" and "ergonomics" are used interchangeably although "human factors" appears to be the preferred term in the U.S.) as "…the scientific discipline concerned with the understanding of the interactions among humans and other elements of a system, and the profession that applies theoretical principles, data and methods to design in order to optimize human wellbeing and overall system performance" (IEA 2011). Human Factors is a systems-oriented approach and can be applied to many areas of human activity. Human Factors mainly consists of three distinct areas of inquiry: physical, cognitive, and organizational. "Physical ergonomics is concerned with human anatomical, anthropometric, physiological and biomechanical characteristics as they relate to physical activity." "Cognitive ergonomics is concerned with mental processes, such as perception, memory, reasoning, and motor response, as they affect interactions among humans and other elements of a system." "Organizational ergonomics is concerned with the optimization of sociotechnical systems, including their organizational structures, policies, and processes" (IEA 2011). Just as in the technical age, the focus of the human factors age concentrated on isolating an element of the system and trying to correct or to eliminate its deficiency or hazard to the system. Thus a major thrust of the human factors efforts focused on the elimination of human error on both the system and individual levels. 2.4.2.1 Human Error Aldrich (1997) stresses that safety vastly improved in the First Age.
This was largely due to the safeguarding of equipment and the realization by management that proactive safety management was vital to the financial bottom-line and to avoid labor unrest as well as negative public scrutiny. However, accidents continued to occur and researchers felt the models and methods of the First Age did not adequately explain the continued problems with safety in a rapidly changing technological industrial world. To explain and understand the problems of the mid twentieth -century safety researchers and practitioners turned their attention to the human in the system. On the surface it would seem that defining what is meant by the term “human error” would be a simple task. Hollnagel (2007) gives a simple and commonly cited definition of human error as follows: “…an incorrectly performed human action, particularly in cases where the term is used to denote the cause of an unwanted outcome.” Hollnagel’s colleague Dr. David Woods writes that “Human error is a very elusive concept” (Woods et al 2010). The initial difficulty when discussing human error comes in defining the term across various academic disciplines and among numerous industry practitioners, regulators, and investigators. Hollnagel (2007) discusses some of these differences: From the human factors perspective “…the human operator is viewed as a system component for which successes and failures can be described in much the same way as for equipment”…”In behavioral science…the starting assumption is that human behavior is essentially purposive and that it therefore can be fully understood only by reference to subjective goals and intentions… in social science, the origins of failure are usually 42 ascribed to features of the prevailing socio-technical system so that management style and organizational structure are seen as the mediating variables influencing error rates.” James Reason (2000) approaches the notion of human error from two perspectives. The first is from the “person” approach. Here the errors of individuals with regard to such things as moral weakness and forgetfulness are emphasized. The second (and preferred) is a “systems” approach that focuses on the “… conditions under which individuals work and tries to build defenses to avert errors or mitigate their effects.” Woods et al. (2010) describes common outlooks of human error as two mutually exclusive worlds “colliding.” The first world is populated by “…erratic people who degrade an otherwise safe system,” and safety is created by protecting the system from “unreliable” people. The other (preferred) world consists of “…people who create safety at all levels of the socio-technical system by learning and adapting to information about how we can all contribute to success and failure.” In other words, the latter world helps workers cope with complexity to be safe. Hollnagel (2007) stresses that it is important to distinguish between process and product when defining and discussing human error. On the product side, it is relatively easy to determine if an error was made. For instance, a finger was severed when ripping a piece of lumber. The product here, the loss of a finger is easy to determine. However, the process leading up to that loss, picking up the lumber, placing it on the rip fence, and guiding it through the saw, may have been repeated hundreds of times without an accident. The point here is that identifying a deficient process is not a “go – no go” kind of observation. 
Hollnagel posits that "…whether a process is right or wrong is normally a matter of degree rather than of absolutes." There is no generally agreed upon definition of "human error" in the literature. Some definitions focus on the product, that is, the unwanted event that occurred, some focus on the process, and some on both (Hollnagel 2007). Hollnagel notes that the term "human error" is generally used in practice in three very different ways. In the first, "human error" denotes the cause of something. For example, Saurin, Formoso, and Cambraia (2004) report that human error is attributed as the cause of anywhere from around 30 to 96 percent of all construction accidents. The second meaning focuses on the "…action or process itself, whereas the outcome or the consequence is not considered." For example, a crane operator might forget to check a load chart for a lift. This may or may not result in an unintended outcome, depending on the item lifted. Finally, "human error" is sometimes used to denote the outcome of an action. Woods et al. (2010) defines (or more clearly approaches) human error in the following statement: "the label "human error" is a judgment made in hindsight. After the outcome is clear, any attribution of error is a social and psychological judgment process, not a narrow, purely technical, or objective analysis." This definition conveys a trend by many safety researchers (Dekker 2005, Hollnagel 2004, Perrow 1984) to eliminate the notion of "human error" altogether. They claim it is a false construct built upon false logic and on thinking of humans in mechanistic terms. The manner in which humans reason when investigating an accident is cited as one reason the idea of human error is incorrect. The Law of Causality, that every cause has an effect, is twisted into its reverse when investigators reason from effect to cause. This also assumes that a cause exists and can be found. In complex systems there may be multiple causes or the cause may never be pinpointed. Furthermore, Dekker (2005) points out that after the fact investigators act as if they are able to move backward in time (the "tunnel" as he calls it) and differentiate all of the contextual elements and complex details that the person based their actions on. In other words, the assumption is that the human acted in a rational manner, in the economic sense of the word, and made the optimal decisions while the accident unfolded. In reality, humans make the best possible decisions, which may or may not be optimal given the situation, in the course of "normal" work (Perrow 1984, Simon 1997). Additionally, due to limited resources of time and money, accident investigators typically stop an investigation when the first cause is found. This is known as the "stop rule." Because humans are found at all levels of an accident it is not difficult to find a human error to cast the blame upon and stop the investigation there. Finally, some authors point out that blaming the human in the system, and only the human, takes the pressure off of companies to look further into the deficiencies of the system. For instance, there may be massive retooling necessary to provide a safe working environment. However, it is easier and more cost effective to blame a human error than to make corrections. Another rejection of the term "human error" is that it is borne out of the technical age and is a misnomer. In the technical realm a hazardous component can be identified by risk analysis methods such as an event or fault tree.
Humans, the naysayers state, are not analogous to machines. The view that a human is rational and machine-like was reinforced in early cognitive work that compared human decision-making capabilities to those of the computer. Here the human was treated as an information processing system (IPS) with clearly defined mental processes similar to those of a computer (Hollnagel and Woods 2005). Essentially, this view treated the human in the system in the same manner as a mechanical component. In this view, the human decision-making process and actions should match that of the machine or environment in which they are acting. This thinking is along the lines of the rational decision maker in economics, who has the mental capabilities and information resources to vet every option to make an optimal choice or course of action. In reality, resources are limited and are context-dependent (Hollnagel and Woods 2005). Humans make the best decisions given the available resources (e.g., time and money), and every decision is context dependent. Simon (1997) called this "satisficing." In general, much of the research in safety in the second age assumed that everything always went right and that if a disruption occurred it was due to a malfunctioning or maladaptive component. It did not matter if the component was mechanical or human; either one could be replaced (Hollnagel and Woods 2005). While this worked well for the mechanical side, it neglected that human variability, resourcefulness, and flexibility were key components that made systems successful (Hollnagel and Woods 2005). The Resilience Engineering community does not categorically rule out the possibility of human error occurring. However, on an individual basis human error should only be designated when the following three conditions are present: "…a clearly specified performance standard or criterion against which a deviant response can be measured," "an event or an action that results in a measurable performance shortfall such that the expected level of performance is not met by the acting agent," and finally, "there must be a degree of volition such that the person had the opportunity to act in a way that would not be considered erroneous" (Hollnagel 2007). The Resilience Engineering community would like to see the term "human error" disappear unless the preceding three conditions are met. It is a judgment made in hindsight, as Woods points out. Furthermore, it hinders communication given that there is no common understanding or definition of the term, it is a problem for measurements and statistics because we do not know exactly what we are counting, and finally, it is a hindrance for learning more about the accident given that the search for a cause is abandoned once the erring human is identified (Hollnagel 2007). 2.4.3 Third Age of Safety – Management Systems and Culture Even though the second age of safety research brought about a clearer understanding of safety and the interaction among technology, individuals, and the organizational system, there was still dissatisfaction among safety researchers with the methods developed to assess and investigate accidents. Hale and Hovden (1998) point to the spectacular disasters of the mid 1980s, such as the explosion of the Challenger space shuttle, the meltdown of the Chernobyl nuclear reactor, and the chemical release in Bhopal, India, as heralding the age of management as a focus of safety research.
While these types of accidents had occurred previously, the focus had been on technological and human factors and not on structure, i.e., management factors (Hale and Hovden 1998). The predominant view of the organizations in which these accidents occurred (i.e., NASA and large multi-national and government run agencies) was that these "…well-developed, often highly-bureaucratic, safety systems…had been thought, until then, to be safe-proofed against such major disasters" (Hale and Hovden 1998). That management had influence in the safety process was not a new idea in the 1980s; Heinrich et al. (1980), and later Bird (Heinrich 1980), among others, recognized that management played an important role in worker and organizational safety, but their theories had little or no scientific basis and were characterized by Hale and Hovden as little more than "…accumulated common sense and as general management principles applied to the specific field of safety." The idea that management could influence safety led to the idea that a certain climate and culture could be developed and fostered in the organization and among the organizational associates. The two notions are discussed below. 2.4.3.1 Culture and Climate In the preface of his 2010 book "Organizational Culture and Leadership," Edgar Schein, a leading researcher in the study of organizational culture, discusses his frustration with how complicated the research of culture has become and how he sometimes feels overwhelmed by the amount of research and consulting in the burgeoning field. Guldenmund (2000) observes that "Organisational culture and climate are complex concepts" and cites several authors regarding the elusiveness of finding consensus definitions and categories for the concepts of culture and climate. This occurs despite the fact that the concept of culture is over 100 years old (Schein 1999). Schein (2010) defines the culture of a group "…as a pattern of shared basic assumptions learned by a group as it solved its problems of external adaptation and internal integration, which has worked well enough to be considered valid and, therefore, to be taught to new members as the correct way to perceive, think, and feel in relation to those problems." Here culture is viewed as a product of social learning. Notwithstanding the difficulties described above, the phenomenon of culture is pervasive; everyone is involved in several cultures and sub-cultures throughout their lifetime, and it affects how business is conducted. Schein (1999) states that "Culture matters because it is a powerful, latent, and often unconscious set of forces that determine both our individual and collective behavior, ways of perceiving, thought patterns, and values. Organizational culture in particular matters because cultural elements determine strategy, goals, and modes of operating. The values and thought patterns of leaders and senior managers are partially determined by their cultural backgrounds and their shared experience." Schein (2010) identifies four cultures. The macroculture consists of nations, ethnic and religious groups, and occupations that exist globally. Organizational cultures include private, public, nonprofit, and government organizations (n.b., corporate cultures are a subset of organizational cultures). Subcultures are occupational groups within organizations.
Finally, microcultures are microsystems within or outside organizations, such as "small coherent units within organizations, such as surgical teams or task forces that cut across occupational groups…and are different from occupational subcultures." Being aware of the spectrum of cultures is essential to understanding a particular culture because they are all interconnected. Schein (1999, 2010) warns not to oversimplify the concept of culture by reducing its definition to trite sayings such as it is "the way we do things around here," or "the company climate." Although somewhat valid, such sayings merely reflect manifestations of the culture. To understand culture we must realize that it exists at several levels of the organization and that only by digging deeper will one begin to understand the prevailing cultural outlook. These levels consist of the things we can easily see (e.g., artifacts such as architecture and interpersonal relationships), the espoused values (e.g., literature describing how "safety comes first"), and finally, level three, the basic underlying assumptions that drive the culture. Here, Schein posits, lies the "ultimate source of values and actions" of the firm, which are embedded in the "unconscious, taken-for-granted beliefs, perceptions, thoughts, and feelings" of the employees. To understand the artifacts and espoused values, Schein feels that one must first understand the intricacies of level three; only then will the artifacts and espoused values make sense. Corporate culture is stable and resistant to change because it provides meaning and predictability to daily life. However, he believes that cultures can be transformed if needed. The concept of "climate" in relation to "culture" is also mentioned in the safety literature quite frequently and, just as exposed above, "climate" is as difficult as "culture" to define. Kuenzi and Schminke (2009) adopt Schneider and Reichers' (1983) definition of organizational work climates as "a set of shared perceptions regarding the policies, practices, and procedures that an organization rewards, supports, and expects." They further posit that organizational climate is a property of the unit but lies in individual perceptions. Kuenzi and Schminke (2009) also discuss the similarities and differences between culture and climate. Culture and climate both explore how individuals make sense of their environments and both involve a shared experience. 2.4.3.2 Safety Culture Guldenmund (2007) reports that the term "safety culture" came into use around 1986 and that there are multiple meanings and no universal definition of the phrase "safety culture," as would be expected from the general discussion of culture and climate above. Furthermore, Guldenmund reports that there is no consensus on the dimensions that make up a safety culture, and that they include "commitment by management and workforce, leadership style and communication, individual responsibility, management responsibility, risk awareness and risk-taking" (citing McConnell 2004). Chenhall (2010) identifies some of the components of "safety culture" from various authors as "safety system" (Choudhry et al. 2007), "safety climate" (Choudhry et al. 2007), "safety management system" (Diaz-Cabrera et al. 2007), and "socio-technical system" (Grote & Künzler 2000).
Chenhall (2010) reports that safety culture indicators are classified as either formal or informal and, citing Rao (2007) “The formal norms in a safety culture are characterized as written organizational safety policies and procedures, such as OSHA regulations, whereas the informal norms are not documented.” From this Chenhall concludes that even if a culture has elements of a formal “safety culture” in place and it is lacking the informal portion then safety is 50 “…not likely part of the culture.” She also promotes Schein’s view that one must look beyond artifacts and espoused values to find the underlying culture. Manuele (2008), when discussing ANSI – Z10 also holds this view. Schein sums up his current thinking on the differences between culture and climate in a preface to a compilation of cultural studies thusly, “my advice to readers is to view both climate and culture as abstractions that lead them to taking a useful perspective toward human behavior in complex systems. It is the perspective that is important, not a particular research result or a broad generalization about how important climate or culture is to some practical phenomenon.” 2.4.4 Fourth Age of Safety – The Integration Age In a brief portion of their text “Human Safety and Risk Management” Glendon et al. (2006) build on Hale and Hovden’s (1999) work by suggesting a fourth age of safety. They ponder the then current (2006) state of safety and risk management and conclude that it might be called the “integration age.” They speculated that this age takes on “…some characteristics of HROs...” They base this view on the previous three ages by making an analogy to MacLean’s “triune brain theory” that suggests that the human brain is actually three brains in one and that each part developed according to evolutionary needs and was linked to and retained parts of the previous growth. In a similar way, views on safety progressed according to the needs of the worker and industry. They state “Characteristics of the successive ages of safety may well not supplant earlier ways of thinking and acting (i.e., the cultures) of previous eras; rather they are more likely to build on previous structures, so that the contemporary collage of safety philosophies and practices remains rooted in the technical era, but has suffused this with layers of human factors applications and management systems.” 51 Although they are short on examples of the integrationist approach, Glendon et al (2006) cite Havold (2005), and envision this period (Age) as taking a “safety orientation” outlook that unites safety climate and safety culture and includes the factors of safety rules, management commitment to safety, safety behavior, communication, work situation, job satisfaction, competence, management priorities and organizational risk, satisfaction with safety activities, reporting culture and supportive environment, and fatalism. 2.4.5 Fifth Age of Safety – The Adaptive Age Borys et al. (2009) introduce the possibility that a fifth age has emerged in safety, the ‘adaptive age’ that essentially presents the case for the existence of Resilience Engineering. 
They claim that the adaptive age "…transcends all other ages without discounting them, whilst introducing the concept of 'adaptation', the adaptive age goes beyond simply integrating the past." The adaptive age is meant as a means to "…take us beyond the contemporary ways of thinking about managing OHS that typically focus on OHS management systems (OHSMS), safety culture and safety rules." Like the other shifts in safety thinking, this outlook stems from the limitations of existing approaches to understand and assess existing systems. Borys et al. (2009) cite Robson et al.'s (2005) research on OHSMSs, which found that there is insufficient evidence in the peer-reviewed literature to suggest that they are either effective or ineffective. Borys et al. also cite several researchers who feel that OHSMSs are complex paperwork burdens that do not reflect conditions at the workface and are primarily rule-focused. In the view of Borys et al., OHSMSs (and other safety approaches) are not to be discounted but transcended by an adaptive culture (more specifically, the existence of social construction sub-cultures as discussed in the Age of Safety Management section) along with the concepts of collective mindfulness (as presented in the discussion of HROs) and a new perspective regarding safety as embodied in the RE approach, which is discussed in the following section. 2.5 Current Understanding of Resilience Engineering The notion of RE has evolved in other domains (e.g., aviation, nuclear industries) as a way to overcome the limitations of existing accident analysis and risk assessment models that are used to manage safety. It is a proposal to stop relying primarily on hindsight to explain the cause of accidents and to explore the sources of resilience that prevent and mitigate accidents. Many accident causation models are sequential, such as Heinrich's Domino Model and Reason's Swiss Cheese Model (Hollnagel 2004). The physical structure of devices such as fault trees and event trees (i.e., the structural view) promotes the idea that accident causation is linear. Additionally, regulatory standards are often just bolstered incrementally to cover the latest crack in the regulation as exposed by the latest accident. Safety is often managed by error tabulation and probabilities, for example, setting a goal to reduce falls by 33%. In contrast, RE is more focused on unearthing the positive quality of resilience rather than on managing by error counts such as the number of fatalities and injuries. RE cites research that explains that "Untoward events more often are due to an unfortunate combination of a number of conditions, than to the failure of a single function or component" (Woods and Hollnagel 2006). This outlook promotes the non-linear functional view, as opposed to linear structural views, that looks to the interdependencies among system components. In this outlook control is considered both on the activity level and on the system level, as previously described in the Functional Resonance Analysis Method. A more subtle point of the functional versus structural view is that many accidents rarely re-occur in the same fashion but are a confluence of seemingly unrelated events, each necessary but only jointly sufficient to create an accident. This view is in line with the notion that work is complex and that accidents and performance variability concerns cannot be confined to a single component but emerge from a confluence of demand-induced pressures (Hollnagel 2004).
From this perspective, developing foresight, or trying to imagine what might go wrong and developing strategies to defeat failure, is a more valuable skill than relying exclusively on hindsight when extrapolating accident investigations to accident assessment models. To avoid failure we must anticipate key aspects of the future to imagine what might go wrong. This notion, as conceived by Westrum, is embodied in the idea of “requisite imagination.” This prescient activity “…is a means for the designer to explore what can affect design outcomes in future contexts” (Adamski and Westrum, 2003). The use of this method by designers and front line workers can foretell routes to disaster, as well as to success. In light of the above discussion, safety management must then be reactive and proactive (Hollnagel, 2008), reactive to respond to those threats that have materialized or are imminent and proactive to bolster gaps in the organization’s safety or to compensate for gaps in the design. Resilience engineering recognizes that despite even outstanding planning efforts, performance conditions are always underspecified (Hollnagel 2008). The front-line worker must always make adjustments in the course of operations given the context of underspecification of operational conditions, changing environmental conditions and the intensity of demands. In other words, there will always be performance variability due to the need to respond to demands imposed on the system. Resilience engineering aims to dampen the variability that may contribute to adverse events and to amplify the variability that leads to positive outcomes (Hollnagel 2008). Hollnagel’s (2004) Efficiency-Thoroughness Trade-Off (ETTO) Principle captures an aspect of this notion. In attempts to optimize performance goals people work to be as thorough as they 54 can be (i.e., follow the rules) given the prevailing environmental conditions and circumstances. However, there is also pressure to be efficient. People and organizations that are not efficient may become, respectively, unemployed and unprofitable (or bankrupt). Also, those who are not sufficiently thorough possibly endanger safety and may also cease to exist economically. One reason thoroughness is sometimes shunted is that in the quest to optimize work processes people skip seemingly unnecessary steps in work tasks. In construction this is manifest in the phrase “We have always done it this way” when discussing field operations with, for example, a subcontractor, when in fact, field conditions may necessitate that operations be revised from what has always been done. The shortcut, (e.g,. always “doing it this way”) is the norm in work rather than the exception given that work environments are relatively stable places and that accidents are rare events (Hollnagel, 2004). People may take certain aspects of their work for granted and skip seemingly inefficient steps. This can sometimes lead to an accident. An example might be to neglect to “tie-off” a ladder to save time. Excuses can range from “We never have ladders slip” to “I was only going on the roof for a minute.” The event of an accident and its associated costs can wipe out efficiency gains from shortcuts. Closely allied with the ETTO Principle is the outlook that failure is the temporary inability to effectively cope with complexity under demanding conditions. A situation that RE addresses, and is closely allied to complexity, is the all too common production and efficiency tensions inherent in industrial work. 
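The ETTO principle can also be illustrated computationally, in the spirit of the hybrid simulation presented later in this work. The short Python sketch below is a minimal illustration only; the agent rule, the probability of a ladder slip, and the time costs are assumed values chosen for demonstration and are not drawn from Hollnagel or from any field data.

import random

# Illustrative parameters (assumptions for this sketch, not empirical values)
TIE_OFF_MINUTES = 5          # extra time spent being thorough
INCIDENT_MINUTES = 2400      # time lost if a shortcut ends in a ladder slip
SLIP_PROBABILITY = 0.002     # assumed chance that a shortcut leads to an incident

def etto_choice(time_pressure: float) -> bool:
    """Return True if the worker is thorough (ties off the ladder).

    Higher time pressure (0..1) makes the efficient shortcut more likely,
    mirroring the ETTO idea that people favor efficiency when pressed.
    """
    return random.random() > time_pressure

def simulate(tasks: int, time_pressure: float, seed: int = 1) -> dict:
    random.seed(seed)
    minutes_on_thoroughness = 0
    incidents = 0
    for _ in range(tasks):
        if etto_choice(time_pressure):
            minutes_on_thoroughness += TIE_OFF_MINUTES
        elif random.random() < SLIP_PROBABILITY:
            incidents += 1
    return {
        "thoroughness_minutes": minutes_on_thoroughness,
        "incidents": incidents,
        "incident_minutes": incidents * INCIDENT_MINUTES,
    }

if __name__ == "__main__":
    for pressure in (0.1, 0.5, 0.9):
        print(pressure, simulate(tasks=10_000, time_pressure=pressure))

Under the assumed values, runs at low time pressure spend more minutes on tie-offs but record few or no incidents, while runs at high pressure occasionally lose far more time to a single incident than the shortcuts ever saved, which is the trade-off the preceding paragraphs describe.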
Research has shown that workers implicitly choose production over safety concerns when a trade-off is available and therefore act in a riskier manner than they normally would. An example of RE in construction would involve knowing when to relax production pressures by, for example, reducing overtime hours, adding additional crews, 55 subcontracting extra work in critical periods, or simply knowing when to slow down production so that safety is not endangered. This work explores the RE constructs to better understand how disruptions affect construction projects. RE is a new perspective on safety for complex socio-technical systems. Traditionally, improvements in safety have been based on hindsight – and asked “what went wrong” in accident analysis and “what could go wrong” in risk assessment. For instance, after a major accident involving loss of lives, an incremental change (a tweak to the regulations or building codes) or a barrier is implemented (fall protection if working over 6’) after statistics show a trend of injuries or fatalities. This behavior can be thought of as hindsight thinking and is commonly associated with traditional views of safety. Hindsight thinking colors how we think about failure and safety. In the traditional view, failure is characterized as arising from a breakdown or malfunctioning of normal systems. Safety is defined as “freedom from unacceptable risk.” Both of these approaches require the analyst or investigator to think about how accidents happened and what went or could “go wrong.” In general – it is a reactive approach and the “negative” approach to understanding accidents. This approach has been successful in saving lives and preventing injury. However, we seem to have reached a plateau in the effectiveness of this way of doing business in construction safety as discussed in Chapter One. RE embraces the traditional approach but posits that it is only part of the picture – to get a better understanding of safety we need to ask “what can go right” for risk assessment and “what went right” for accident analysis. A proactive and “positive” approach is needed to gain a complete understanding of safety. RE proposes that we observe work as it is normally performed on a day-to-day basis and look at how humans in the system “make ends meet” in the face of under-specification of operations, constantly changing conditions, and unrelenting 56 demands and stressors placed on the system. We should observe what makes systems resilient, how to engineer resilience, and how to maintain and manage the resilience of a system. Resilience is a quality of the system. It can’t be counted – it is something that the systems does (“acts in a resilient manner”) rather than something a system has (it would be wrong to say a system has “10 units of resilience”). Therefore, managing resilience is a kind of process control. In the RE view, failures arise from adjustments made by people to cope with underspecification of a system or a process, and safety is defined as “the ability to succeed under varying conditions.” Defining safety this way includes the reactive and proactive approaches. The term “performance variability” is used to describe the ways in which individual and collective performances are adjusted to match current demands and resources, in order to ensure that things go right. Thus a key feature of a resilient system is its ability to adjust its performance. 
Adjustments can, in principle, be reactive, concurrent, and proactive and are described below (Hollnagel 2009):
• Reactive adjustments are the most common and happen in the aftermath of an event (i.e., "lessons learned" from a major change or disruption). This is an incomplete approach given that the adjustments made may not be suitable for the unique and uncertain events of the future.
• Concurrent adjustments are basically fast reactive adjustments that take place while the situation is developing.
• Proactive adjustments mean that the system can change from a state of normal operation to a state of heightened readiness, and possibly also act, before something happens.
2.5.1 Background and Definitions of Resilience Engineering
Although the term resilience is well known in various academic domains such as ecology and engineering, the term "Resilience Engineering" is relatively new as applied to safety, beginning to appear in the literature around the end of the last century (Woods et al. 2007). However, the elements that characterize resilience engineering have been brewing for many decades. Resilience engineering borrows from many different areas of organizational and safety research such as High Reliability Organizations (HROs), Normal Accident Theory (NAT), and other systems approaches. Researchers in the field are mainly involved in the areas of healthcare, nuclear power, aviation, and aerospace, among others that are characterized by complex, high-risk and high-visibility industries. Resilience Engineering draws inspiration from work by industrial psychologists, sociologists, anthropologists, and other safety theorists such as James Reason, Jens Rasmussen, Scott Sagan, Donald Norman, and Charles Perrow. It utilizes and respects the effective methods, models and techniques developed over the years by various industries and academic disciplines with the caveat that they "…must be looked at anew and therefore possibly used in a way that may differ from what has traditionally been the case" (Hollnagel, 2008). Resilience engineering is strongly influenced by the discipline of Cognitive Systems Engineering (CSE). CSE is a systems approach that was formulated in the early 1980s to study complex system failures, such as the release of radiation at the Three Mile Island nuclear power generating facility, and as a way to overcome the limitations of previous safety models (Hollnagel and Woods 2005). A cognitive system is defined as one that "…can modify its behavior on the basis of experience so as to achieve specific anti-entropic ends…they [cognitive systems] are able to maintain order in the face of disruptive influences…specifically…to control what it does" (Hollnagel and Woods 2005). CSE focuses on analyzing Joint Cognitive Systems (JCS). JCSs are a human-machine coagency in which humans and machines are described as "equal partners": humans are not described as if they are machines, nor are machines given human attributes; the analysis is on what the JCS does, not what it is, and on how performance is controlled (Hollnagel and Woods 2005). Machines are expressed as artifacts in CSE. An artifact is something devised for a specific purpose; for instance, a screwdriver is an artifact, and so is a corporation. On a construction project one can envision many JCSs, as well as many JCSs nested within others. The largest may be the JCS of the stakeholders and the organization.
With regard to resilience the emphasis is in how the JCS copes with surprise (unexpected events) and error (Woods and Hollnagel 2006). Finally, the CSE outlook considers all work as cognitive; everything we do requires our brain (Hollnagel and Woods 2005). This is especially true in the construction industry. A resilient system is able to maintain control when faced with disruptions in the form of unexpected events. A system is said to be in control if it is able to mitigate or eliminate unwanted internal or external variability with respect to the demands placed on the system, especially pressing time concerns such as schedule acceleration or increased tempo of projects (Hollnagel et al. 2006). The term “Resilience Engineering” was formally applied in a conference of safety scientists and researcher in 2004. Over the last several years the term has been differently understood by researchers and has been updated as the understanding and maturation of the scope focus of the 59 new discipline has emerged. A few of the definitions are provided below and are listed chronologically to illustrate the evolution and different understanding of this discipline. Some definitions include: • “The intrinsic ability of an organization (system) to maintain or regain a dynamically stable state, which allows it to continue operations after a major mishap and/or in the presence of a continuous stress" (Hollnagel et al 2006). • "How well a system can handle disruptions and variations that fall outside of the base mechanisms/model for being adaptive as defined in that system" (Woods and Hollangel 2006). • Westrum (Hollnagel et al. 2006) looks at resilience from three different vantage points: o Resilience is the ability to prevent something bad from happening, o Or the ability to prevent something bad from becoming worse, o Or the ability to recover from something bad once it has happened. • "Its ability effectively to adjust its functioning prior to or following changes and disturbances so that it can continue its functioning after a disruption or major mishap, and in the presence of continuous stresses" (Hollnagel et al 2008) • “RE is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions” (Hollnagel et al. 2011). The last definition is used for this work. It reflects a key feature of Resilience Engineering, which is for a resilient system to be able to adjust its performance reactively, concurrently, and proactively. In the time continuum of an accident this outlook seems crucial. 60 The use of the term ‘engineering’ is not meant in the traditional engineering sense of the word involving disciplines such as civil, mechanical, or electrical nor does it exclusively mean engineering controls. In the resilience engineering perspective engineering controls are necessary, but not sufficient to ensure safety. A better definition of engineering for resilience purposes is “…to arrange, manage, or carry through by skillful or artful contrivance.” Granted, this could just as well apply to engineering controls but the overwhelming emphasis of resilience engineering is sociotechnical; that is the focus is on people and how they interact with the artifacts (i.e. something devised to help people perform work; machines and/or organizations) found in the working world and on better control of systems. 
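The working definition adopted above, the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, can be restated schematically in code. The fragment below is only a sketch of that reading; the class name, the capacity and reserve figures, and the 80 percent trigger are hypothetical choices made for illustration and do not come from the cited definitions.

from dataclasses import dataclass

@dataclass
class SimpleResilientSystem:
    """Schematic restatement of the adopted RE definition (illustrative only)."""
    capacity: float = 10.0   # resources currently available to meet demands (assumed)
    reserve: float = 2.0     # buffer held back for surprises (assumed)

    def in_control(self, demand: float) -> bool:
        # "In control" is expressed relative to the demands placed on the system.
        return demand <= self.capacity

    def adjust(self, forecast_demand: float, actual_demand: float) -> None:
        # Prior to a disturbance: build capacity when the forecast approaches limits.
        if forecast_demand > 0.8 * self.capacity:
            self.capacity += self.reserve * 0.5
        # During or after a disturbance: respond when demand actually exceeds capacity.
        if actual_demand > self.capacity:
            self.capacity += self.reserve   # commit the buffer
            self.reserve = 0.0              # flag that reserves are exhausted

system = SimpleResilientSystem()
system.adjust(forecast_demand=9.0, actual_demand=12.0)
print(system.in_control(demand=12.0), system.reserve)

The point of the sketch is simply that "being in control" is defined relative to demands, and that adjustment can occur both before a disturbance (on a forecast) and during or after it (when demand actually exceeds capacity, at the cost of exhausted reserves).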
Although the term resilience is well known in various academic disciplines the moniker of ‘resilience engineering’ is relatively new, starting to appear in the literature around the end of the last century (Woods et al 2007). However, the elements that characterize resilience engineering have been brewing for many decades. Indeed, resilience engineering borrows from many different areas of organizational and safety research such as High Reliability Organizations and similar approaches. Researchers in the field are mainly involved in the areas of healthcare, nuclear power, aviation, and aerospace, among others characterized by complex, high-risk and high-visibility industries. Resilience engineering draws inspiration from work by industrial psychologists, sociologists, anthropologists, and other safety theorists such as James Reason, Scott Sagan, Donald Norman, and Charles Perrow. Resilience engineering uses and respects the effective methods and techniques developed over the years by various industries with the caveat that they “…must be looked at anew and therefore possibly used in a way that may differ from what has traditionally been the case.” (Hollnagel 2008). 61 Some have described resilience engineering as a paradigm shift in the ‘Kuhnian sense” (Woods and Hollangel 2006), referring to Thomas Kuhn’s well-known "The Structure of Scientific Revolutions" published in 1962. By proposing a new outlook and vocabulary for system safety it may well be, as Kuhn stated “A revolution occurs when a community changes its lexicon.” However, in the course of scientific (or any type of) discovery there are those who, early on, become embroiled in discussions concerning the use of a phrase such as ‘paradigm shift’ and enmeshed in conducting assessments of the validity of the researcher’s use of the term instead of seeing the possibilities in the new paradigm. The concept of Resilience Engineering is presented here in the mildest revelational terms possible, it is left to future generations to ponder if Resilience Engineering is truly a paradigm shift or is to be cast upon the high pile of existing safety models. As Kuhn stated “…if a new candidate for paradigm had to be judged from the start by hard-headed people who examined only relative problem-solving ability, the sciences would experience very few major revolutions." (Kuhn 1996). The primary goal of this paper is to introduce the reader to resilience engineering and explore the possibilities of applying it to construction industry safety. In general, RE is concerned with any sociotechnical aspect of production but automation is of special interest given its prominent use in industries such as aviation and nuclear power production. Woods and Hollnagel (2006 prologue) posit that automation grew out of the ‘error counting’ paradigm where the worker was merely considered an unreliable part of the system and a liability to safety. The system would seemingly perform better without the human mucking up the works. However, as with any man-made object, a human is involved somewhere in the system, either as a designer or to interact in some manner with the automation. The automation may only be as good as the designers’ or developers’ ability to anticipate what 62 disturbances may occur. Automation has been referred to a “team player” in the sociotechnical outlook because it improves the response to disturbances in the work setting and helps the system to adapt (Nemeth et al 2009). 
Paries (2010) explains that automation aims to reduce the uncertainty in the system by reducing variety, diversity, deviation, and instability. For instance, it can reduce fatigue, a kind of performance variability, in airline pilots by handling monotonous tasks. So, in one respect, automation can be seen as an aid to resilience in that it decreases performance variability by standardizing routine or complex functions (McDonald 2006). However, it also has the side effects of reducing autonomy, creativity, and reactivity. Additionally, it may lead to increased reliance on rules and less training than would normally be provided. Wiener coined the phrase "clumsy automation" to refer to the new demands with respect to communication and coordination that automation imposes on a flight crew but does not support well (Sarter et al. 1997). In the AEC sector there is currently no counterpart to the sophisticated autopilot automation found in the airline industry. There are pockets of advanced automation in areas of building safety and security. For instance, sprinkler and other fire-suppression systems employ a great deal of monitoring and reaction without constant supervisory control. The current popularity of green building has spawned an interest in indoor environmental systems that monitor and adjust variables related to building performance. Automation in the construction process extends to administrative practices. Drawing and code review have been automated so as to virtually eliminate paper documents and constant human supervision. Some sophisticated construction fabrication shops can also extend the automation function to interface with factory floor computer-aided manufacturing systems. A full discussion of automation is beyond the scope of this work.
2.5.2 The Four Premises of Resilience Engineering
"In agreement with Perrow's notion of Normal Accidents, the Resilience Engineering approach uses the understanding of normal performance as a premise to explain that accidents emerge from normal system performance, rather than resulting from technical, human or organisational failures" (Macchi and Hollnagel 2010). Correspondingly, Resilience Engineering is based on the following four premises (Hollnagel et al 2011):
1. Performance conditions are always underspecified. Individuals and organizations must therefore adjust what they do to match current demands and resources. Because resources and time are finite, such adjustments will inevitably be approximate.
2. Some adverse events can be attributed to a breakdown or malfunctioning of components and normal system functions, but others cannot. The latter can best be understood as the result of unexpected combinations of performance variability. This is illustrated via the Functional Resonance Analysis Method (FRAM); a brief numerical sketch of this premise follows the list.
3. Safety management cannot be based exclusively on hindsight, nor rely on error tabulation and the calculation of failure probabilities. Safety management must be proactive as well as reactive.
4. Safety and field operations management are inseparable and do not operate independently. No conflict or tension should exist between these functions. Safety must therefore be achieved by improvements to the operations (i.e., by engineering a better operations process) rather than by simply constraining operations (i.e., by barriers, more regulations, etc.).
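Premise 2, that some adverse outcomes emerge from unexpected combinations of everyday performance variability rather than from the failure of any single component, can be given a simple numerical illustration. The Monte Carlo sketch below is illustrative only; the five construction "functions," the size of their day-to-day variability, and the threshold for a day "going wrong" are invented assumptions, not parameters from FRAM or the cited sources.

import random

FUNCTIONS = ["hoisting", "rigging", "signaling", "scheduling", "housekeeping"]
THRESHOLD = 2.5   # combined deviation beyond which the day "goes wrong" (assumed)

def daily_variability() -> float:
    """Each function varies a little around 'normal' (mean 0); none of them fails."""
    return sum(random.gauss(0.0, 0.5) for _ in FUNCTIONS)

def simulate(days: int = 100_000, seed: int = 7) -> float:
    random.seed(seed)
    bad_days = sum(1 for _ in range(days) if daily_variability() > THRESHOLD)
    return bad_days / days

if __name__ == "__main__":
    # No single function ever "fails", yet a small share of days still exceeds
    # the threshold because ordinary variability occasionally lines up.
    print(f"share of days exceeding threshold: {simulate():.4%}")

Even though no individual function in the sketch ever fails, a small fraction of simulated days still exceeds the threshold because ordinary variability occasionally lines up; this is the emergent, combinatorial character of accidents that the premise describes.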
2.5.3 The Four Abilities of a Resilient Organization
In order to be resilient, an organization must possess the four basic abilities of response, anticipation, monitoring, and learning. The mix of the four cornerstones depends on the context of the analysis. Some situations may require more response capability while others may require more anticipation. However, an organization is not considered resilient, in the RE outlook, if all of the abilities are not present in some form. Resilience Engineering thus emphasizes function over structure and ability over capacity. The four abilities (Hollnagel et al. 2011) are explained below.
1. Knowing what to do, or being able to respond to regular and irregular variability, disturbances, and opportunities either by adjusting the way things are done or by activating ready-made responses. This is the capability to address the actual.
2. Knowing what to look for, or being able to monitor that which changes, or may change, so much in the near term that it will require a response. The monitoring must cover the system's own performance as well as changes in the environment. This is the capability to address the critical.
3. Knowing what to expect, or being able to anticipate developments, threats, and opportunities further into the future, such as potential disruptions or changing operating conditions. This is the capability to address the potential.
4. Knowing what has happened, or being able to learn the right lessons from the right experience, successes as well as failures. This is the capability to address the factual.
The analysis is at the organizational level given that an individual cannot reasonably be expected to possess all four abilities on a sustainable basis. The four abilities discussed above can be used to develop a profile of the resilience of an organization. The profile is created by asking "probing" questions about each of the individual resilience abilities and then plotting the results on a star-shaped grid. This is termed the "Resilience Analysis Grid," or RAG. The profile can be used to assess and also to bolster the resilience of an organization, and thus to better understand how disruptions affect safety and resilience.
2.5.3.1 Learning
Individual and organizational learning from experience is the cornerstone for dealing with the factual, or knowing what has happened. Learning should be a continuous activity and involve all levels of the firm, from the CEO to field labor. Because fatalities and injuries are relatively rare events, the opportunity to learn from them is limited. The resilient firm will take opportunities to learn from near misses, which occur more frequently than fatalities (Groeneweg 1998). This outlook assumes that an atmosphere of true collaboration exists and that associates are able to voice legitimate concerns without retribution from others. Hollnagel et al. (2011) think that resilience is about "how systems learn to modulate their adaptive capacities to continuously update their fitness relative to an environment of changing pressures and opportunities." In traditional safety thinking, learning has occurred from that which has gone wrong. This is reasonable given that knowledge of what has gone wrong in the past is essential to prepare for and eliminate the things that contributed to the untoward event so that it does not occur again. In the contrarian approach to safety, the Resilience Engineer asks "Is this the best way to learn?" Accidents are relatively infrequent events.
Additionally, accidents are usually different from one another, and the information learned may not be useful for a one-time event (Hollnagel 2010). The Resilience Engineering perspective is not to discard the learning that occurs post event, but to also learn from what goes right in the normal course of work. Studying that which goes right provides more opportunities for learning given that things go right more often than they go wrong. A basic principle of Resilience Engineering is that failures are the flip side of successes and that both have their origin in performance variability (Hollnagel et al 2006). Learning and responding are related. Response relies on the ability to learn. If the environment remained static and processes stayed constant, in other words if the world remained stable, there would be no need to continue to learn beyond one cycle of pre-defined responses. However, everything is ever-changing and dynamic, and responses must be continually updated. Learning occurs when the efficiency of the responses is observed and evaluated from time to time. Learning and monitoring are related. Learning helps the observer discover and evaluate the indicators for monitoring. The choice of which indicators to use and which to discard is vetted in learning for efficiency. Learning and anticipation are related. Learning helps the organization develop a new model of the future and of what adaptations may be needed (Hollnagel et al. 2011). Knowing what to learn is not an easy task and is an imprecise art. Pitfalls of traditional safety thought, the "negative" view, color investigations and assessments. Woods and Cook (2002) recommend looking for alternative explanations. Learning is also influenced by prevailing safety models and by making situations conform to that narrow view (Hollnagel et al. 2011). Hollnagel notes that it is important to look beneath the surface and gather evidence about how a system functions (born from the "normal" view as championed by Perrow) as well as direct causes, even though this may be a protracted and lengthy process. Incidents are a good source of learning material if the culture supports this approach. In general, for incidents to be effective for learning, a fair and just culture (e.g., a reporting culture) is helpful. Culture and the different occupations are also important to examine, as perceptions of what is safe and what is risky vary among organizations and occupations. The probing questions for learning are presented below in Table 2.1.
Analysis item (ability to learn) – Probing questions
Selection criteria – Is there a clear principle for which events are investigated and which are not (severity, value, etc.)? Is the selection made systematically or haphazardly? Does the selection depend on the conditions (time, resources)?
Learning basis – Does the organisation try to learn from what is common (successes, things that go right) as well as from what is rare (failures, things that go wrong)?
Data collection – Is there any formal training or organisational support for data collection, analysis and learning?
Classification – How are the events described? How are data collected and categorised? Does the categorisation depend on investigation outcomes?
Frequency – Is learning a continuous or discrete (event-driven) activity?
Resources – Are adequate resources allocated to investigation/analysis and to dissemination of results and learning? Is the allocation stable or is it made on an ad hoc basis?
Delay – What is the delay between reporting the event, analysis, and learning? How fast are the outcomes communicated inside and outside of the organisation?
Learning target – On which level does the learning take effect (individual, collective, organisational)? Is there someone responsible for compiling the experiences and making them 'learnable'?
Implementation – How are 'lessons learned' implemented? Through regulations, procedures, norms, training, instructions, redesign, reorganisation, etc.?
Verification/maintenance – Are there means in place to verify or confirm that the intended learning has taken place? Are there means in place to maintain what has been learned?
Table 2.1: Probing Questions About the Ability to Learn
2.5.3.2 Monitoring
The critical coping mechanism refers to the importance of monitoring and assessing the system so that surprises do not catch stakeholders off guard in the near term. A construction company may perform an honest assessment and realize that it is operating dangerously close to a safety breakdown and should relax production goals or garner other resources to alleviate the problem. Surprises are abundant in the construction industry; the trick is to not let them catch you off guard. Organizations need metrics to monitor safety. In traditional safety management the indicators of safety performance have been lagging measurements. For instance, fatalities, days from work (DFW), and near misses, to name but a few, are recorded and used as indicators of how safe or unsafe an organization is. These measurements are used because they are objective and relatively easy to gather and to quantify (Hollnagel et al. 2011). The reader is directed to Hollnagel et al. (2011) for a description of why this method of accounting is not useful in managing safety. In short, these types of measurements are outcomes, a product of a seemingly unsafe condition. In the Resilience Engineering perspective we are interested in the process of work and in understanding what workers do to "make ends meet" in the normal course of operations. Thus, the focus of RE efforts is on understanding the process (not the product) of normal work and on amplifying what "works" and dampening what does not. Wreathall (2009) states that indicators are crucial but underdeveloped at the present time. To reach the goal of being "proactive," data must be gathered from intermediate activities as well as the output in order to develop indicators. This will allow adaptation that may influence an untoward outcome. Indicators should also be developed for changes in the environment that may impact the system; examples include financial crunches or material shortages. Faint signals as indicators should not be ignored in a resilient system. Faint signals are hints of coming trouble in a system that, after the fact, may be recognized as early warnings. Wreathall (2009) proposes a guide to indicator selection. Preferred indicators (in order of preference) are summarized below:
• Objective: They are based on observable and non-manipulative sources.
• Quantitative: They are measurable and can identify when changes in performance occur.
• Available: They can be obtained from existing data.
• Simple to understand/represent worthy goals/possess face validity.
• Related to/compatible with other programs.
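One way to operationalize Wreathall's preference ordering is to score candidate indicators against the criteria just listed. The sketch below is a hypothetical illustration; the two candidate indicators, their attribute values, and the weighting scheme (earlier criteria weighted more heavily to reflect the stated order of preference) are assumptions made for this example and are not prescribed by Wreathall (2009).

from dataclasses import dataclass

# Criteria listed above, ordered by preference; the weights are an assumption.
CRITERIA = ["objective", "quantitative", "available", "simple", "compatible"]
WEIGHTS = {name: weight for weight, name in zip(range(len(CRITERIA), 0, -1), CRITERIA)}

@dataclass
class CandidateIndicator:
    name: str
    attributes: dict  # criterion name -> True/False

    def score(self) -> int:
        return sum(WEIGHTS[c] for c in CRITERIA if self.attributes.get(c, False))

candidates = [
    CandidateIndicator("near-miss reports per 1,000 work hours",
                       {"objective": True, "quantitative": True, "available": True,
                        "simple": True, "compatible": True}),
    CandidateIndicator("supervisor's impression of crew morale",
                       {"objective": False, "quantitative": False, "available": True,
                        "simple": True, "compatible": False}),
]

for c in sorted(candidates, key=lambda c: c.score(), reverse=True):
    print(f"{c.score():>2}  {c.name}")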
Indicators are called leading if they can provide information that forestalls an unwanted outcome such as an accident or financial damage/ruin. Indicators are called lagging if the measure is a reflection of the output and nothing can be done to change the outcome; that is, it is a past performance measure. Wreathall (2009) warns that the levels and time-scales of the system should be considered when labeling an indicator as either leading or lagging. For instance, the lagging indicator of short-term staffing turnover may indicate fatigue in certain situations but may also be a leading indicator for systematic change. The probing questions for monitoring are presented in Table 2.2.
Analysis item (ability to monitor) – Probing questions
Indicator list – How have the indicators been defined? (By analysis, by tradition, by industry consensus, by the regulator, by international standards, etc.)
Relevance – When was the list created? How often is it revised? On which basis is it revised? Is someone responsible for maintaining the list?
Indicator type – How appropriate is the mixture of 'leading', 'current' and 'lagging' indicators? Do indicators refer to single or aggregated measurements?
Validity – For 'leading' indicators, how is their validity established? Are they based on an articulated process model?
Delay – For 'lagging' indicators, what is the duration of the lag?
Measurement type – How appropriate are the measurements? Are they qualitative or quantitative? (If quantitative, is a reasonable kind of scaling used?) Are the measurements reliable?
Measurement frequency – How often are the measurements made? (Continuously, regularly, now and then?)
Analysis/interpretation – What is the delay between measurement and analysis/interpretation? How many of the measurements are directly meaningful and how many require analysis of some kind? How are the results communicated and used?
Stability – Are the effects that are measured transient or permanent? How is this determined?
Organisational support – Is there a regular inspection scheme or schedule? Is it properly resourced?
Table 2.2: Probing Questions about the Ability to Monitor
2.5.3.3 Anticipation
Irregular events, the understood but challenging one-off events that are unexpected but not impossible, are the impetus for dealing with the potential, that is, for anticipating disruptions, pressures, and their consequences. Woods et al. (2010) frames anticipation in terms of adaptive capacity and warns that this must always be on the minds of managers. Failing to do so could result in a vulnerable system that may face sudden collapse. Adaptive capacity may include buffers or reserves of the system. Woods et al. (2010) identifies six patterns that signal that the adaptive capacity of a system is falling in terms of dwindling buffers or reserves, indicating that a shift in operations is in order to avoid failure. They are summarized below. First, "…resilient systems are able to recognize that adaptive capacity is falling or inadequate to the contingencies and squeezes or bottlenecks ahead." An example in natural systems is the pattern of how quickly or slowly the system recovers from disruptions; progressively slower recoveries may mean that the system is near a tipping point. The analyst should always be aware of what kinds of disruptions the system can handle and whether the disruptions influence the interdependencies among functions. This type of vigilance may also indicate that the system has untapped reserves of resilience that may surface if failure is avoided.
The second pattern is that "Resilient systems are able to recognize the threat of exhausting buffers or reserves." Basically, this means that managers should avoid employing all resources, if possible, when meeting a challenging event or other disruption. For instance, urban firefighters avoid "all hands" calls in order to be able to adapt to rapidly changing field conditions. Along the same lines, hospitals keep beds open in emergency wards in case of a bed "crunch." The third pattern identified is "Resilient systems are able to recognize when to shift priorities across goal-trade-offs." An example of a goal trade-off is the ETTO mentioned previously. In this pattern it is important for the analyst to identify where the system is positioned in the goal trade-off space, whether that position is appropriate for the context, and whether the system can migrate to a more favorable location in the continuum. A resilient system will know when to ease, or sacrifice, production goals when safety is endangered. The fourth pattern is "Resilient systems are able to make perspective shifts and contrast diverse perspectives that go beyond their nominal system position." This entails studying, from a system's view, how functional interdependencies, as perhaps identified in FRAM fashion, may impact the system. It also entails examining cross-scale interactions in the system (Hollnagel et al. 2006), that is, how blunt-end decisions, policies, resource allocation, and so forth affect sharp-end behavior. Woods calls this "downward resilience." Sharp-end behavior can affect learning and impact anticipation analysis and needs to be communicated to the strategy level; correspondingly, this is termed "upward resilience." The fifth pattern is "Resilient systems are able to navigate interdependencies across roles, activities, levels." Woods et al. (2010) warns that "Without the ability to carry out this form of anticipation, systems are at risk of the adaptive breakdown pattern of working at cross-purposes or being locally adaptive but globally maladaptive." Finally, the sixth pattern is "Resilient systems are able to recognize the need to learn new ways to adapt." This speaks to the relationship between anticipation and learning. As Woods et al. (2010) points out, it would be difficult to anticipate intelligently without reflecting on how the system works and what has previously gone right and wrong. The probing questions for anticipating are presented in Table 2.3.
Analysis item (ability to anticipate) – Probing questions
Expertise – Is there expertise available to look into the future? Is it in-house or outsourced?
Frequency – How often are future threats and opportunities assessed? Are assessments (and re-assessments) regular or irregular?
Communication – How well are the expectations about future events communicated or shared within the organisation?
Assumptions about the future (model of future) – Does the organisation have a recognisable 'model of the future'? Is this model clearly formulated? Are the models or assumptions about the future explicit or implicit? Is the model articulated or a 'folk' model (e.g., general common sense)?
Time horizon – How far does the organisation look ahead? Is there a common time horizon for different parts of the organisation (e.g. for business and safety)? Does the time horizon match the nature of the core business process?
Acceptability of risks – Is there an explicit recognition of risks as acceptable and unacceptable? Is the basis for this distinction clearly expressed?
Aetiology – What is the assumed nature of future threats? (What are they and how do they develop?) What is the assumed nature of future opportunities? (What are they and how do they develop?)
Culture – To which extent is risk awareness part of the organizational culture?
Table 2.3: Probing Questions about the Ability to Anticipate
2.5.3.4 Responding
Responding is concerned with how the system behaves in "real time." This is the actual capability of the system to deal with the demands of the current disrupting situation. "At the 'sharp end' of the system, responding to the situation includes assessing the situation, knowing what to respond to, finding or deciding what to do, and when to do it." Stakeholders at the 'blunt end' contribute by ensuring resources are available (Paries 2011). Paries (2010) posits that there are two strategies associated with readiness to respond, proactive and reactive. The proactive strategy involves anticipation of potential disruptions and predefined responses. The reactive approach "is to generate, create, invent, or derive ad hoc solutions." He promotes a holistic approach that seeks to "establish (now) and maintain (tomorrow) a readiness to respond (at any time in the future)." The response and anticipation cornerstones are closely related. Paries (2010) points out that the Woods idea of cross-scale interactions is an important piece of the response puzzle. At the global level, analysts "may anticipate occurrences that are too rare to be even thought of at local scales, while local operators will anticipate situations that are much too detailed to be tackled at a larger scale." Hollnagel also recognizes that it is important to identify when situations encountered in 'real time' fall outside the range of anticipated variations; in those cases the system will not adapt properly. To adapt properly, a 'real-time' resilient system has to monitor its boundaries. The sequencing involves monitoring the current degree of control of the system and then anticipating the amount of control needed in the immediate future. Woods et al. (2010) summarizes this by stating "To be resilient, a system always keeps an eye on whether its adaptive capacity, as it is currently configured and performs, is adequate to meet the demands it will or could encounter in the future." Paries (2010) notes that it is impossible to anticipate everything and that a resilient system "must be both prepared and be prepared to be unprepared." Paries feels that something may be lost in the attempt to anticipate every possible or probable event given that no two accidents are alike. Some feel that it may be more beneficial to develop certain competencies to deal with situations encountered rather than domain-specific skills. The probing questions for responding are presented in Table 2.4.
Analysis item (ability to respond) – Probing questions
Event list – Is there a list of events for which the system has prepared responses? Do the events on the list make sense and is the list complete?
Background – Is there a clear basis for selecting the events? Is the list based on tradition, regulatory requirements, design basis, experience, expertise, risk assessment, industry standard, etc.?
Relevance – Is the list kept up-to-date? Are there rules/guidelines for when it should be revised (e.g. regularly or when necessary)? On which basis is it revised (e.g. event statistics, accidents)?
Threshold – Are there clear criteria for activating a response? Do the criteria refer to a threshold value or a rate of change? Are the criteria absolute or do they depend on internal/external factors? Is there a trade-off between safety and productivity?
Response list – How is it determined that the responses are adequate for the situations they refer to? (Empirically, or based on analyses or models?) Is it clear how the responses have been chosen?
Speed – How soon can an effective response begin? How fast can full response capability be established?
Duration – For how long can an effective response be sustained? How quickly can resources be replenished? What is the 'refractory' period?
Resources – Are there adequate resources available to respond (people, materials, competence, expertise, time, etc.)? How many are kept exclusively for the prepared responses?
Stop rule – Is there a clear criterion for returning to a 'normal' state?
Verification – Is the readiness to respond maintained? How and when is the readiness to respond verified?
Table 2.4: Probing Questions About the Ability to Respond
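Taken together, Tables 2.1 through 2.4 supply the probing questions from which a Resilience Analysis Grid profile, introduced in Section 2.5.3, can be drawn. The fragment below sketches one plausible way to turn scored answers into the star-shaped plot; the four example scores, the 0 to 4 scale, and the use of matplotlib polar axes are illustrative choices rather than a published RAG procedure.

import math
import matplotlib.pyplot as plt

# Hypothetical scores (0 = deficient, 4 = excellent) aggregated from the
# probing-question answers for each of the four abilities.
abilities = ["Respond", "Monitor", "Anticipate", "Learn"]
scores = [3.1, 2.4, 1.8, 2.9]

# Close the polygon by repeating the first point.
angles = [2 * math.pi * i / len(abilities) for i in range(len(abilities))]
angles += angles[:1]
values = scores + scores[:1]

ax = plt.subplot(polar=True)
ax.plot(angles, values, marker="o")
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(abilities)
ax.set_ylim(0, 4)
ax.set_title("Illustrative Resilience Analysis Grid profile")
plt.show()

In practice the scores would be aggregated from answers to the probing questions above, and repeated assessments could be overlaid to show whether an organization's profile is improving or eroding over time.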
2.6 Managing Performance Variability – "Making Ends Meet"
The Resilience Engineering outlook "…stresses the role of performance variability to ensure the normal functioning of socio-technical systems" and holds that "To improve system safety it is necessary to understand and to manage performance variability" (Macchi and Hollnagel 2011). Hollnagel (2004) argues that performance variability stems not from the complexity or demands themselves but rather from the adaptations required of humans and organizations (social entities) to control the complexity and meet the demands placed on the system; in other words, to "make ends meet." Performance variability is not found in machines and technology (Hollnagel 2009). Looking at the remark by Woods et al. (2007) that "Resilience represents the ability of a system to adapt or absorb disturbances, disruptions, and changes and especially those that fall outside the textbook operational envelope," it is clear that some parts of a socio-technical system are equipped to handle some disturbances and perturbations. However, many systems and sub-systems are underspecified, and it is impossible to identify all potential working scenarios or working conditions. Additionally, variability will also ensue from unpredictable activities such as inputs, resources, and late or incomplete instructions (Hollnagel 2004, 2009; Macchi and Hollnagel 2010). In discussing performance variability, Hollnagel (2004, 2009) draws upon the work of Perrow, who argued in "Normal Accidents" that complexity cannot be significantly reduced; the alternative is to try to manage the variability of the system. From the Resilience Engineering perspective, accident prevention is about managing performance variability; "Managing something requires being able to observe or detect it, being able to determine when it is getting out of hand, and being able effectively to introduce countermeasures or mitigating actions" (Hollnagel 2004). In the Resilience Engineering view performance variability has two sides. In the traditional (or engineering) sense, safety was ensured by "designing and enforcing barriers to reduce the number of human errors and to mitigate their consequences, or in other terms to reduce discretion and variability" (Re and Macchi 2010). Here the discretion of the worker or crew to adapt and react to demands is constrained. However, Re and Macchi (2010) point out that Hollnagel recognized that, in the norm, discretion and variability are also the sources of success and continued functioning.
Hollnagel et al (2008) state that “…failures represent the flip side of the adaptations necessary to cope with the real world complexity rather than a failure of normal system functions. Success depends on the ability of organisations, groups and individuals to anticipate risks and critical situations, to recognise them in time, and to take appropriate action; failure is due to the temporary or permanent absence of that ability.” Re and Macchi (2010) argue that recognizing the duality of performance variability by Hollnagel is a “transition point from an approach where humans are considered the weak and unreliable components of a socio-technical system to an approach where humans’ contribution to the functioning and to the safety of a system is mainly positive.” Hollnagel (2004) notes that the spectrum of performance variability differs given the system under consideration and the goals of the analyst. In, general, a system is variable if it changes over time. Hence, the rate of change of the system is important. Systems can consist of several subsystems, as evidenced by construction work where work is distributed to different trades and specialists, therefore, variability can take place on several time scales simultaneously. Hollnagel breaks down the variability in the subsystems as typical moment-to-moment, working environment, and organization variability. Typical moment-to-moment variability (i.e., short term fluctuations of resources, demands, and working conditions) takes place on the time scale of 79 a second or minute level. The working environment provides the next level of variability. This is concerned with the demands placed on those in areas such as the military, open-heart surgery, a construction site. This variability may be slower than moment-to-moment variability but can occur rapidly. The final category of performance variability is in the organization and is metaphorically described as the “slow drifting to new norms and emerging, tacit standards for performance” (Hollnagel 2004). Macchi and Hollnagel (2010) and Hollnagel (2009) describes the reasons for performance variability, they are presented below as summarized from Macchi. The reader is directed to Macchi’s dissertation for a full discussion of the reasons. 1. Physiological and/or fundamental Psychological factors. This class of factors have an influence on perception and vigilance. 2. Higher level Psychological factors. Ingenuity, creativity, adaptability, etc. and their effect on human performance have been investigated by Human Resources management and selection studies. To improve safety it is necessary to choose the right people. 3. Contextual factors. Extensive lists of contextual factors (such as Hollnagel’s Common Performance Factors (CPCs)) have been compiled to account for the detrimental effect context may have on human reliability. 4. Social factors. Meeting personal or social expectations, and complying with informal work standards, are examples of how organizational culture influences human performance at work. 5. Systemic factors. The need to stretch resources in order to meet performance demands or the need to substitute goals when dealing with unpredictable events is reasons why performance variability is influenced by systemic factors. 
80 The Efficiency – Thoroughness Trade Off principle constitutes a useful framework to understand performance variability induced by systemic factors (Hollnagel 2004, 2009, Macchi and Hollnagel 2010), and the focus is on this systematic factors as discussed in the next section. 2.6.1 Explanation of Performance Variability – The Efficiency Thoroughness Tradeoff (ETTO) Hollnagel (2009) postulates that the principle of the efficiency-thoroughness trade-off (ETTO), is a way to describe human and organizational performance variability. The ETTO principle brings to light the fundamental human condition that, because resources are limited, people and organizations act in ways that favor efficiency. The resource of “time,” because it underlies all that is done, is especially stressed in ETTO conversations and considerations. Conversely, thoroughness is required because accident events impact efficiency and perturb the system. The ETTO occurs because of certain rules (discussed later) that concern individual, organizations and social behavior in a work context. The ETTO can be used to reasonably explain how work is done and success is achieved and failure sometimes occurs as a failure to adapt to the demands of a system. At its core it represents the heuristics (or rules of thumb) that people use to make decisions needed to complete their work. The trade-off or choice between being thorough or efficient is a decision that the frontline worker is tacitly asked to make every day in order to complete work given that it is impossible to maximize both simultaneously. The ETTO principles aim to find the balance in context of the task at hand. Macchi and Hollnagel (2011) posit that this balance is found by taking into account a subjective evaluation of available resources and time, individual personality traits, social habits, practices, safety culture etc., social and organizational pressure, and the tendency to save time and resources in case of unexpected events. Hollangel (2009) states “For a recurrent work situation most people will naturally 81 choose the more efficient mode of operation as long as it, in their experience, is just as safe as the alternative.” Woods and Hollnagel (2006) cite research into production/safety trade-offs in laparoscopic surgery that has found that the decision to value production (i.e. efficiency) over safety (thoroughness) is implicit and unrecognized. Macchi and Hollnagel (2011) states that “From the ETTO perspective the understanding of human performance requires the acknowledgement that humans take sacrificing decisions, use mental models and apply heuristics.” These take into account the time pressures involved in most situations along with the scarcity of other resources, such as information. This point of view is summarized below and the reader is directed Macchi’s and Hollnagel and Hollnagel (2011, 2009) texts for a full discussion. Sacrificing decisions are a revision of Herbert Simon (1997) hypothesis of satisficing. Satisficing is a portmanteau that combines “satisfy” with “suffice” and is “understood as the attempt to achieve a minimum level of a particular variable when making a decision” (Macchi and Hollnagel 2011). 
With regards to the ETTO principle and in a Resilience Engineering perspective, “The sacrificing decision maker is unable to maximise the benefit due to the complexity or intractability of the working environment” and “The intractable nature of complex socio-technical not have a complete understanding of the situation, the potential consequences of their actions and they have not cognitively explored all the available alternatives” (Macchi and Hollnagel 2011). Mental models are used by individuals to make sense and simplify their interactions with the world. Macchi and Hollnagel (2011) cites Johnson-Laird’s theory of mental models and “how a person holds a mental working model of the phenomenon he/she interacts with. To encompass the scope required to support the human understanding of a situation, mental models must be 82 simpler than the real-world phenomenon they represent. In this way a person can base his/her understanding on a check of salient characteristics rather than checking every detail. Mental models therefore provide an effective way to cope with the complexity of the world based on knowledge and experience of already encountered situations. Problems arise if a situation is misjudged and a response plan is implemented for a situation which is not as it was thought. An important contribution of the mental models theory is the acknowledgement that people’s reasoning and behaviour is primarily influenced by the content-relatedness and form of the information presented rather than a logic reasoning.” Heuristics, or “rules of thumb” are used by people to reduce the complexity of the world around them. In the ETTO jargon, they save time and are sufficiently thorough by relying on past events and reasoning. Heuristics are used to quickly recognize similar situation and to judge uncertainty. The heuristics of similarity matching and frequency gambling are used in the former. The heuristics of representativeness, availability, and anchoring and adjustment are used for the latter. By promoting the ETTO principle, Hollnagel intends to shift the fundamental outlook of risk assessment. In particular, instead of researchers and investigators trying to determine how a component or subsystem may fail or an unwanted outcome may occur; they should be asking “How and when the variability of normal performance, i.e., the adjustments that people must make to accomplish their work, can lead to adverse outcomes” (Hollnagel 2009). Also, instead of asking how human error might occur the question should be how likely is it that a person or an organization will make an ETTO? As established previously, ETTOs will always occur. The concern should be with how humans in different parts of the subsystems are practicing ETTOs and how they may combine to cause unintended and unwanted outcomes (i.e., an accident). 83 2.7 Functional Resonance Analysis Method (FRAM) FRAM is a RE based safety assessment and accident analysis method that builds on and complements traditional risk analysis methods by providing new insights and a deeper understanding of the actual functioning of the system (Herrera et al 2010). The premise behind adopting the FRAM is that the current methods, models of safety risk assessment and accident analysis are insufficient to gain a further understanding of complex socio-technical systems. 
The FRAM assumes that some accidents result from unexpected combinations of normal performance variability and that accidents are prevented by monitoring and damping this variability (Hollnagel 2004, 2009, Macchi 2011). The FRAM provides a way for the analyst to identify and visualize the dynamic interactions within a socio-technical systems approach and to gain a better understanding on non-linear dependencies, performance conditions and variability, and their resonance across important functions or activities. The FRAM and Resilience Engineering applications are utilized in complex socio-technical modeling and have the common assumptions that systems are always underspecified and local adjustments are needed by those at the workface to “make ends meet” (Macchi 2011). The FRAM is based on four principles (Hollangel 2009,Macchi 2011, Herrera et al 2010): 1. The principle of equivalences of successes and failures : Hollnagel (2009) quotes the philosopher Ernest Mach, who stated in 1905 that “Knowledge and error flow from the same mental sources, only success can tell one from the other.” 2. The principle of approximate adjustments: In the discussion above regarding performance variability it discussed that underspecification always occurs and are therefore unpredictable. Procedure and resources must be adapted to the situation. From the perspective of Resilience Engineering performance variability is both normal and necessary. 84 3. The principle of emergence : “The variability of normal performance is rarely large enough to be the cause of an accident in itself or even to constitute a malfunction. But the variability from multiple functions may combine in unexpected ways, leading to consequences that are disproportionally large, hence produce a non-linear effect. Both failures and normal performance are emergent rather than resultant phenomena, because neither can be attributed to or explained only by referring to the (mal) functions of specific components or parts” (Hollnagel 2010) 4. The principle of functional resonance: “FRAM replaces the traditional cause and effect relation by the principle of resonance. This means that the variability of a number of functions every now and then may resonate, i.e., reinforce each other and thereby cause the variability of one function to exceed normal limits. The outcome may, of course, be advantageous as well as detrimental, although the study of safety has naturally focused on the latter.) The consequences may spread through tight couplings rather than via identifiable and enumerable cause-effect links” …”The resonance analogy emphasizes that this is a dynamic phenomenon, hence not attributable to a simple combination of causal links. This principle makes it possible to capture the real dynamics of the system’s functioning (Woltjer & Hollnagel 2007), hence to identify emergent system properties that cannot be understood if the system is decomposed in isolated components” (Macchi 2010). The FRAM model describes a system’s functions and the potential couplings among functions. The model does not describe or depict an actual sequence of events (i.e., a scenario). A scenario can be described by an instantiation of the model. The instantiation is a “map” of how functions are coupled under given – favorable or unfavorable - conditions. The approaches differ slightly for risk assessment and accident analysis. For risk assessment, the steps consist of the following, which is greatly condensed (Macchi 2011, Hollnagel 2004, 2008, 2012): 85 1. 
Clarify the purpose of modeling and describe the situation being analyzed. In the prospective use of the FRAM, the purpose is to develop an overall understanding of the couplings and dependencies among the (foreground and background) functions of the system. 2. Identify the essential functions that are necessary (and sufficient) for the intended performance to occur (when 'things go right'). Characterize using the six basic aspects (Input, Output, Preconditions, Resources, Time, and Control). Taken together, the functions are sufficient to describe what should happen (i.e., the everyday or successful performance of a task or an activity). The foreground basically means that which is occurring at the workface, i.e. “normal work.” “A function is an activity of the socio-technical system towards a specific object” “The principle that guides the identification of functions is the need to achieve a description of the normal activities performed by the socio-technical system being analysed” (Macchi and Hollnagel 2010). The function is represented by a hexagon that is sometimes called a “snowflake” and is illustrated in Figure 2.5. The function is described by six aspects, time available, input, preconditions, resources, output, control and are described in the Figure. 86 Figure 2.5: Function/Activity representation and aspects (Hollnagel 2008) 3. Characterize the variability, first as the potential of the functions described by the model, and then as the (possible) actual variability for a set of instantiations of the model. Consider whether the actual variability will be what one should expect ('normal') or whether it will be unusually large ('abnormal'). 4. Identify the dynamic couplings (functional resonance) that likely will play a role during an event. These comprise an instantiation of the model which can be used to predict how an event will develop and whether control can be lost. “The aim of this step is to determine the possible ways in which the variability from one function could spread in the system and how it may combine with the variability of other functions. 87 5. Propose ways to monitor and dampen performance variability (indicators, barriers, design / modification, etc.) In the case of unexpected positive outcome, one should look for ways to amplify, in a controlled manner, the variability rather than for ways to dampen it. It is worthwhile to consider how the FRAM differs from how the production and safety combination is traditionally handled on construction projects. The authors experience in the areas of residential, commercial, military, and municipal construction is that, in general, the two are discussed separately, if at all. The venue for discussing production and safety includes preconstruction or preparatory meetings (in the case of the USACOE QC/QA system) meetings where production means, methods and materials are discussed in terms of the CPM schedule and how staffing and resources will be utilized to meet the production goals of the general contractor and ultimately the owner. The specifications are also reviewed as applicable to the trade. Safety is then discussed primarily in terms of compliance with regulations. The group produces a document such as an “activity hazard analysis” (AHA) that has a singular focus on that particular trade and usually pays little attention to the work of other trades. Safety plans are often generic and not site-specific. 
Additionally, safety plans are often non-existent or outdated and sitespecific plans are not drafted unless specified in the contract. Then, even with such specification, they may not be completed unless requested by the GC or owner’s representative. The attendees of these meetings typically consist of the owner’s representative (the QA personnel), the general contractor’s QC representative, the general superintendent, and the trade foreman and project manager. Occasionally, a dedicated safety officer from the trade or GC will attend. The discussion is heavily tilted toward “work-as-imagined” and not “work-as-done” as the meeting participants are managers and not those who actually perform the work, with the 88 exception of the field superintendent or foreman. Additionally, only the trade under scrutiny, for example the steel erector, is included in the conversation. In the traditional approach to safety there are a few important underlying assumptions. One is that accidents will occur in the same way that they occurred before, hence, the emphasis on regulatory compliance (e.g. OSHA). Safety regulations do help individuals and firms avoid repeating scenarios that have led to past accidents. However, they are very focused in nature (e.g. extend the ladder three feet above the roof line and secure it to the structure) and do not account for the occurrence of interactive unwanted events that may lead to an accident. For instance, the concurrent events of a ladder falling and a trench cave-in on the project may be a possibility. Additionally, accidents rarely happen in the same manner and in many instances the conditions surrounding the event are quite different. Another underlying assumption is that by removing the component (or potential hazard) that may fail, for example the “tied-off” ladder, safety is ensured. This harkens back to the linear thinking associated with Heinrich’s Domino Model whereby if the offending component (domino) is removed the system will be safe. The overwhelming emphasis of the safety portion of the meeting is on what can go wrong and not on what goes right and how we can amplify and learn from the adjustments that workers make in the course of normal work to avoid accidents. The FRAM offers a structured format in which to improve upon and add value to the traditional way of discussing the production and safety interaction of construction activities. First, it offers the opportunity for a richer discussion of how work will be performed in a safe manner. This necessarily entails inviting those associates that will actually be performing the work to select and instantiate the functions and activities. For example, in steel erection the crane operator, riggers, connectors, and foreman would be involved in the production and safety 89 conversation in addition to the traditional attendees. They would discuss the means and methods for production in terms of the six aspects of input, output, preconditions, resources, control, and time. The emphasis is on what should happen in the course of everyday ‘normal’ case and then how performance variability may affect that either positively or negatively. Unwanted events are discussed in terms of how they can be eliminated or dampened. This discussion emphasizes how the team may act proactively to affect safety by changing a production aspect, for example allowing more time to rig a structural steel member, and to realize how the bounded system interacts and may resonate out of control. 
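To make the functional description above more concrete, a minimal sketch in Python follows. It records the six aspects for two hypothetical steel-erection functions and lists potential couplings by matching one function’s output to another function’s input, precondition, or resource. The class, the aspect strings, and the coupling rule are illustrative assumptions for discussion only; they are not the published FRAM method or any FRAM software.

from dataclasses import dataclass, field
from typing import List

@dataclass
class FramFunction:
    """One function/activity described by the six FRAM aspects."""
    name: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    preconditions: List[str] = field(default_factory=list)
    resources: List[str] = field(default_factory=list)
    time: List[str] = field(default_factory=list)
    control: List[str] = field(default_factory=list)

def potential_couplings(functions):
    """List (upstream, downstream, aspect) links where one function's
    output appears as an input, precondition, or resource of another."""
    links = []
    for up in functions:
        for down in functions:
            if up is down:
                continue
            for item in up.outputs:
                if item in down.inputs:
                    links.append((up.name, down.name, "input"))
                if item in down.preconditions:
                    links.append((up.name, down.name, "precondition"))
                if item in down.resources:
                    links.append((up.name, down.name, "resource"))
    return links

# Hypothetical steel-erection functions, for illustration only.
rig = FramFunction("Rig steel member",
                   inputs=["member staged"],
                   outputs=["member rigged"],
                   resources=["rigging crew", "slings"],
                   control=["rigging plan"])
hoist = FramFunction("Hoist member",
                     inputs=["member rigged"],
                     outputs=["member at connection point"],
                     preconditions=["exclusion zone clear"],
                     resources=["crane", "operator"],
                     time=["lift window"])

for link in potential_couplings([rig, hoist]):
    print(link)  # e.g. ('Rig steel member', 'Hoist member', 'input')

In an actual FRAM session, the crews performing the work would supply and refine these descriptions, and the group would then judge where variability could resonate across the resulting couplings.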
Discussing regulatory requirements and contract specifications is still an important part of the conversation and included in the FRAM because it represents prior learning of how a system might fail. The author knows of no U.S. construction firm currently using the formalized or deliberate FRAM approach to risk assessment. No mention of FRAM in construction is made in either the popular or scholarly literature. This new idea has just recently (in 2012) been formalized in a dedicated text by Erik Hollnagel, the originator of the FRAM. Additionally, it has been presented at Lean Construction, American Society of Civil Engineers, and Construction Safety Council conferences only by the author. 2.8 The Resilience Analysis Grid (RAG) To assess resilience Hollnagel et al. (2011) proposes that four sets of “probing questions” be connected to the resilience capabilities can be used to develop a Resilience Analysis Grid (RAG). The answers to these questions can be used to construct a resilience profile by aggregating the ratings for each basic capability and coming up with a single rating for each capability. The probing questions presented in the tables above are domain independent and thus should not be used without confirming their relevance. Hence, the questions used could 90 either be a subset of these, reformulated ones, or completely new ones. As mentioned previously, Hollnagel et al. (2011) states that the relative weight or importance of the four capabilities may differ between domains, in other words, the mix of responding, learning, anticipating, and monitoring may vary from situation and/or organization. The procedure for filling out a RAG consists of the following steps: 1. Define and describe the system for which the RAG is to be constructed 2. Select a subset of relevant questions for each of the four capabilities 3. Rate the selected questions for each capability on a Likert type scale 4. Combine the ratings to a score for each capability, and for the four Capabilities combined 2.9 Other Understandings of Resilience Engineering Given that RE is an emerging field, researchers have concocted different understandings and/or explanations to describe the concept. Two prominent approaches are described below, that of Woods and Wreathall (2008) and Jackson (2009), and Madni and Jackson (2009). 2.9.1 Stress-Strain Analogy for Resilience Engineering Woods and Wreathall (2008) borrow the concept of stress-strain plots from material science to characterize and assess the resilience of a system. Figure 2.6 illustrates a typical stress-strain plot. Varying demands placed on a project (e.g. production and labor demands as described above) stand-in for stress (normally on the y-axis) on the plot. Strain is analogized to describe how the system adapts (or stretches), using available capacity (e.g. working overtime, renting additional excavation equipment) to the stress applied. Strain is plotted on the x-axis. The defining characteristic (i.e. parameters and regions) of the typical stress-strain plot, also known 91 as the state space, distinguishes the organization (or unit of analysis) as resilient in terms of adaptability. The state space also acts as a harbinger for management so that they can calibrate the organizations true status with perceived status. Typically, managers overestimate the state of safety, in other words firms believe that they are acting in a safer manner than reality indicates. 
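Returning briefly to the RAG procedure in Section 2.8, its aggregation step can be illustrated with a short sketch before the stress-strain state space of Figure 2.6 is examined in more detail. The ratings, equal weights, and capability labels below are invented placeholders rather than Hollnagel’s probing questions; the sketch only shows how selected Likert ratings might be combined into a capability profile and a single combined rating.

from statistics import mean

# Hypothetical Likert ratings (1 = poor ... 5 = excellent) for a subset of
# probing questions chosen for each capability; placeholders only.
ratings = {
    "respond":    [4, 3, 4],
    "monitor":    [2, 3, 2, 3],
    "anticipate": [3, 2],
    "learn":      [4, 4, 3],
}

# Relative weights may differ between domains; equal weights are assumed here.
weights = {"respond": 1.0, "monitor": 1.0, "anticipate": 1.0, "learn": 1.0}

capability_scores = {c: mean(r) for c, r in ratings.items()}
overall = (sum(capability_scores[c] * weights[c] for c in capability_scores)
           / sum(weights.values()))

print(capability_scores)   # resilience profile, one score per capability
print(round(overall, 2))   # single combined rating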
Figure 2.6: Stress-strain state-space (from Woods and Wreathall 2008) The uniform portion of the curve (the elastic region) corresponds to times and situations where the organization handles demands easily, stretching to accommodate them. Here the risks are anticipated by building in capacity to avert extraordinary failure or disruption. In other words, the company has adequately foreseen disruptions that may impact failures. Plans, procedures, and flexibility in operations are the bellwether of the uniform region. In general, the demands are well-known and accounted for, making stretching easy in this region. The yield height (the inflection point of the curve where elasticity ends and plasticity begins) of the 92 uniform response curve captures the first-order adaptive capacity of the firm. The linear region is also called the on-plan performance area. The yield height can be adjusted by adding capacity in the uniform region or by changing the range of demands the curve can accommodate. One example could be additional training such as high-rise rescue training, drills, and simulation on a multi-story building. Thinking in terms of adaptation and capacity can help the construction manager foresee risks and adapt plans accordingly. Beyond the uniform region lies what Woods and Wreathall call the extra region (x-region for short). In material science this is known as the plastic region. Here the demands encountered become more difficult to accommodate and the firm’s reaction to them is non-linear. Demands are imposed upon the firms that go beyond what was anticipated in the on-plan performance area. In this region safety and production efficiencies may be compromised as ‘gaps’ appear in the organizations that exceed the first-adaptive capacity. Resources, or second-order adaptive capacity, must be garnered to avoid reaching the failure point. Commonly, experienced groups or individuals at the workface can recognize when they are operating in the plastic region and take actions to cope with increasing demands, in other words they begin to fill in the ‘gaps’ caused by lack of capacity and foresight. These actions are indicators that the firm is in the xregion. This is reflected in the upswing portion of the x-region that corresponds to extra capacity added to meet demands. This action is fraught with potential problems. Other bottlenecks and constraints may appear and the tempo of the project increases. For instance, adding additional crews will incur more supervision. If additional superintendents or other supervisors are not added as the tempo increases in the x-region then safety might suffer as other areas on-site are neglected. 93 If the demands imposed in the x-region begin to exceed second-order adaptive capacity then the curve begins to acquire a negative slope and heads toward the failure point. To avoid failure, the firm may decide to re-structure. The re-structuring occurs at a point in the x-region, prior to failure but at a location where there is time to rally the requisite resources to rescue the project. This could entail comprehensively re-planning the project. Resilient firms will anticipate the restructuring phase and adjust capacity accordingly. Firms that are not resilient may erode safety margins and endanger personnel. This highlights the importance of calibration, or knowing where a firm is situated with respect to the state space system. Mis-calibrated firms don’t realize that they are in the x-region or perhaps heading toward failure. 
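One rough way to reason about position in the state space described above is sketched below: given a demand level and estimates of first-order and second-order adaptive capacity, a project is classified as operating in the uniform region, in the x-region, or beyond second-order capacity. Woods and Wreathall present the state space qualitatively; the thresholds, units, and function name here are invented purely for illustration.

def state_space_region(demand, first_order_capacity, second_order_capacity):
    """Classify where a project sits on the stress-strain state space.
    Assumes demand and capacities are expressed in the same (arbitrary) units."""
    if demand <= first_order_capacity:
        return "uniform region (on-plan performance)"
    if demand <= first_order_capacity + second_order_capacity:
        return "x-region (gaps filled by second-order adaptive capacity)"
    return "beyond second-order capacity (restructure or risk failure)"

# Hypothetical example: 100 units of planned capacity, 30 units of extra
# capacity that crews and managers can mobilise, and rising demand.
print(state_space_region(80, 100, 30))    # uniform region
print(state_space_region(120, 100, 30))   # x-region
print(state_space_region(150, 100, 30))   # beyond second-order capacity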
2.9.2 Madni and Jackson’s Resilience Engineering Framework Madni and Jackson (2009) build on early Resilience Engineering concepts to develop a conceptual framework for Resilience Engineering. They also draw upon some of their previous work in “architecting” system design. Jackson (2009) explains architecting in resilience terms thusly: “When we say that resilience can be architected, we mean that systems can be defined and the elements of the system can be arranged for which resilience will be an emergent property, that is to say, a property not possessed by the individual system elements.” The idea is that architecting “considers all aspects of defining the structure of a system” (Jackson 2009) as opposed to the perceived limiting scope of engineering. They view resilience as “a multi-faceted capability of a complex system that encompasses avoiding, absorbing, adapting to, and recovering from disruptions” (2009). Their disruption schema was discussed earlier in this work in the section on disruptions. The notions of avoidance, absorption, adaptation, and recovery are similar to Hollnagel’s four cornerstones (learning, responding, anticipating, and monitoring) of a resilient system although in a slightly 94 different manner in some respects and similar in others. Their conceptual framework for Resilience Engineering is based on four key “pillars” of disruptions, system attributes, methods, and metrics. The reader is referred to Madini and Jackson (2009) for details of their framework. 95 Chapter 3: Methods 3.1 Introduction Construction research has been defined by the Associated Schools of Construction (ASC) (Syal 1998) as “…any scholarly activity that expands the knowledge base in the field of construction. This may include: a) development of new knowledge, b) refinement of existing knowledge, and c) the transfer of knowledge from other fields to construction.” This research most closely resembles and focuses on the transfer of knowledge from other fields to construction given that RE is applied in various industries such as aviation, nuclear generation, and sea fishing. In a paper titled “Construction Research Agenda : Focus Area and Topics,” Syal (1998) categorizes construction research types as either “Based on the type of problem and the applicability of the solution” or “Based on the nature of the topic and the methodology.” The former category is further subdivided into basic, applied, or a combination of basic and applied research. He states that the latter category has the most relevance to the construction industry and can be delineated as survey-based, experimental, exploratory/developmental, and descriptive research. Syal does not dictate nor offer particular methods to be used within or across the categories with the possible exception of the survey-based research. Using Syal’s typology, this work cuts across three of the four subcategories of the latter category mentioned above. It is descriptive research, which is, according to Syal (1998), a “description of an unexplained or unknown problem or practice, including its causes, history, evolution, and possible solutions.” The description is found in the comprehensive literature review which outlines the history and evolution of the chronic safety problem in the United States from the industrial revolution to the present. The literature review also discusses the 96 unknown problem of disruptions in both general industry and construction. 
Finally, it introduces RE, as a possible solution to the managing of disruptions and increasing of safety on projects. The conceptual framework in this work can be thought of as exploratory and developmental. Syal (1998) describes this sub-category as “development of solutions for construction problems without experimentation but by utilizing existing expertise in other aspects of construction (highway to residential construction) or from other fields (manufacturing, business management, computer science, etc.).” RE borrows heavily from other fields as mentioned above. This work also explores the notion that RE is a way of understanding and explaining disruptions on construction projects that may lead to accidents. It develops a framework based on general RE principles applied to the construction project and the experience and 25-plus year expertise of the author. Finally, this work is experimental. It uses computer-based testing in the form of a hybrid simulation consisting of agent-based and discrete event modeling to test, gauge and verify a portion of the conceptual framework, namely the ETTO Principle. This part of the work is also developmental given that no tools, computer-based or real-world, exist to test RE principles. Given the unique nature of this topic and that it cuts across several different research categories, innovative combinations of methods to accomplish the research were used. Compounding the difficulty of choosing a method based on a safety-related topic is that it is fraught with moral and legal issues as described below. Before RE is implemented in actual field operations there needs to be much discussion among safety experts about how it would be implemented in the field. As a starting point, this work proposes a conceptual framework to guide that discussion. 97 3.1.1 Conceptual Frameworks When used as a noun, the word ‘concept’ refers to something, a thought or a notion, conceived in the mind. When used as an adjective it refers to the organization of an idea around a central theme or idea. The Conceptual Framework represented in this work is built around the central theme of RE as described by the RE scholarly community and how it can be used to better understand and manage disruptions that may contribute to accidents. The author conceptually translates these ideas to the construction context based on the literature and his experience and education in the construction industry. Thus this research can be considered empirical given that it is partially based on the observations and experience of the author. It provides a base upon which future researchers may accept or reject the notions of RE and how the author envisions it as applied to a construction project. Smythe (2004) defines a conceptual framework thusly: “A conceptual framework is described as a set of broad ideas and principles taken from relevant fields of enquiry and used to structure a subsequent presentation. When clearly articulated, a conceptual framework has potential usefulness as a tool to scaffold research and, therefore, to assist a researcher to make meaning of subsequent findings. Such a framework should be intended as a starting point for reflection about the research and its context. The framework is a research tool intended to assist a researcher to develop awareness and understanding of the situation under scrutiny and to communicate this. 
As with all investigation in the social world, the framework itself forms part of the agenda for negotiation to be scrutinised and tested, reviewed and reformed as a result of investigation.” Smythe (2004) feels that conceptual frameworks provide clear links from the literature to the research goals and questions, inform the research design, provide reference points for discussion of literature, methodology, and analysis of data, and contribute to the trustworthiness of the study. 98 How to go about creating conceptual frameworks is not well understood or documented in the literature. Many times the words ‘theory’ and ‘conceptual’ are used in conjunction with the word “framework” to describe the same thing or same approach to understanding how the world ‘works.’ However, Shields and Tajalli (2006) cite Abraham Kaplan and John Dewey to explain the difference between a theoretical framework and a conceptual one in empirical research. Shields and Tajalli posit that theory is used as a tool to structure inquiry. However, there is an intermediate step in forming theory that is often overlooked and this is the “behind the scenes” development of the procedures for forming concepts and hypotheses that should be exposed. In essence, the creation of the conceptual framework is an intermediate step to the formation of a theory to understand reality. Theory is ultimately used to organize the exploration of the problem under consideration. Shields and Tajalli (2006) state “A theory conforms to the facts and is a way of looking at the facts.” However, Kaplan notes that “conceptual frameworks are out in the open and are still conjectural or hypothetical. They are not truth; rather, a systematic way (still subject to reason) to organize inquiry.” Shields and Tajalli (2006) explain that Dewey (1938) likens conceptual frameworks to maps that help navigation through experience or the experiential world and represent and abstract from reality and that “When accurate, maps enable navigation within reality.” Shields and Tajalli (2006) summarize the connective functions of conceptual frameworks to theory, the purpose of the literature review, and the interplay of personal work experience by stating: “These (conceptual) frameworks help students connect forward into the problem and give direction on how to collect and analyze data. They also have a connective function backward to the literature and larger theoretical frameworks (i.e., neo-classical economics, organizational theory). Students are expected to justify their framework by connecting it to the scholarly literature (or an existing public affairs framework). 99 A literature review enables the student to get to know the topic, connect the larger literature to their work experience, and refine the research question or problem. The literature review may also reveal where previous inquiry has stopped. Conceptual frameworks are built upon the premise and practice of a careful, thoughtful, and reflective review of the literature. Students are thus expected to draw upon the wisdom and insights of the literature and their experiences to develop a plan or map to guide their inquiry. A good map helps one reach an unknown destination more quickly and with less anxiety.” Smythe (2004) echoes the notion that the researchers’ ‘life-world experience’ is a part of the development of the conceptual framework. 
However, Smythe cautions that the life-world experiences of the person developing the framework should not be attributed a power that they do not have and that the bounds of the researcher should be considered.

3.1.2 Method

The Method is illustrated in Figure 3.1. It is guided by the Objectives and, in general, follows the scientific method, using deductive reasoning to apply the general principles of RE to the specific problem of how to manage disruptions to construction processes that may compromise safety. Deductively, it links the premise that RE may be useful to explain disruptions that occur in project-based construction endeavors to the Conceptual Framework, and then to a Conceptual Model for simulation of how disruptions may set the stage for accidents, with the Efficiency-Thoroughness Trade-Off (ETTO) Principle as a backdrop. Objective one is met by a thorough literature review of construction safety and production, given that this relationship is a major premise of RE. The background of RE is then explored. Objective two is met by constructing a Framework of RE for project-based construction operations, based on the Literature Review and the author’s industry experience. The conceptual simulation model is presented, followed by a discussion of the results of the simulation and future work based on the Framework.

Figure 3.1: Method (flowchart linking the Objective 1 literature review, covering the evolution of industrial safety and the popular accident models (sequential, including the Domino Model; epidemiological, including Haddon’s Model; and systems, including normal accidents and HROs), to the current understanding of Resilience Engineering, the conceptual construction framework of Resilience Engineering based on the literature review (Objective 2), the conceptual simulation model (Objective 3), and the analysis and discussion)

3.2 Objective 1

Objective 1: Abstract the concept and underlying theories of RE and explore RE deployment in non-construction industries for use in formulating Objectives 2 and 3.

RE is “…a field in the midst of defining itself and its relationship to other fields, and this includes identifying and defining the phenomena which researchers in the field intend to investigate” (Mendonca, 2008). Because scant research has been conducted in the field, a critical first step is to establish the first principles of RE as they are currently understood. This task includes an investigation into the roots of the new field as well as the trajectory of safety and production thought, theory, and understanding. This objective was accomplished by a comprehensive literature review of RE. Because RE is concerned with safety and production, relevant areas of these disciplines were included in a general context. The main thrust of the literature review emphasizes understanding RE in the context of safety and production. The review fulfilled traditional literature review goals (i.e., gaps in the literature reviewed, consensus and debates revealed, future areas of study, etc.) and also comprehensively discussed RE concepts and principles in an effort to synthesize and abstract these concepts and enable Objective 2. The most important topic to address was the explanation of RE. Objective 1 provided the foundation for Objectives 2 and 3.

3.3 Objective 2

Objective 2: To present a RE conceptual framework for construction safety.
Informed by the work completed in Objective 1 and the author’s 25 years’ experience, Objective 2 addressed the question “How can RE be conceptually applied to reduce disruptions on construction projects?” The underlying assumption is that fewer disruptions may lead to a safer jobsite. Developing a conceptual framework for RE is a challenging intellectual undertaking. When something is described as resilient that description entails an entity or object that is dynamic and protean. There are many examples of using a conceptual approach in the earlier stages of ideas, such as RE, the earlier stages to describe an emerging phenomena in the literature. The conceptual framework sought to answer the Research Questions of “What elements of RE may help a construction project avoid, survive, and recover from disruptions?” and “How can we apply RE, in a formalized way, to construction operations?” This is accomplished by filtering out the essential elements of RE in the literature and applying them to the construction industry. The development first approached the challenge of applying RE to the construction industry by offering strategies to the executive or management level that may implement RE. Then specific guidance was given, based on the Resilience Analysis Grid (RAG) on how the four abilities might be implemented in the field. The framework also attempted to capture the complexity of the construction safety problem. As argued previously, construction projects are “complex systems.” A complex system is one which has, within itself, a capacity to respond to its environment in more than one way, and to select among the options in some way (Miller and Page 2007). Simon (1997) defined a complex 103 system as one made up of a large number of individual parts that have many interactions. A key concept of complex is that systems emerge to something that is greater than the sum of its parts. The framework, as well as the conceptual model and simulation, may best be described as a “map” to incorporate RE in construction environments and as a way to understand and analyze perturbations and disruptions in production processes. The “map” analogy is appropriate as described by Shields and Tajalli above. The elements of the framework consisted of a systematic and methodical process of addressing the distinct temporal phases in the occupational accident timeline that occurs in a context of production systems. In designing the production operations, a process of anticipation of threats is deployed, as well as provisions for avoidance of these threats. In recognition that anticipation will not be exhaustive given the random nature with which factors combine to create occupational accidents, here RE focuses on what should be done during an accident to reduce its impact. The final phase is post-accident and the recovery steps needed to stay on track with respect to progress towards system objectives (e.g., production). In this dissertation RE was primarily placed in the production context and the protection of workers during this phase. The conceptual framework was needed to further knowledge in the area of the construction safety/production mix. New approaches are needed that still respect and utilize existing methods. RE may prove to provide such an approach. However, the field is currently evolving and needs translators with domain specific knowledge to guide practitioners. 3.4 Objective 3 Objective 3: To explore RE implementation in construction production settings using hybrid computational methods. 
In an ideal world, the RE conceptual framework developed in Objective 2 would be tested on an active job site to determine if it is effective for improving construction safety. We live in an imperfect world, and this course of action is not possible. Introducing an untried safety program to workers, suspecting or unsuspecting, is, at the least, fraught with moral, legal, and economic hazards. Morally, one cannot knowingly put others in situations where they may be at risk. Legally, beginning a new safety plan without prior simulation or other means of “dry runs” may be viewed as disregarding the principle of due diligence. Finally, economically, if workers were injured or worse by the new approach, it could mean financial ruin for all parties involved. Evaluating this new paradigm to elevate safety becomes a “catch-22” for the construction researcher; new methods to alleviate the chronic problem of safety cannot be tried because they may present unacceptable hazards to the worker, while worker safety continues to stagnate because new methods are difficult to vet.

One way that is available to researchers is conceptual computer modeling and simulation. Simulation is a product of the model “...that imitates a real or imaginary dynamic system” (Martinez 1996). Done correctly, this approach carries no moral hazard, avoids the legal difficulties, and is economically attractive given that the cost resides in the software and programming. Epstein (2008) posits that the goal of building models should be to move beyond implicit mental models, where assumptions are hidden, and to create models that explicitly present assumptions so that others may recreate them and test the assumptions. Models are also useful in that they can act as focal points for others to discuss and understand the problem at hand. Epstein lists 16 ways, in addition to predictive value, that models are useful; these are to:

1. Explain
2. Guide data collection
3. Illuminate core dynamics
4. Suggest dynamical analogies
5. Discover new questions
6. Promote a scientific habit of mind
7. Bound (bracket) outcomes to plausible ranges
8. Illuminate core uncertainties
9. Offer crisis options in near-real time
10. Demonstrate tradeoffs / suggest efficiencies
11. Challenge the robustness of prevailing theory through perturbations
12. Expose prevailing wisdom as incompatible with available data
13. Train practitioners
14. Discipline the policy dialogue
15. Educate the general public
16. Reveal the apparently simple (complex) to be complex (simple)

No model will do all of these things. Of particular interest to this research is number 11: “Challenge the robustness of prevailing theory through perturbations.” An accident or incident is essentially a perturbation, or disruption, the magnitude of which corresponds to the ability of the system to absorb it. In essence, this is a measure of the resilience of the system. It is hypothesized that a resilient system will deal with perturbations handily and recover quickly. The simulation created is subject to several disruptions.

Modeling has been described as an act of artful approximation (North and Macal 2007). Modelers cannot (and should not) produce an exact recreation of the events to be modeled. Only the salient points of an actual event or situation should be abstracted and detailed; other items should be approximated, otherwise the model may become muddled and ineffective.
Various modeling techniques have been developed and refined over the years to determine the appropriate level of detail for models, to define the desired end state of the model so that progressive refinement may occur, and to develop criteria for determining the overall success and effectiveness of the modeling project. in general, models can be broadly classified as deterministic or stochastic (North and Macal 2007). Prominent approaches include discreteevent (DE) simulation, agent-based modeling (ABM), and blended (or hybrid) modeling. Each will be discussed briefly below as well as the hybrid software product Anylogic. 3.4.1 Discrete Event Modeling Discrete Event (DE) Modeling, which is sometimes called Process-Centric Modeling, is commonly used while simulating queuing, manufacturing and similar systems. Here the modeler models a process as a series of separate events, as opposed to the continuous view, as they unfold over time and at discrete points, or states in the process. “In discrete-event simulation, it is assumed that the state of the system changes instantaneously at specific times marked by events” (Martinez 1996). The occurrence of a discrete event triggers another event or chain of events as the simulation progresses though time. Discrete Event Modeling, which became popular in the 1960’s, is often used in conjunction with statistical techniques such as the Monte Carlo Method (North and Macal 2007) and other statistical techniques. Martinez (1996) states, “Most construction processes can be effectively modeled using discrete event simulation.” 107 3.4.2 Agent Based Modeling ABM is useful for describing systems that are open, complex, and with distributed control and resources. Economists, sociologists, anthropologists, political scientists, and others have applied ABM to specific and general problems (Epstein and Axtell 1997, Watkin et al, 2009). Most construction projects seem to fall under this umbrella. Agent based simulation reveals the global behavior of a system as structures and patterns emerging as a result of repetitive and competitive local interactions between agents and their environment (Axelrod & Tesfatsion, 2005). In general, ABM is considered a bottom-up modeling approach. Watkins et al (2009) describe ABM as: “... a computer simulation technique that allows the examination of how system rules and patterns emerge from the behaviors of individual agents. ABM creates artificial agents that represent individuals that have the ability to perceive and interact with each other and their environment. Based on their interactions, the agents can make autonomous decisions. The goal of the simulation is to track the interactions of the agents in their artificial environment and understand processes through which global patterns emerge, for contingencies. “ Axelrod and Tesfatsion (2005) posit that researchers who use ABM should pursue four main goals: empirical, normative, heuristic, and methodological. Empirical questions revolve around the phenomena of large scale regularities in the absence of central control. Normative goals revolve around the use of the ABM model as a “...laboratory for the discovery of good designs.” In other words, if an ABM works, or resembles the real world in a cogent way, can we then introduce other agents or alter the environment to create a “better world”? 
Heuristics involves asking, “How can greater insight be attained about the fundamental causal mechanisms in social systems?” The hope here is to envision causal relations beyond first-order effects. In The fifth discipline, Peter Senge warns that these secondary effects are oftentimes inadvertent and can 108 slow down the success of the system (Senge, 1999). “A fourth goal is methodological advancement” or to put it more generally, how will the next generation of researchers benefit, methodologically, from the models created in the present? 3.4.3 Multi-Scale Modeling Each of the singular approaches mentioned above has unique benefits. However, as North and Macal (2007) point out, “…no single modeling approach can be said to be the best approach for addressing all types of problems. There is no generic modeling technique.” However, they further note that “..it often is useful to combine one or more modeling approaches, employing each technique for that part of the model where it makes the most sense to do so, considering the unique capabilities and recognizing the limitations of each modeling approach.” In other words, different modeling methods are better suited to different levels of abstraction. North and Macal call this model blending, others use the term multi- method modeling (Sadsad and McDonnell 2007) and hybrid systems modeling, using hybrid to refer to the union of discrete and continuous systems (Borshchev and Filippov 2004). Until recently, no singular platform was available to combine ABM, SD, and DE modeling. Modelers who combined methods devised their own software or awkwardly combined proprietary or non-proprietary software packages to combine approaches. Perhaps the reason that hybrid methods have been underutilized resides in the reality that construction process and influences occur at differing temporal and spatial scales. For instance, high-level decisions that are made at executive levels, such as company policy and governmental regulation are resolved at a slower time scale than project-level decisions. Obversely, field-level decisions mostly occur on a quicker temporal scale. Spatial scales in both are obviously different. Sadsad (2007) describes this systems conundrum as follows: 109 “Multi-scale systems modelling and simulation represents a system in terms of different scales (spatial and temporal) of resolution (Bassingthwaighte, 2006) and suited to characterising the functions and desirability of organisational forms (BarYam, 2006). This modelling technique has the ability to adaptively switch between different levels of abstraction during real-time simulation (Bassingthwaighte, 2006).” 3.4.4 Anylogic Software The only known software program that can seamlessly integrate SD., ABM, and DE methods is the AnyLogic development environment (Borshchev and Filippov 2004, Anylogic 2013). Anylogic uses the Java Eclipse framework, which enables it to be used over a wide range of operating systems. Anylogic utilizes a language dependent (Java) application programming interface (API) that allows interoperability with office and corporate software, geographical information system (GIS) datasets, and custom modules written in Java, depending on the product option chosen, the cost of the software becomes increasingly steeper as the functionality increases. AnyLogic 6.4 features an optional built-in optimizer and enables animations that can be exported as java applets. Anylogic does not require an in-depth understanding of Java programming and has a user-friendly interface. 
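Before the Objective 3 model is described, the discrete-event view from Section 3.4.1 can be made concrete with a minimal event-scheduling sketch in Python: the state of a single-server work queue changes only at the instants marked by arrival and finish events. The arrival times and the fixed service time are arbitrary assumptions, and the sketch omits the statistical machinery (e.g., Monte Carlo sampling) that production-grade DES tools provide.

import heapq

events = []                      # priority queue of (time, kind)
for t in (0.0, 1.0, 1.5, 4.0):   # assumed arrival times of work packages
    heapq.heappush(events, (t, "arrival"))

queue_length, server_busy, service_time = 0, False, 2.0

while events:
    now, kind = heapq.heappop(events)   # state changes only at event times
    if kind == "arrival":
        if server_busy:
            queue_length += 1
        else:
            server_busy = True
            heapq.heappush(events, (now + service_time, "finish"))
    else:  # a "finish" event
        if queue_length:
            queue_length -= 1
            heapq.heappush(events, (now + service_time, "finish"))
        else:
            server_busy = False
    print(f"t={now:4.1f}  {kind:7s}  queue={queue_length}  busy={server_busy}")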
For Objective 3, a conceptual hybrid model consisting of agent-based and discrete event modeling was developed using the AnyLogic software. This software was chosen mainly because it provides a well-suited platform to illustrate the RE premise that safety and field operations management, or production, are inseparable and do not operate independently. Additionally, RE is characterized as having a sociotechnical perspective, providing another reason to merge DES and ABM. The DES feature of AnyLogic provides a way to graphically illustrate the ‘technical’, that is, the workflow and its associated queues and capacities. However, the DES feature does not easily allow the ‘socio’ portion of the sociotechnical system to be modeled. For this, the ABM feature of AnyLogic is utilized to mimic the behavior of the agents (or crews) in the system. Representing the technical (the process) alongside the socio (the behavior of the agents) seeks to provide a richer understanding of the emergence of the system and to illustrate to others how the two methods interact and complement one another.

Alternatives considered for illustrating the model and carrying out the simulation, in addition to the AnyLogic software, included coding the scenario in Python or Java entirely as an agent-based model that incorporates rudimentary elements of the discrete process, and using a less powerful ABM package such as NetLogo while including the production process as another agent. Other researchers have built elaborate computer architectures that combine the two methods using existing platforms, or have taken an existing platform such as a DES tool and augmented it with agent-based simulation by changing the coding, but this involves extensive intervention to coordinate the two programs. AnyLogic’s drag-and-drop architecture, along with its minimal need for Java coding, allows the researcher to focus on the problem at hand and not on computer programming. Finally, AnyLogic was chosen to meet future research needs, as system dynamics may be added to the simulation at another time. The author knows of no other work in the construction industry that uses AnyLogic software in a hybrid manner to discuss risk assessment.

Additionally, a crucial RE idea, the ETTO Principle, will be simulated. The principle of the efficiency-thoroughness trade-off (ETTO) is a way to describe human and organizational performance variability. The ETTO principle brings to light the fundamental human condition that, because resources are limited, people and organizations act in ways that favor efficiency. The ETTO principle was chosen to better understand how people and production systems interact under the influence of disturbances; in particular, as mentioned in the literature review, how performance variability in construction processes is affected by the adjustments that people must make to accomplish their work, and how these adjustments can lead to adverse outcomes. The “efficiency” will be simulated by agent behavior in the production scenario that maximizes production in each trade (i.e., locally optimizes) without regard to other parts of the system. The “thoroughness” is simulated by agents that look ahead to ensure that the trade downstream is not overwhelmed by work; to accomplish this, the agents monitor the queues surrounding their trade and adjust their work speed accordingly.
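The dissertation model itself was built in AnyLogic; the stand-alone Python sketch below is only a simplified illustration of the efficiency/thoroughness behavior just described. An upstream trade pushes work into a downstream queue each period; an “efficient” agent maximizes its own output, while a “thorough” agent slows down when the downstream queue exceeds an assumed look-ahead limit. The rates, the queue limit, the single disruption at period 20, and the two-trade structure are invented for illustration and do not reproduce the Chapter 5 experiments.

import random

random.seed(1)

class TradeAgent:
    """A crew that pushes completed units into the next trade's queue."""
    def __init__(self, base_rate, thorough, lookahead_limit=6):
        self.base_rate = base_rate        # units of work per period
        self.thorough = thorough          # ETTO: check the downstream queue or not
        self.lookahead_limit = lookahead_limit

    def output(self, downstream_queue_len):
        rate = self.base_rate * random.uniform(0.8, 1.2)   # performance variability
        if self.thorough and downstream_queue_len > self.lookahead_limit:
            rate *= 0.5    # thoroughness: slow down rather than overwhelm the next trade
        return int(round(rate))

def simulate(thorough, periods=40, downstream_capacity=4):
    upstream = TradeAgent(base_rate=5, thorough=thorough)
    queue = 0                  # work waiting for the downstream trade
    throughput = 0
    for t in range(periods):
        if t == 20:            # a simplified external disruption set by the modeller
            queue += 10
        queue += upstream.output(queue)
        done = min(downstream_capacity, queue)   # downstream trade clears what it can
        queue -= done
        throughput += done
    return throughput, queue   # total units finished, backlog left over

print("efficient agent:", simulate(thorough=False))
print("thorough agent :", simulate(thorough=True))

With these assumed numbers, both settings show similar throughput because the downstream trade is the constraint, but the efficient agent accumulates a much larger backlog of work-in-progress after the disruption; this is the kind of local optimization that the ETTO simulation in Chapter 5 is designed to expose.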
The production scenario is stochastic and is populated by the agents and subject to disruptions. The disruptions are internal and external. The internal disruptions will be stochastic and the external disruptions will be randomly set by the modeler. The conceptual approach as championed by Robinson (2007b) was used to first develop a conceptual model. This was followed by the actual coding of the model. Details of this approach are in Chapter 5. Four simulation experiments were created subject to increasing production pressure that measured the throughput of each simulation. First, a production scenario consisting of five trades was created without disruptions or agents. In effect, this is akin to a critical path method schedule where the project is under ideal conditions. The production line is subject to increasing production pressure induced by the modeler. The second simulation introduces internal and external disruptions to the system and measures throughput. The third simulation introduced agents that acted in a thorough manner, as described above, to the second experiment. Finally, Experiment 4 recreated Experiment 3 but here the agents only sought to maximize their production –they are “efficient” in ETTO terms. 112 Chapter 4: Resilience Engineering Conceptual Framework for Construction Safety 4.1 Purpose and Features of the Framework This Chapter addresses Objective Two of this work which is “To develop a Resilience Engineering conceptual framework for construction safety.” This Objective is in the shadow of the Goal of this work, which is “To explore schemes and methods to understand, harness, and foresee disturbances that arise from demands placed on the construction operations of projectbased organizations that deliver the built environment.” The conceptual framework seeks to answer the Research Questions of “What elements of RE may help a construction project avoid, survive, and recover from disruptions?” and “How can we apply RE, in a formalized way, to construction operations?” In short, RE is seen as a way to better manage disruptions that occur in a construction project setting. 4.1.2 Linking Disruptions, Resilience Engineering, Safety, and Production As a clarification, the relationship among disruptions, RE, safety, and production is outlined and briefly explained here. The literature review contains detailed explanations of each area described. The argument for linking these areas is outlined below and then briefly described and defended: • Disruptions can/may cause accidents and affect production • By definition and design, RE is proposed as a formalized approach to understand disruptions • Two of the four premises of RE deal directly with the relationship between production and safety 113 4.1.2.1 Disruptions can/may cause accidents As stated in the literature review, the literature on disturbances in construction operations as it affects safety is non-existent. Some work has been done on specific disturbances as change orders (Ibbs et al 2007) but, as mentioned previously, these are concerned mostly with post ex facto financial and legal claims on productivity alone. For the most part, change orders would be considered to fall within the base mechanism/model for being adaptive but should be considered on an individual basis for impact on safety and production. 
In other fields, for instance aviation (Madini and Jackson 2009, Jackson 2010) and manufacturing (Barasso and Wilson 1999, Toulouse 2002), researchers agree that there is a direct link between disruptions that impact production systems and safety. Jackson (2010) boldly asserts that “Accidents are the result of disruptions….” Several researchers (Barasso and Wilson 1999) summarize the categories for “Consequences of Disturbances” as: (having) no effect, presence of risk factors, hazardous situation, minor and serious accident, catastrophe, fatal, nonfatal lost time, and nonfatal non-lost time. Although no research exists that directly ties disruptions to accidents in the area of the built environment, it is not a large leap in logic, given the research done in other fields and the opinions of safety experts, to suggest that various disruptions may influence safety on the construction site. Mitropoulos et al (2005) offer a view that also proffers that the link between production and safety should be strengthened. They proposed strategies to deal with exposures and errors that are inherent in the production process. In their view construction is a system and safety should focus on the work factors of production and how these interact with other causal factors that trigger and ultimately release hazards on the construction site. Theirs is a decidedly prescriptive approach based on actual job conditions and worker behaviors. They recommend 114 that practitioners reduce task unpredictability and that improve error management capabilities to stay within safe operating boundaries. 4.1.2.2 RE is proposed as a Formalized Approach to Understand Disturbances Accepting that disruptions may be either the direct cause of accidents and/or set the stage for hazardous conditions in construction, then a formalized method to deal with these disruptions’ is needed. This work proposed that the emerging field of RE as a worthy candidate to do so in the complex world of construction. The first clue that RE may be a guide to understanding disruptions is found in the current working definition of RE: “RE is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions” (Hollnagel et al. 2011). Furthermore, each of the four basic abilities of the RE paradigm (responding, monitoring, anticipating, and learning) are concerned with dealing with disturbances. Finally, the basic premises of RE are also concerned with how to think about disturbances. For instance, a disturbance can be thought of as an ‘underspecification of performance conditions’ (Premise One). Also, multiple disturbances may lead to ‘unexpected combinations of performance variability’ (Premise two). 4.1.2.3 Two of the four premises of RE deal directly with the relationship between production and safety Premise three includes the statement that safety management must be proactive as well as reactive. This means that the production process should be proactively changed or modified, or, in other words, sufficiently flexible to adjust to avoid disruptions that might endanger safety. 115 Premise four, that “Safety and field operations management are inseparable and do not operate independently” means just what it states. Decisions about the production process should not be made without concern about safety, and vice versa. 
Traditionally, the emphasis in industry has been on meeting production goals with safety as a secondary, or after the fact concern. RE seeks to put safety and field operations on equal footing. The goal is reliability and resilience of operations to meet production goals. The Framework is presented in the remainder of this chapter. The unique perspectives or mindset that an organization would have to take to adopt are first presented. These consist of the need to consider and strive to make RE a quality of the system, making RE a strategy for the firm, creating a culture that is amendable to RE, and the need to view systems functionally. Then, the four main abilities of RE are discussed in terms of the probing questions from the RAG. Guidance in meeting the questions and implementing RE is given based on the probing questions. Finally, a graphical model of RE is developed using Yilmazs’ Sociocognitive Framework for Engineering Work Systems. 4.2 The Elements of a Project-Based RE Construction Project 4.2.1 Perspectives Practically any new outlook, or paradigm, involves a perspective shift on the part of the potential adopters and eventual users of the new approach. These perspective shifts lay the foundation for thinking about how the new initiatives fit into or replace current operations. Additionally, they challenge the stakeholder’s mental model(s) of how the system traditionally operates. For instance, those pursuing a sustainability agenda agree that design is best performed in an integrative manner rather than separately. Lean Construction proponents favor a collaborative 116 approach among the trades for onsite planning and execution. In general, new approaches, including RE, require that stakeholders revise their mental models about how they plan and organize work. Some of the different perspective involved in applying RE include thinking about resilience as a quality of the system, approaching RE with intent (i.e. a strategy), pursuing a certain cultural outlook, and describing systems as a collection of functions rather than structurally. Other perspective shifts involved with RE are contained in the four premises of RE, such as embracing performance variability in the normal course of work and understanding failure as a temporary inability to deal with complexity, as explained in the literature review. 4.2.1.1 Resilience is a Quality of the System In a perfect world RE would not be needed. The need for resilience implies some threat to the system operation, namely, in the form of some degree of disruption or perturbation. It is posited that resilience is a desired quality, or an inherent feature, of the system that is useful to combat disruptions and that it can be engineered into construction project operations. Understanding that RE is a desired “quality” of the system is perhaps the starting point to discuss the mindset needed to implement resilience into construction operations. This means that the goal of a resilient system to instinctively act in a resilient manner when an untoward event occurs. However, in construction some planners and managers strive for the desired quality of only being reliable, that is, accurately estimating the outcome, for instance within a certain time frame or budget, when certain activities are performed on a project. This is evident by the preparation of construction schedules, estimates, and other planning efforts. In fact, certain approaches to construction production, notably Lean Construction, primarily strive for reliability. 
Having reliable operations is thus a desirable quality. It makes the engineering of 117 resilience less difficult given that it reduces function variation. Reliability is an elusive quality on construction projects. For instance, the Percent Plan Complete (PPC) is a measure of reliability in Lean Construction projects. A PPC of eighty percent or higher is considered excellent but does not occur except in firms that are well-versed in the Lean approach. It seems then, in practice, that reliability is not enough to deal with the inevitable disruptions that the project will encounter. A project must also be resilient to sufficiently handle disruptive events. At this point the reader may pause and ask “How does a firm or project develop the quality of being resilient?” That is, how does a person or organization make RE an integral part of everyday operations? Even more abstractly, how does one go about making the fundamental changes to develop any desired quality? The abstract answer to this question is beyond the scope of this framework as the task here is to present the idea of RE and identify the elements that seem useful to implement RE in a firm. However, many texts exist in popular literature that deals with organizational change that may guide the reader. The steps to developing organizational and project qualities may mirror those that one uses to develop personal qualities and include things such as discipline, practice, and commitment to achieving that quality. 4.2.1.2 Resilience is a Strategy Much like the decision to pursue a green building or a Lean Construction approach to building, the choice to implement RE in construction operations is intentional and may even be termed a strategic move. It will require commitment by top executives and others who control and allocate project resources. There is an investment of time (e.g., more involved planning) and money (e.g., additional resources) needed to engineer resilience just as there is for engineering sustainability and reliability. As mentioned in the Literature Review, Madni and Jackson (2009) take the view that resilience comes at a cost and that resilience should be only infused in “crucial 118 leverage points” in the corporate structure. There is nothing in the literature that precludes the techniques of RE from being vetted in a pilot project or even a sub-project. This may be one way of capturing the costs associated with a program to implement RE on a larger scale. Identifying the ‘crucial leverage points’ of a project might be accomplished by an introspective analysis of the firm’s ability to deal with disruptions. One way to do this is via the RAG. After creating a baseline, Hollnagel et al. (2011) recommends using the RAG as a management tool for ongoing improvement of resilience. 4.2.1.3 Culture for Resilience Engineering The existing or potential climate and culture of a construction project are important aspects to consider in implementing RE. However, climate and culture are elusive concepts and are not easily open to formulaic applications to any industry. Additionally, there is no definitive RE climate or culture perspective. In depth coverage of these concepts are beyond the scope of this work. However, given that learning is one of the four basic abilities of RE organization, it may be worthwhile to the firm considering a RE approach to risk management to explore a cultural approach that promotes learning. 
Dekker (2007) builds on the notion of Reason (2008) in suggesting a 'Just Culture' that can "balance learning from incidents with accountability for their consequences," as described in the Literature Review. Here, incidents are defined as disruptions in the work flow of a lesser degree than an accident. A Just Culture also promotes the RE position that both 'root-cause' thinking and 'human error' are obsolete notions in complex sociotechnical systems. Dekker (2007) suggests some ways that an organization might foster a Just Culture with respect to safety. The first is determining just what constitutes an incident and who, with respect to expertise, gets involved in the aftermath (i.e., the front-line supervisor or other personnel). Additionally, Dekker suggests normalizing and legitimizing incidents by: o Viewing them as opportunities to learn o Abolishing financial and professional (e.g., license suspension) penalties for those who are involved in the incident(s) o Monitoring and attempting to prevent stigmatization of those involved in an incident o Implementing, or reviewing the effectiveness of, any debriefing programs or critical incident/stress management programs the organization may have in place to ensure that workers can view the incidents as 'normal' operational events o Creating a staff position that deals with incidents apart from the front-line supervisor, and eliminating any punitive actions that might impact the performance review of the worker o Beginning the Just Culture perspective as part of the indoctrination of employees so that they understand that reporting incidents is part of the organization's learning o Informing workers of their rights and duties in the event of an accident as well as the organization's standard operating procedures with respect to accident investigation; this eases the anxiety and tension of the worker. Additionally, if reliability is considered to work in conjunction with resilience to maintain system control, and is a desired quality as stated above, then the culture of a typical HRO might well be part of the RE culture. As described in the Literature Review, a "culture of reliability" that "distributes and instills the values of care and caution, respect for procedures, attentiveness, and individual responsibility for the promotion of safety among members throughout the organization" (Boin and Schulman 2008) might be adopted by the resilience-seeking organization. Borys et al. (2009) sum up the RE outlook by saying: "The adaptive age challenges the view of an organisational safety culture and instead recognises the existence of socially constructed sub-cultures. The adaptive age embraces adaptive cultures and resilience engineering and requires a change in perspective from human variability as a liability and in need of control, to human variability as an asset and important for safety. In the adaptive age learning from successful performance variability is as important as learning from failure." This view of culture seems particularly applicable to construction given the polyglot of trades and associated sub-cultures. 4.2.1.4 View Systems Functionally Viewing systems as a set of coupled and mutually dependent functions, or the functional view, may be, conceptually, the most difficult concept for the adopter of the RE approach to grasp and understand. However, it is an essential way of thinking in an RE perspective.
The notion put forth here is that thinking in functional terms gives analysts the opportunity to speak about and discuss production and safety in the same conversation. This follows from the basic premise of RE that production and safety are inseparable topics. Thinking in a functional manner necessitates that the broad nature of the work being undertaken is understood as it occurs at the workface. It also requires that the FRAM premises, as described in the Literature Review, are adopted. The FRAM method suggests that those most intimate with the work discuss and map how the functions are related. Then they can discuss the variability of the functions and how they may go out of control. A detailed discussion of the FRAM is beyond the scope of this work and the reader is directed to the Literature Review; for the purposes of this work a general knowledge of the FRAM and of the importance of the functional approach is sufficient. The structural perspective is well ingrained in the way construction operations are planned and managed. Speaking of the structural approach, Hollnagel et al. (2011) state: "Systems are usually defined with reference to their structure, that is, in terms of their parts and how they are connected or put together. Common definitions emphasize both that the system is a whole, and that it is composed of independent parts or objects that are interrelated in one way or another. Definitions of this type make it natural to rely on the principle of decomposition to understand how a system functions, and to explain the overall functioning in terms of the functioning of the components or parts – keeping in mind, of course, that the whole is larger than the sum of the parts." In the RE approach the functional perspective is the preferred approach, as Hollnagel et al. (2011) posit about its use: "It is, however, entirely possible to define a system in a different way, namely in term of how it functions rather in terms of what the components are and how they are put together. From this perspective, a system is a set of coupled or mutually dependent functions. This means that the characteristic performance of the system - of the set of functions - cannot be understood unless it includes a description of all the functions, that is, the set as a whole. This delimitation of the system is thus not based on its structure or on relations among components (the system architecture). An organization, for instance, should not be characterized by what it is but by what it does. Neither should it be characterized by the people who are in a given place (on the organizational chart or in reality) but by the functions they perform." In construction, the traditional model of how work is accomplished is conveyed via the CPM schedule (n.b., considered structural given the linear term "path"). Apart from a general safety plan prepared prior to the commencement of project work, safety is traditionally discussed as CPM activities become imminent in the field. Then an 'Activity Hazard Analysis' (AHA) (or a similarly named document) is prepared as a risk assessment tool that focuses on the work at hand. Safety is rarely examined in the context of the whole project, or from a functional perspective. The CPM schedule reflects 'work-as-imagined' by those distal to the work face as opposed to 'work-as-performed' in the field and ignores the complexity of the modern construction project.
It is of little use for disruption and perturbation analysis given its narrow focus, although it may be useful in that it identifies key activities and could serve as a springboard for identifying functional areas. The thoughtful reader will note that, perhaps, not all phases or parts of a construction project are complex. For instance, and depending on the project, the beginning and near-completion stages of a construction project are devoid of the complex interactions that occur in other stages of the project. Likewise, certain projects may simply not be as complex as others. RE approaches are still well suited to these endeavors, albeit on a smaller scale. The functional approach to systems analysis suits the fourth premise of RE, which is to consider that production and safety are inseparable. When used in the context of the FRAM, functional thinking can be used as a tool both to assess risk and to control production. Thinking from a functional perspective, instead of strictly thinking in structural terms, enriches the understanding of disruptions. It requires thinking in terms of not only how the individual functions under consideration may vary but also how other functions may vary because of the instability of the chosen function. Thus, by necessity, the analyst must have an understanding of the entire system and how work is actually accomplished at the 'sharp end.' This makes the distinction between a system and its environment, or the system boundaries, less important (Hollnagel et al. 2011). It is a more flexible approach than strictly relying on CPM. Understanding how work is accomplished at the workface by implementing functional thinking and the FRAM will require the planners to ask critically important questions such as "How tightly are the functions coupled?" and "What are our ETTOs?" in addition to considering the aspects of the function. Hollnagel et al. (2011) summarize the differences between the two approaches in the following quote. In doing so they allude to the notion that a system is reliable when its variability of functioning is acceptable, as long as it is in control, while at the same time acknowledging that this is not always the case and that a system also needs to be resilient. "The differences in perspective become clear when a system is defined in terms of how it functions rather than in terms of its architecture and components. In this case the question is whether the functioning achieves its purposes. But this cannot be simplified to a question of whether the system is in a 'normal' state or a 'failed' state. It is instead a question of the variability of functioning and whether the outcome is acceptable under the existing conditions. But as soon as we say variability, we also acknowledged that any 'failure' will be temporary, hence reversible. We should consequentially try to understand how likely the variability of multiple functions may interact to produce an unintended - and in most cases unwanted - outcome." The use of the FRAM requires managers to 'get their hands dirty' and work collaboratively with the trades in production planning while simultaneously considering safety. A function that is in control is both reliable and safe by definition. 4.2.1.5 The Four Abilities To be considered resilient, a project must have the four main abilities of RE as discussed previously. The mix of the four varies and depends on each individual firm's unique make-up.
They can be assessed and managed using the RAG's probing questions adapted to construction. The following sections look at the basic elements of the abilities of responding, anticipating, monitoring, and learning, and at some techniques for how they might be applied to a construction project. The abilities work synergistically and overlap. 4.2.1.5.1 Responding In RE parlance, responding is dealing with the 'actual' disrupting situation where it affects the core business process (Paries 2010). In construction work this is at the workface. Paries (2010) presents the anatomy of responding as consisting of assessing the situation and asking whether it is in or out of the realm of variability control. If the disruption is potentially out of control, what adaptation, if any, is required to meet the situation? Knowing what to respond to and when to respond, as well as what defenses and resources are available to meet the disruption, is critical to an effective response. Responses should be both proactive and reactive. Proactive responses anticipate the disruption and have pre-defined responses; reactive responses create, invent, or derive ad hoc solutions. The 'probing questions' that Hollnagel (2010) has developed as part of the RAG offer an opportunity to 'reverse engineer' the goals of RE. The probing questions for the 'Ability to Respond' are shown in Table 4.1 with 'guidance' suggestions to aid the practitioner. Table 4.1: Construction project guidance analysis item - Ability to Respond. Event list: Is there a list of events for which the system has prepared responses? Do the events on the list make sense and is the list complete? Guidance: An event list is a proactive approach to responding. Many events are anticipated by OSHA 29 CFR 1926 (e.g., hazard communication). Other examples of events that may go beyond the guidance provided by compliance with OSHA regulations, such as multiple simultaneous events, include high-rise rescue of workers, workers trapped in excavations, and an auto accident in a highway work zone. 'Making sense' is based on the context of the project and the experience of the analyst(s). A complete list may be a utopian state, given that disruptive events may combine in unique ways. The analyst might consider the simultaneous occurrence of two or more events; for instance, a worker dangling from a high-rise structure and a trench cave-in are not unthinkable in the event of an earthquake. Selection of the events should be dynamic, as the situation on a construction site can vary dramatically as work evolves. Background: Is there a clear basis for selecting the events? Is the list based on tradition, regulatory requirements, design basis, experience, expertise, risk assessment, industry standard, etc.? Guidance: Most likely the event list is based on a mix of all of the above, with 'meeting minimum regulatory requirements' taking precedence. What may be more important here is the process of selecting the events. In the spirit of the functional outlook they should be based on 'work-as-performed' rather than 'work-as-imagined.' This implies that front-line supervisors be involved in event selection along with higher-level resource allocators. Relevance: Is the list kept up-to-date? Are there rules/guidelines for when it should be revised (e.g., regularly or when necessary)? On which basis is it revised (e.g., event statistics, accidents)?
Guidance: An initial event list might be generated at the beginning of the project and updated periodically as work progresses. For instance, it could be reviewed and updated at pre-work meetings in lieu of or in addition to an AHA. As suggested, revision could be triggered by an accident; however, it may be more useful if triggered by an incident review prior to an accident. Threshold: Are there clear criteria for activating a response? Do the criteria refer to a threshold value or a rate of change? Are the criteria absolute or do they depend on internal/external factors? Is there a trade-off between safety and productivity? Guidance: This is closely allied with the monitoring ability and the use of passive and active indicators of performance. In most cases the traditional criterion for responding in construction is the occurrence of an accident or injury; resilience aims to avoid this situation. Some typical threshold values, or indicators, might be excessive hours worked or a high turnover of employees. Both external and internal factors will most likely be involved as working conditions change (e.g., weather) and depending on the skill of management in dealing with site workers. There are no studies related to safety/production trade-offs; however, events such as schedule slippage may trigger increased monitoring of workers and crews as they try to 'make up' schedule deficits. Response list: How is it determined that the responses are adequate for the situations they refer to (empirically, or based on analyses or models)? Is it clear how the responses have been chosen? Guidance: Conceivably, all three approaches are possible in a construction context. Empirically, small pilot programs could help to assess the effectiveness of responses. Analyses are part and parcel of the FRAM and should be done in a group setting. Finally, simulation models, such as the demonstration included in this work, could be used, as could BIM, to assess response scenarios. Resource allocators from upper-level management should be involved in this activity as they can speak to the availability of needed financial resources; upper-level management is also important for the speed, duration, and resources items discussed below. Stakeholders outside of the immediate project planning team should also be engaged for the response list. Possible groups include fire and rescue departments, vendors (e.g., high-rise crane service providers), and hazardous material response teams. Speed: How soon can an effective response begin? How fast can full response capability be established? Guidance: This aspect underscores the importance of updating the event list and resilience planning. Resources, such as excavators and cranes, and personnel are added and deleted on a construction site daily. The analyst should also consider resources such as nearby construction projects that have resources that could be utilized in an emergency condition. Importing resources can add time to the response. Duration: For how long can an effective response be sustained? How quickly can resources be replenished? What is the 'refractory' period? Guidance: This should be part of the above discussions. The 'refractory' period refers to the time needed for recovery. Resources: Are there adequate resources available to respond (people, materials, competence, expertise, time, etc.)? How many are kept exclusively for the prepared responses?
Guidance: Resource allocation has been discussed above. Additionally, the analyst needs to consider resources such as psychiatric counseling after a traumatic event as well as debriefing time after an unwanted or disastrous event. Stop rule: Is there a clear criterion for returning to a 'normal' state? Guidance: Essentially, this is the signal or directive to return to 'business' after a disrupting event. Verification: Is the readiness to respond maintained? How and when is the readiness to respond verified? Guidance: This aspect speaks to the discipline involved in keeping items such as the event list and available resource list current in a dynamic environment. In a truly resilient organization this is carried out by any associate or stakeholder under the principle of 'cross-checking,' without regard to authority or seniority level. 4.2.1.5.2 Anticipating The guidance to construction analysts in this section consists of some strategies for how to anticipate potential disruptions via Woods' 'patterns,' time-frames of anticipation, and, briefly, the who and what of which disruptions to anticipate. Additionally, the 'probing questions' of anticipation are reviewed and commentary is offered as it relates to construction. 4.2.1.5.2.1 Patterns in Anticipation The patterns of Woods et al. (2010), as discussed in the Literature Review, offer a formulaic way to anticipate disruptions. A few of the particularly relevant patterns are discussed in the context of construction as follows, with the patterns underlined: Resilient systems are able to recognize that adaptive capacity is falling or inadequate to the contingencies and squeezes or bottlenecks ahead: Here Woods is speaking in production terms (e.g., bottlenecks) about resilience and safety. This reiterates the suitability of the FRAM as a vehicle to discuss safety and production in the same conversation. Additionally, the FRAM can aid the analysts in examining how disruptions can combine in unanticipated ways. Adaptive capacity, in the context of a construction project, refers to the capacity to handle disruptions. For instance, to make up schedule time lost to disruption, additional shifts may be added. An initial production rise may be expected and observed; however, subsequent drops in production due to fatigue may signal a loss in adaptive capacity. Resilient systems are able to recognize the threat of exhausting buffers or reserves: Again, Woods is speaking in production terms about resilience and safety. In construction production planning it is deemed prudent to keep a backlog of work to turn to in the event of a disruption or other unplanned event. The link to safety may be when backlogs are exhausted and work is 'pushed.' This might result in overcrowded work areas (creating unsafe conditions) and even further production loss. Thinking in terms of keeping buffers and reserves is not limited to the workface. Organizationally, a construction company may want to limit taking on additional projects to be able to respond to or avoid disruptions. Resilient systems are able to recognize when to shift priorities across goal trade-offs: The example of the ETTO is most prominent in the RE literature. In construction, evidence or indicators of this may consist of an increased incidence of safety regulation violations on site.
For instance, repeated instances of neglecting to 'tie off' or, in the instance of adding additional shifts above, consistently working more than ten-hour shifts or other excessive overtime, especially in harsh (e.g., cold weather) conditions. Resilient systems are able to make perspective shifts and contrast diverse perspectives that go beyond their nominal system position: This pattern is closely tied to monitoring. Woods (2006) terms this sub-ability 'cross-scale' interaction and characterizes it as 'upward' and 'downward' resilience. Simply put, decisions at one level of the organization, strategic or operational, may affect behavior at the other. The goal is to be sensitive to both ends and make adjustments based on 'work-as-performed' rather than 'work-as-imagined.' Downward resilience refers to strategists' directives to the front line with regard to conveying clear goal structures, communicating the intent behind goals, and allocating adequate technology to reach the goals (Tjorhom and Aase 2011). Failure to do so may result in sacrifice decisions at the workface. Upward resilience refers to front-line workers using their experience and flexibility to deal with situations not specified in the downward directives. In the context of the work situation it may be necessary to make a sacrifice decision or otherwise deviate from 'downward' directives. Repeated deviations may result in changes to the operating directives or resource allocation. The key is an open channel of communication between the two and a willingness to learn from one another. Also implicit in the pattern is that the executive level needs to monitor the workface for needed changes to directives and for the need for adequate resources; the ability to learn from each end is implicit as well. 4.2.1.5.2.2 When to Anticipate When to begin thinking about disruptions in the life-span of a project is a fundamental question. Sheard and Mostashari (2008) gather input from several sources to suggest the time horizon shown in Figure 4.1. They note that Westrum's (2006) typology is most commonly cited and suggests that anticipation include response preparation before an event, survival during the event, and recovery after an event. It should be noted that the literature points out that anticipation also includes training to handle unforeseen situations, or 'being prepared to be unprepared,' and handling disruptions on an ad hoc basis. However, it seems that anticipation of disruptions in construction could benefit from the expanded time scale that Sheard and Mostashari present below. Figure 4.1: Time Periods (Sheard and Mostashari 2008) Long-term prevention could begin in the planning and early design stages. For instance, designing steel structures with lanyard attachment points is an example of foresight. Short-term avoidance is the period preceding work and could be discussed as part of the FRAM or at preconstruction meetings. Intermediate-term coping means anticipating responses to disruption events when they occur; for instance, what is the response to an excavation failure in real time? To the greatest extent possible, maintaining project goals, or system functionality, should be the aim here. Coping with ongoing trouble refers to dealing with the aftermath of an event. For instance, after a serious mishap involving fatalities, workers may need grief counseling to regain confidence in the project safety controls. Finally, long-term recovery speaks to the need to return to full functionality after an untoward event.
In some instances this may not be possible and reorganization is required. This might involve the replacement of a violating or willful contractor or redesign of a project. 4.2.1.5.2.3 General Comments on Anticipation The 'probing questions' related to anticipation in the RAG present an opportunity to offer guidance and general discussion related to RE and anticipation. This information is shown in Table 4.2 below. Table 4.2: Construction project guidance analysis item - Ability to Anticipate. Expertise: Is there expertise available to look into the future? Is it in-house or outsourced? Guidance: There are good arguments for hiring an outside organization (i.e., a consultant) to look for threats (i.e., disruptions) and opportunities (i.e., ways to benefit from and thrive on disruptions) to the project. A fresh outlook can 'see' things that insiders miss or have become accustomed to. However, no one knows the business as well as insiders. Growing and nurturing an awareness within the project of what might go right or wrong should be a strategic goal of upper management. Frequency: How often are future threats and opportunities assessed? Are assessments (and re-assessments) regular or irregular? Guidance: This may depend on many factors such as the availability of resources and the complexity and time-frame of the project. Changing environmental conditions (e.g., financial markets, political factors) may necessitate more frequent updates. Communication: How well are the expectations about future events communicated or shared within the organization? Guidance: Weekly on-site safety meetings are one way to disseminate news of possible future events along with ways to respond. Assumptions about the future (model of the future): Does the organization have a recognizable 'model of the future'? Is this model clearly formulated? Are the models or assumptions about the future explicit or implicit? Is the model articulated or a 'folk' model (e.g., general common sense)? Guidance: The 'model of the future' is first a strategic activity. The goal should be to develop a clear, coherent, and easily understandable direction for resilience improvements and sustainability. Then it should be communicated to all levels of the project/firm. Time horizon: How far does the organization look ahead? Is there a common time horizon for different parts of the organization (e.g., for business and safety)? Does the time horizon match the nature of the core business process? Guidance: Production concerns on a construction project are typically discussed in a six-week look-ahead time frame. For field matters this may be a good starting point. Other things, such as less imminent threats, may need a longer time horizon set by upper management. The use of the FRAM, where safety and production are discussed simultaneously, is recommended to synchronize time horizons. Acceptability of risks: Is there an explicit recognition of risks as acceptable and unacceptable? Is the basis for this distinction clearly expressed? Guidance: In construction, OSHA safety regulations are the guiding factors for acceptable risk behavior. RE aims to include that outlook and asks that such things as ETTOs be included in the analysis. Aetiology: What is the assumed nature of future threats? (What are they and how do they develop?) What is the assumed nature of future opportunities? (What are they and how do they develop?)
Guidance: Aetiology means studying the cause of things. This activity ties in closely with the learning ability and may be useful in retrospective analyses of anticipations and of patterns of failure and success. Culture: To what extent is risk awareness part of the organizational culture? Guidance: Please refer to the discussion of the 'Fair and Just Culture.' 4.2.1.5.3 Monitoring In RE, monitoring refers to knowing where the project is operating in terms of production and safety. Sometimes this is referred to as 'drift,' as in drifting close to or out of safe operating ranges or boundaries; this was mentioned in Chapter 1 in the discussion of 'plateaus.' Woods and Wreathall (2008) characterized the boundaries of the operating ranges as a stress-strain curve. Whatever metaphor is used, the emphasis is on realistically determining whether the project is operating within 'safe' production and safety ranges. The ideas of 'mindfulness' and 'sensemaking,' borrowed from HRO theory, are useful in monitoring. By being aware of the situation and seeing the 'big picture' (sensemaking), the monitor(s) are more likely to recognize when the project is going off-plan and make corrections. Lay (2011), writing in the area of turbine maintenance work, offers some strategies for monitoring conditions in the field. First, the project stakeholders need to be aware of risk profile changes. This is termed 'pinging.' Some of the indicators of profile changes that are construction related include: • Multiple issues taking the crew's attention • Progress stalling, schedule impacts, multiple delays • Changes in the mood of project leadership • Multiple quality and safety incidents, even if minor; an increase in errors • Common tasks not performed or performed late (such as getting permits) • Special situations with the potential to change workers' moods (such as working over holidays) or the risk level on site (severe weather) • Decline in communication • High fatigue • Poor site housekeeping. Lay notes that other practices in monitoring include training the entire organization in RE techniques, pinging, likely error situations, and error-likely 'climates' on site. Some error-likely climates she recognizes that apply to construction work are: • Intimidating field leadership style • Poor communications with field leadership • Field leaders who are distal from the work (e.g., stay in the trailer while concrete is placed) • Shift competitions and crews not working as a team • Customers directing or being overly involved in the field service scope of work • Leaders not familiar with current practices and cultures (contract employees) • Leaders who are not open to help. Another dimension of monitoring includes being aware of the trade-offs being made, especially the production-safety trade-off mentioned previously. The 'probing questions' related to monitoring are heavily invested in identifying and analyzing lagging, current, and leading indicators for the ability to monitor. Unfortunately, the construction industry currently relies heavily on lagging indicators alone. Common industry indicators have been defined by regulators; they include fatalities, accidents, and DFW – all lagging. It is beyond the scope of this work to develop current and leading indicators for construction. 4.2.1.5.4 Learning Hollnagel (2011) provides guidance in the RAG probing questions for organizational learning that may apply to the construction industry.
However, the list must be 'reverse engineered' to obtain a checklist that may aid learning. Looking at the 'probing questions' for learning, the following items are abstracted: • The organization should have a systematically clear principle for which events are investigated and which are not. This concerns which disruptions to learn from and which to ignore. • Learn from success as well as failure. In other words, study 'normal' work and performance variability as well as failures. The underlying assumption is that accidents are rare events that do not provide many opportunities for learning, while the study of what 'goes right' provides many opportunities. This implies 'continuous' learning and not just learning when an unwanted event occurs. o Learning from actual work practices and conditions dovetails with 'upward' and 'downward' resilience as discussed previously. Both ends of the organizational spectrum must attune themselves to how work is actually being performed in the field so that effective responses may be crafted. • Provide formal support (i.e., resources, personnel, time) for organizational learning activities such as data collection and analysis. • Communicate what has been learned in a timely manner to all affected stakeholders. • Develop a means to verify (or confirm) that the intended learning has taken place, and ensure that the learning is sustained as long as it is still relevant. GameDay exercises are another way for individuals and organizations to learn how to better adapt and respond effectively in the face of disruptions. These have been used by organizations that have large-scale Web operations, such as Amazon.com. A GameDay simulation purposely exposes critical systems to disruptions to uncover flaws and dependencies. Participants are alerted that the system will be stressed but are not told the nature of the disturbance(s). It could be a major power failure, data corruption, a fire in a data center, or a combination of unwanted events. To graphically encapsulate the ideas presented in Section 4.2, Yilmaz's Sociocognitive Framework for computer software development was adapted to the RE case. The justification for the use of this model is presented below, as is the Sociocognitive Framework for Engineering Work Systems. Then the graphic representing RE at the workface is presented and explained. 4.3 Developing a Graphical Model for RE in Construction: The Software Development Process and The Sociocognitive Framework for Engineering Work Systems 4.3.1 The Software Development Process The software development process and the processes used to construct the built environment have much in common. Software development is "a knowledge acquisition activity" (Armour 2003) that involves the transformation of user needs into a software product that realizes the requirements elicited from these needs (Yilmaz 2007). Yilmaz (2007) states that software processes include a set of policies, procedures, and technologies within an organizational structure to produce and maintain software products. "The process involves knowledge acquisition activity phases, during which teams of engineers collaborate and coordinate within the constraints imposed by the management, as well as organizational norms, technology, culture, and policies" (Yilmaz 2007). Special characteristics of software development that are similar to construction (or should be) include being goal-directed and adaptive, improving over time, and being human-centered.
Human activity is at the core of software development, just as it is for the built environment. Humans are the decision-makers and have control at multiple levels and locations of the work systems. Models of software development must incorporate strategic change along with adaptive human, team, organizational, and cultural factors. To this end Yilmaz (2009) proposes the sociocognitive framework for engineering work systems. The reader is directed to the Yilmaz (2009) paper for details of the model. This Yilmaz sociocognitive framework, along with the FRAM and RE principles in general, serves as the inspiration for the RE Model for projects as explained below. 4.3.3 The Resilience Engineering Model for Projects Figure 4.2 summarizes the information contained in Section 4.2 and serves to illustrate one vision of how RE and the FRAM might be implemented at the workface as well as how the executive, operational, and production functions might interact. It is inspired by Yilmaz's Sociocognitive Model as described above. Paraphrasing Yilmaz's description of his model, the RE adaptation seeks to present the conceptualization of the critical elements of each level (e.g., operational, executive), as well as the different aspects that simultaneously co-exist (e.g., social dimension, human behavior dimension, organizational dimension) in the context of construction project processes. The executive level, or 'blunt end' (named the strategic level in the Yilmaz Model), works synergistically (as shown by the double arrow) with the operational level. On a construction project it may be populated with high-level managers and owners of the general contractor and subcontractors. As in the Yilmaz Model, it is responsible for monitoring, controlling, and adapting the operational level via dynamic model updating. This necessarily entails monitoring the boundaries of the system to determine if it is operating near levels that endanger safety; for instance, is the firm operating in the uniform, extra, or plastic regions of the stress-strain state space described in Figure 2-6? Figure 4.2: Resilience Engineering at the Project Workface (inspired by Yilmaz 2007). The executive level is also responsible for (re)organizing the social and physical structure of the organization. Here high-level strategic functions are initiated that affect production-level activities and outcomes. The traditional functions of planning, organizing, controlling, monitoring, and setting up reliable processes are initiated in broad strokes at the executive level and the details are carried out at the operational level. Additionally, the RE perspectives described in Section 4.2 are added at the executive level. Executive management buy-in of the four perspectives is crucial. As stated previously, engineering resilience into the system comes with a cost and may need to be strategically implemented at only key junctures and activities if not project-wide. Upper-level management holds the resources needed to add resilience to a project. Not shown in the model are the organizational inputs as described by Yilmaz above, such as resources (budgets), culture, norms, and objectives. Also not shown are the environmental stressors (as inputs) placed on the project, as described by Rasmussen in Chapter 1. The Operational level works synergistically with both the Executive and Production levels to carry out the details of the strategy as defined by the Executive level in order to facilitate production.
On a construction project the Operational level may be populated with the project managers of the various trades and the general contractor or construction manager, who are tasked with carrying out the strategy of the executive level. As in the Yilmaz Model, it consists of the organizational, social, and informational (communication) subsystems as described above. A detailed discussion of the subsystems is beyond the scope of this work, but they closely mirror Yilmaz's description above. The communication subsystem was renamed 'information' (which includes 'communication') as described above because it encompasses a broader perspective than simply 'communication' and includes emerging trends such as Building Information Modeling (BIM). The Production level, or those activities at the work face, carries out the strategy of the Executive level and the detailed planning of the Operational level. It is populated by the various trades that are needed to complete the work. It is directly impacted by disruptions, broadly represented as 'Type A' and 'Type B' disruptions as explained in the literature review. The trades may interact in such a way as to reveal 'emergence,' a basic principle of the FRAM described in the literature review. This principle explains how the performance of the trades may combine in unexpected ways to affect safety and productivity. The principle of approximate adjustments is also in play at the workface. This is seen in the behavior of the interacting trades as they react to disruptions in the face of ever-increasing demands. It is manifest in the ETTO, a behavioral output of the system, along with productivity levels and ultimately cost. Based on the output, adjustments are made in the executive and operational areas, as shown by the arrows leading out of the output box. The FRAM 'snowflakes' are overlaid onto Figure 4.2 to illustrate how the four basic capabilities are coupled within the Framework. The functions have been placed on the diagram where they impact the Framework most. For instance, the basic ability of 'response' is embedded in the Production level given that responses are triggered by external and/or internal events and are facilitated by the monitoring function (Hollnagel 2011). Constant monitoring is placed between the operational and production levels given that monitoring primarily requires attention by the Production and Operational levels, with assistance from the Executive level as described above in the state-space discussion. Both anticipation and learning are embedded in the Operational/Executive area given that they are both heavily influenced by past events and can benefit from the experience of Executive and Operational personnel. Finally, a rough instantiation (or coupling of the four basic abilities) is presented on the diagram. For instance, the information subsystem (consisting of plans and procedures) in the Operational level is an input to the aspect 'condition' of the basic ability of 'response.' Other instantiations are as shown.
Chapter 5: Resilience Engineering Conceptual Model Simulation 5.1 Introduction This chapter addresses the Research Question "How can we begin to model and simulate disruptions in construction operations using RE principles?" and Objective 3 of this work, which is "To explore RE implementation in construction production settings that experience disruptions in a formalized way using hybrid computational methods." To do so, the process of developing the model was conceptualized using a formalized method developed by Robinson (2007b) that was adapted for this purpose. Then the hybrid simulation model was coded using the AnyLogic software. Finally, the basic understanding obtained from the simulation is briefly discussed. The model and corresponding simulation do not attempt to capture and encompass the entire RE spectrum represented in Figure 4.2. Here the focus is only on the ETTO in a simple production setting populated by workers and managers. The ETTO is an important part of RE and is used to understand performance variability as described in the literature review. The approach is outlined more fully below. Computer simulation is used as a means to further explore RE. The hybrid method of discrete-event and agent-based modeling was used to illustrate some RE characteristics. Computer simulation was chosen because RE is untested in the construction industry. Physically implementing RE, an emerging and untested approach to safety in construction, in a field setting would introduce a moral hazard with regard to the safety of the workers. Computer simulation offers a first step towards possible field implementation without jeopardizing worker safety. The Research Question and Objective 3 are approached abstractly and conceptually to produce the computer simulation. The approach is abstract for two reasons. First, to simplify the process that is investigated so that it may be viewed as applying across a broad range of construction activities that experience disruptions. Second, it is abstracted in order to focus on the nature of the disruptions and RE concepts rather than a particular construction activity or sequence of activities, thus simplifying the model and simulation. Simplification is desired to keep the model from becoming overly complex. In general, simple models run faster, are more flexible, can be developed faster, require less data, and are easier to interpret (Robinson 2007a). The Question and Objective 3 are approached conceptually in a similar vein to the Framework developed in this work. As presented earlier by Smythe (2004), speaking generally, a "…conceptual framework should be intended as a starting point for reflection about the research and its context. The framework is a research tool intended to assist a researcher to develop awareness and understanding of the situation under scrutiny and to communicate this. As with all investigation in the social world, the framework itself forms part of the agenda for negotiation to be scrutinised and tested, reviewed and reformed as a result of investigation." Robinson (2007a, 2007b), speaking specifically about computer simulation, contends that developing a conceptual computer model should be a creative endeavor used to communicate, debate, and agree upon the final coded simulation model. Robinson is essentially advocating writing the detailed specifications for a simulation prior to any coding.
Addressing simulation modeling in particular, Robinson (2007a, 2007b) published a series of two articles that develop the background for a conceptual modeling framework. He (2007a) contends that conceptual modeling is "…probably the most important aspect of a simulation study." However, it is also the least understood aspect of simulation modeling. In general, the literature gives little detailed guidance on model creation and initiation, or on a formalized process of how to begin to model. Robinson (2007b) quotes Pidd in saying that modeling is a process of "muddling through." Robinson's framework provides a disciplined guide to making decisions about a simulation model for a specific project. Although it was developed with the discrete-event method of simulation in mind, he posits that the framework has wider applicability to other modeling methods. The remainder of this chapter consists of two parts: first, the background of conceptual modeling as presented by Robinson (2007a, b); then the simulation model that addresses the Research Question under consideration as well as Objective 3. 5.1 Background of Conceptual Modeling In his papers Robinson discusses the lack of a formalized approach to conceptualizing simulation models. In the first paper (2007a) he defines the meaning of conceptual modeling and the requirements of a conceptual model. In the second paper (2007b) a framework for conceptual modeling is presented. The basic ideas and important definitions of the papers are presented here, and Robinson's framework is used to explain how the simulation was developed for this work. The reader is directed to these papers for the details and justifications of Robinson's approach. Robinson (2007a) states that "Conceptual modeling is the process of abstracting a model from a real or proposed system. It is almost certainly the most important aspect of a simulation project." He further states that it is more of an art than a science. The developers involved in a conceptual modeling project consist of the client, the modeler, and the owners. Depending upon the situation, these roles may be taken up by one person or a team of individuals working together to develop the simulation model. A conceptual model is "A non-software specific description of the computer simulation model (that will be, is or has been developed), describing the objectives, inputs, outputs, content, assumptions and simplifications of the model." In short, it is the specification for the model to be developed without regard to how it will be coded. Conceptual modeling "…is about determining the right model, not how the software will be implemented" and is not software specific. Conceptual modeling is simply the process of creating the conceptual model and requires the following activities: 1. Understanding the problem situation (a precursor to conceptual modeling), 2. Determining the modeling and general project objectives, 3. Identifying the model outputs (responses), 4. Identifying the model inputs (experimental factors), and 5. Determining the model content (scope and level of detail), identifying any assumptions and simplifications. Taken together, the preceding activities make up the framework of conceptual modeling and are shown within the ellipse in Figure 5.1 (with the exception of item 1, understanding the problem situation). The details of the framework will be explained below.
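As an illustrative aside, and not something Robinson's papers prescribe, the elements that such a non-software-specific description gathers can be held in a simple structured record so that they are stated explicitly before any coding begins. The minimal Python sketch below does only that; every field name and example value is an assumption made here for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ConceptualModelSpec:
    """Illustrative container for the elements a conceptual model describes.

    The fields follow the activities listed above; the example values are
    invented for illustration and are not prescribed by Robinson (2007a, b).
    """
    objectives: List[str] = field(default_factory=list)       # modeling and general project objectives
    outputs: List[str] = field(default_factory=list)          # responses used to judge the objectives
    inputs: List[str] = field(default_factory=list)           # experimental factors that can be varied
    scope: List[str] = field(default_factory=list)            # components inside the model boundary
    level_of_detail: Dict[str, str] = field(default_factory=dict)  # detail decision per component
    assumptions: List[str] = field(default_factory=list)
    simplifications: List[str] = field(default_factory=list)

# A hypothetical, partially filled specification for a small production-line study.
spec = ConceptualModelSpec(
    objectives=["Understand throughput variability under disruptions"],
    outputs=["Daily throughput (mean, standard deviation, minimum, maximum)"],
    inputs=["Disruption frequency", "Crew size"],
    scope=["Entities (work units)", "Activities (trade tasks)", "Queues", "Resources (crews)"],
    level_of_detail={"Queues": "infinite capacity, first-in-first-out"},
    assumptions=["All trades have the same task durations"],
    simplifications=["Supply-chain disruptions are neglected"],
)
print(spec)
```

Writing the specification down in this explicit, software-agnostic form is simply one way of honoring Robinson's point that the "right model" is agreed upon before any implementation decisions are made.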
Robinson (2007a, b) also discusses requirements for a conceptual model; however, these are mostly client related and not central to the discussion in this work. They would be applicable if a concrete real-world problem were encountered by the researcher. The requirements consist of validity, credibility, utility, and feasibility. The reader is directed to Robinson's papers for further details of the requirements. Robinson (2007a) also emphasizes the need to keep the model simple, as described above. 5.1.1 Developing the Conceptual Model The research question and objective associated with the model and simulation are explored using Robinson's framework (2007a, b) as outlined above and illustrated in Figure 5.1. Figure 5.1: Robinson's conceptual model in the simulation project life-cycle. The conceptual model framework is contained within the ellipse. The double arrows indicate the interaction and iterative nature of items outside of the ellipse with each other and with the framework (Robinson 2007a). 5.1.2 Understanding the Problem Situation The first step in conceptual modeling is to develop an understanding of the problem situation. This is driven by a need to improve a problem situation. There are generally three scenarios that exist when trying to model a real-world problem: • The problem situation is clearly understood and expressed. This scenario is the easiest for the modeler but is rarely encountered. If the problem situation is clear, there most likely is no need for the simulation. • The problem situation is apparently well understood and expressed, although it is not. This speaks to the client's perception of the problem as compared to reality. Through the process of modeling the modeler may unearth or uncover different deficient areas than those identified by the client. The experience of the modeler and their ability to ask probing questions help rectify this situation. • The problem situation is neither well understood nor expressed. Here the modeler must be very adroit and engage the simulation team in framing the issue at hand. Formal problem structuring methods such as cognitive mapping and causal loop diagrams (as used in system dynamics models) may be used. However, several researchers suggest that the simulation itself be used as a problem structuring method. As Robinson (2007b) states, "The idea is not so much to develop an accurate model of the system under investigation, but to use the model as a means for debating and developing a shared understanding of the problem situation. Validity is measured in terms of the usefulness of the model in promoting this debate, rather than its accuracy." 5.1.3 Determining the Modeling and General Project Objectives Determining the model objectives is the key to model development. This step is broken down into three parts. First, the overall aims, or what the organization wants to achieve with the model, are explored. This is an important step and is continually reevaluated as it is the ultimate aim of the simulation modeling. If the model does not contribute to the overall objectives of the firm or study listed in this step then it may be of little value to the user. Second, the modeling objectives are listed. These should be conceived as those that can be achieved from the development and use of the model. The modeling objectives should strive to answer the question 'By the end of the study, what do you hope to achieve?' They are expressed in terms of achievement, performance, and constraints.
Achievement is what we hope to accomplish by developing the model (e.g., increase throughput). Every endeavor is bounded by some kind of constraint and this is no exception. An example of a constraint is the budget; another is the available space to perform an activity. Finally, the general project objectives are listed. Here the modeler further clarifies the nature of the model and its use, given that this impacts the conceptual design. Consideration should be given to the time scale, run-speed, visual display, ease-of-use, and model/component reuse. 5.1.4 Identifying the Model Outputs The third step in developing the conceptual model is to identify the model outputs or responses. Robinson (2007b) notes that it does not matter whether the outputs (Step 3) are considered before the inputs (Step 4); however, he states that at the conceptual stage it is easier to consider what the model intends to achieve than to consider the inputs. The responses serve the purpose of identifying whether the modeling objectives have been achieved or not. The responses follow directly from the statement of modeling objectives. If the objectives are not being met then the model may need to be analyzed at a deeper level by the modeler and subject matter experts. Along with response identification, the modeler should also consider data reporting (i.e., graphical and/or numerical reports) and use of the model (e.g., for learning or other purposes). 5.1.5 Identifying the Model Inputs The model data that can be changed in order to achieve the modeling objectives stated in Step 2 are referred to as the model inputs or experimental factors. They are the means by which the stated objectives will be achieved. They may be quantitative, such as a change in the capacity of a resource, or qualitative, such as a change to the model structure. An advantage of using a modeling method such as the discrete-event (or process-centric) method is that experimentation with situations that are difficult to model or predict otherwise, such as customer arrival rates, can be completed fairly easily. When understanding of the system or process is an objective, the list of experimental factors needs to be more subtle. Identification of the most important factors by the domain experts is crucial in this effort. Finally, the methods of data entry should be considered in this step. 5.1.6 Determining the Model Content The model's content consists of two main areas: the identification of the scope and the model's level of detail. The scope identifies the boundaries of the model and the level of detail identifies the depth of the model. In this step the assumptions and simplifications are identified and the data requirements are clarified. Robinson (2007a) notes that at this juncture, that is, prior to beginning Step 5, the use of simulation as the appropriate vehicle for modeling should be questioned. The modeler should ask, "Is simulation the right approach for the problem situation at hand?" He also notes that Steps 1 through 4 are applicable to any modeling approach. From Step 5 forward, the conceptual modeling framework presented here is specific to simulation. Robinson's framework discusses simulation specifically in terms of discrete-event models but allows for the expansion of the framework to include other methods. The parts of the discrete-event model are referred to as components; the most widely used components are entities, activities, queues, and resources.
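To make this component vocabulary concrete, the sketch below shows one minimal way that entities, an activity, a queue, and a resource pool might appear in a simple discrete-event loop. It is an illustration only, written in plain Python rather than AnyLogic, and is not the model developed later in this chapter; the class name, parameter values, and arrival and processing rules are all assumptions made for the example.

```python
import heapq
import itertools
import random

random.seed(1)

class MiniDES:
    """Minimal discrete-event skeleton: entities arrive, wait in a queue,
    and are processed by an activity that seizes one unit of a resource."""

    def __init__(self, num_resources=2, cycle_time=4.0, mean_arrival_gap=3.0, horizon=480.0):
        self.now = 0.0
        self.events = []                     # event list: (time, sequence, action)
        self.seq = itertools.count()         # tie-breaker so the heap never compares actions
        self.queue = []                      # entities (work units) waiting for the activity
        self.free_resources = num_resources  # e.g., crews available to the activity
        self.cycle_time = cycle_time
        self.mean_arrival_gap = mean_arrival_gap
        self.horizon = horizon
        self.completed = 0                   # throughput counter (a model response)

    def schedule(self, delay, action):
        heapq.heappush(self.events, (self.now + delay, next(self.seq), action))

    def arrive(self):
        self.queue.append("work unit")       # a new entity enters the model boundary
        self.try_start()
        self.schedule(random.expovariate(1.0 / self.mean_arrival_gap), self.arrive)

    def try_start(self):
        # The activity starts only if an entity is queued and a resource is free.
        if self.queue and self.free_resources > 0:
            self.queue.pop(0)
            self.free_resources -= 1
            self.schedule(self.cycle_time, self.finish)

    def finish(self):
        self.free_resources += 1
        self.completed += 1
        self.try_start()

    def run(self):
        self.schedule(0.0, self.arrive)
        while self.events:
            time, _, action = heapq.heappop(self.events)
            if time > self.horizon:
                break
            self.now = time
            action()
        return self.completed

print("Entities completed in one run:", MiniDES().run())
```

Even this toy version shows why the scope and level-of-detail decisions discussed next matter: every component included (arrival patterns, breakdowns, shifts, routing) adds state and events that the code must carry.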
To determine the scope, the entities, activities, queues, and resources are first identified. Robinson (2007a) suggests a three-step approach. Step one identifies the model boundary by referring to the experimental factors and responses identified in Steps 3 and 4 of the framework. Step two consists of identifying the entities, activities, queues, and resources in the real system that lie within the model boundary, as well as their connections. Step three is to assess the components included in the model in terms of validity, credibility, utility, and feasibility; this assessment is not included in this dissertation and the reader is directed to Robinson's paper (2007b) for a full discussion of these criteria. 5.1.7 Determining the Model Level of Detail This sub-step delineates the decisions involved in determining the level of detail for each entity, activity, queue, and resource in the conceptual model. Again, for a real-world problem the modeler should assess these decisions against validity, credibility, utility, and feasibility with the modeling stakeholders. Table 5.1 provides details for consideration of each component. Table 5.1: Template for consideration of level of detail by component type (from Robinson (2007b)). The assumptions and simplifications should also be explicitly stated in this step (Step 5). Assumptions are made when there are uncertainties or beliefs about the system being modeled. Simplifications are used to allow more rapid model development and ease of use. The data requirements are also stated in this step. The level of detail table provides a list of data requirements. There are three types of data required: contextual data, data for model realization, and validation data. Only the first two are discussed in this work. Contextual data consists of things such as the layout of the process. Model realization data, such as cycle times and mean time between failures (MTBF), are obtained from the level of detail table. Finally, validation data are culled from historical records of existing systems. Note that validation is beyond the scope of this dissertation. 5.2 A Conceptual Framework to Describe Disruptions in the Construction Process This dissertation is driven by the objective to gain a better understanding of disturbances in the construction process. Jackson and Madni (2009) claim that disturbances lay the groundwork for accidents and poor safety in general production work. The claim made in this work is that RE is a means to better understand and deal with single or multiple disturbances in a construction setting. If disturbances are handled in a better way then perhaps accidents and fatalities may decrease. In this section the conceptual framework described above is applied to an abstract construction process to gain a better understanding of disruptions and RE, with the goal of stimulating communication and debate regarding the use of RE to understand disruptions. Robinson's (2007a, b) framework as described in Figure 5.1 is followed to develop the model and is amended to include agent-based simulation for the purpose of this dissertation. It is also amended to suit the research activity at hand, which is conceptual and abstract. The five steps of conceptual modeling are now presented, followed by the model coding and a discussion of the experiments.
5.2.1 Step 1: Understand the Problem Situation
The problem situation under consideration is that disruptions are not well understood in the context of construction production processes. Furthermore, multiple disruptions, whether concurrent or sequential, are even less well understood. Understanding disruptions is important given that some researchers have correlated their occurrence with unwanted outcomes on construction projects, namely accidents. The emerging paradigm of RE may help researchers and practitioners better understand disruptions. The model developed uses some RE concepts in the simulation to aid communication and negotiation among practitioners to gain a better understanding of disruptions and to ultimately handle them better. In terms of the scenarios discussed above, the disruption problem is neither well understood nor well expressed. Fatalities and accidents continue to occur on construction sites in disproportionate numbers.

5.2.2 Step 2: Determine the Objectives

Project Aim
The overall aim of the simulation is to learn from the development and implementation of this model. It will experimentally simulate disruptions in an abstract construction operation setting, using the ETTO Principle in a formalized way and employing the hybrid computational methods of agent-based and discrete event modeling. System throughput will be used to compare the experiments.

Modeling Objectives
• To address the Research Question and Objective 3.
• To understand the performance variability of the production process subject to disruptions of varying intensity, occurring at the same or dissimilar times.
  • This will be assessed generally by system throughput.
  • The disruptions will vary in intensity and occurrence. They are either internal to the system or originate externally. The impact of the disturbance will be reflected in a loss of resources for the particular activity.
• To represent and better understand the ETTO.
• To use some RE concepts in the simulation to aid communication, negotiation, and scrutiny among researchers and practitioners to gain a better understanding of disruptions and to ultimately handle them better.

General Project Objectives
• Time-scale: six month project, 8 hour working day, holidays are ignored.
• Flexibility: limited level, given that the model is abstracted and extensive model changes are not expected (i.e., the process remains the same over each experiment).
• Run-speed: real-time for illustrative purposes and virtual for debugging.
• Visual Data: simple 2D animation – this model is mainly for performing experiments and reporting the corresponding results. Graphics are only needed for model testing and diagnosis of problems during experimentation.
• Ease-of-use: simple interactive features, given that the model is for use by the modeler.
Figure 5.2: Listing the Project Aim and Objectives

5.2.3 Steps 3 and 4: Identifying the Model Outputs and Inputs
The proposed model outputs (responses) are identified as follows:

Outputs (to determine achievement of objectives)
• Throughput over the time-scale
Use:
• Learning and understanding
Data reporting:
• Raw data (throughput)
• Bar chart of daily throughput
• Mean, standard deviation, minimum and maximum daily throughput

Figure 5.3: The model outputs

The proposed model inputs (experimental factors) are identified as follows:

Experimental Factors
• Baseline model of the production train with no disruptions or agents
• Add disruptions to the baseline model
• Add agents with "thorough" behaviors that perceive production pressure to the baseline model
• Add agents with "efficient" behaviors that perceive production pressure to the baseline model

Figure 5.4: Identifying the model inputs

5.2.4 Step 5: Determining the Model Content
The model content (scope and level of detail), along with any assumptions and simplifications, is determined as follows:

Table 5.2: Model Content

Entities
• Quantity: Include. This is a linear production scenario, therefore one unit "chunk" of work at a time is considered.
• Arrival Pattern: Exclude. Adds unneeded complexity to the simulation.
• Attribute: Exclude. Adds unneeded complexity to the simulation.
• Routing: Exclude. Adds unneeded complexity to the simulation.
• Other: Include. Display style (bag in animation).

Activities
• Quantity: Include. Only one activity at a time is considered.
• Nature (X in Y out): Exclude. Adds unneeded complexity to the simulation.
• Cycle Time: Include.
• Breakdown/repair: Include. MTBF is considered as attached to a resource and thus an activity.
• Set-up/changeover: Exclude. Adds unneeded complexity to the simulation.
• Resources: Include. Resources are removed as a result of the disruption.
• Shifts: Exclude. Adds unneeded complexity to the simulation.
• Routing: Exclude. Adds unneeded complexity to the simulation.

Queues
• Quantity: Include. Schema calls for deletion/recovery of resources as a result of the disruption/recovery.
• Capacity: Exclude. Not needed in this model.
• Dwell Time: Exclude. This is determined by process times.
• Queue Discipline: Exclude. Default to "first in, first out."
• Breakdown/repair: Exclude. This impact is accounted for in the resource.
• Routing: Exclude. Adds unneeded complexity to the simulation.
• Other: Exclude. Adds unneeded complexity to the simulation.

Resources
• Quantity: Include. Resources are aggregated.
• Where required: Include. Resources are attached to each activity to simulate performance variability.
• Shifts: Exclude. Adds unneeded complexity to the simulation.

Model Assumptions and Simplifications
Assumptions
• All trades have the same delay times
• The queues are infinite
• Each trade has the same number of resources (5)
• No holidays
Simplifications
• Entities are readily available and not disrupted (i.e., supply chain disruptions are neglected)

Table 5.3: Model Assumptions and Simplifications

Data Requirements
• Delay times for trades
• Resource capacities
• Work schedule
• Resource schedule
• MTTF, MTTR
• Number of crews
• Project length

Table 5.4: Data Requirements
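To make the data requirements concrete, the assumptions above can be collected into a small parameter sketch. This is a minimal illustration only, written in plain Java rather than in the AnyLogic environment used for the actual model; the class and field names (SimulationParameters, RESOURCES_PER_TRADE, and so on) are hypothetical, and the values simply restate figures given in this chapter (five resources per trade, triangular delay arguments in days, an eight-hour workday, a roughly six-month project, and deliveries of one to five units). The number of trades is assumed.

    // Hypothetical parameter holder restating the stated model assumptions.
    public final class SimulationParameters {

        public static final int TRADES = 5;               // assumed number of trades in the production train
        public static final int RESOURCES_PER_TRADE = 5;  // "Each trade is assigned 5 units of resources"

        // Normal-speed delay, in days, as triangular (min, mode, max) arguments
        public static final double DELAY_MIN = 2.5;
        public static final double DELAY_MODE = 3.0;
        public static final double DELAY_MAX = 3.5;

        public static final double HOURS_PER_WORKDAY = 8.0;  // with two 15-minute breaks and a one-hour lunch
        public static final int PROJECT_LENGTH_WEEKS = 26;   // approximately six months

        public static final int MIN_UNITS_PER_DELIVERY = 1;  // deliveries grow from 1 to 5 units per day
        public static final int MAX_UNITS_PER_DELIVERY = 5;

        private SimulationParameters() { }                   // constants only; no instances
    }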
5.2.5 Coding the Computer Model

5.2.5.1 Experiment 1 Simulation
Figure 5.5 illustrates Experiment 1. This experiment was created as a baseline against which to compare the results of the other experiments. Experiments 2 and 3 expand this model to include disruptions and then agents, respectively. The elements of the model are explained below.

Figure 5.5: Experiment 1

The model begins with a source that generates the entities. In this model the entities are considered a unit of work that is delivered to the jobsite daily. The reader might assume that they are something like modular housing units or precast panels that are delivered to the field, unloaded from the delivery truck, and installed. A delivery may include anywhere from one to five units per day; that is, up to five modular living units to install or five panels of precast concrete fascia. Each experiment begins with one unit delivered to the jobsite at 7 am each Monday morning and progresses to five units per delivery per simulation run. Production pressure is then increased to include Monday and Tuesday deliveries, starting again with one delivery Monday morning at 7 am and one delivery Tuesday morning at 7 am. This sequence is repeated in each simulation run for every day of the week until five units are delivered each morning at 7 am. The schema is realized in AnyLogic as shown in Figure 5.6.

Figure 5.6: Sample arrival schedule for work entities. Each workday begins with a delivery of 1 to 5 units. This figure indicates a delivery schedule of five units delivered at 7 am each day of the work week.

The entities are processed by a series of trades, or activities, that complete the installation. This is modeled as a "Service" object in AnyLogic. Here the entity enters, seizes a resource from the resource pool, is delayed (or processed), and is then released after the process time. Also embedded in the service object is a queue where the entity waits for a resource. In this model the queues are assumed infinite for simplification. Each trade has a corresponding amount of aggregate resources (e.g., "resource_A"); these could be thought of as machinery, people, or anything else needed to complete the work. The number of available resources effectively determines the capacity of each trade. For this model the resources are abstracted and aggregated for simplification. Each trade is assigned 5 units of resources. In this model the capacity of each trade is set to the maximum possible allowed by the software but is limited by the resources available. Each incoming entity seizes a resource unit, is delayed for the process time, and then releases the resource and moves to the next trade; when an entity has completed the final trade it leaves the simulation via the "sink" object. The delay times and resource capacities are parameterized (e.g., "a_delay") for ease of data input and flexibility. The delay times are set to a triangular distribution and are stochastic.

A work schedule for the resources was set for this simulation. An eight-hour workday that includes two fifteen-minute breaks and a one-hour lunch was chosen. The resource schedule is controlled by an on/off scheme; that is, the resources are either working (or available to work) or off. The resource schedule is shown in Figure 5.7.

Figure 5.7: Resource schedule showing an eight-hour workday with two fifteen-minute breaks and a one-hour lunch.

The arrival and resource schedules have been designed so that they are in sync; that is, a delivery only occurs during working hours (i.e., every day at 7 am when the crews start work). A simple two-dimensional graphic was created that shows how the entities move through and are handled by the system. It is shown in Figure 5.8.

Figure 5.8: Animation for each Experiment. The number "5" indicates the capacity of each trade.
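The baseline logic just described can also be sketched outside of AnyLogic as a tiny stand-alone loop. This is a hedged, simplified illustration, not the author's model: the class and method names (BaselineSketch, triangular) are hypothetical, and the AnyLogic Service/ResourcePool machinery is reduced to a triangular draw of the processing time for each delivered unit at each trade.

    import java.util.Random;

    // Minimal sketch of the baseline "production train": each delivered unit passes
    // through a sequence of trades, and each trade holds the unit for a stochastic
    // processing time drawn from a triangular distribution (in days).
    public class BaselineSketch {

        static final Random RNG = new Random();

        // Sample a triangular(min, mode, max) value by inverse-transform sampling.
        static double triangular(double min, double mode, double max) {
            double u = RNG.nextDouble();
            double cut = (mode - min) / (max - min);
            if (u < cut) {
                return min + Math.sqrt(u * (max - min) * (mode - min));
            }
            return max - Math.sqrt((1 - u) * (max - min) * (max - mode));
        }

        public static void main(String[] args) {
            int trades = 5;            // assumed number of trades in the train
            int unitsDelivered = 5;    // one day's delivery at maximum production pressure
            double totalDays = 0;

            // Process each unit through every trade at the "normal" speed (2.5, 3, 3.5 days).
            for (int unit = 1; unit <= unitsDelivered; unit++) {
                for (int trade = 0; trade < trades; trade++) {
                    totalDays += triangular(2.5, 3.0, 3.5);
                }
            }
            System.out.printf("Total processing effort: %.1f trade-days%n", totalDays);
        }
    }

The sketch deliberately omits queues, resource seizure, and the work schedule; it is only meant to show how the stochastic delay times enter the baseline model.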
5.2.5.2 Experiment 2 Simulation
In Experiment 2 (partially shown in Figure 5.9), Experiment 1 is expanded to include disruptions to the model. The disruptions added are either Type A (external to the system) or Type B (systematic in nature, a disruption of function, capability, or capacity) as described in the literature review.

Figure 5.9: Experiment 2. Type A disruptions (those external to the system) are modeled as events (i.e., event, event1, event2, event3, and event4).

Each event (i.e., event, event1, event2, event3, and event4) is scheduled by the modeler at a certain time and date during working hours and occurs only once in each simulation run. In effect, the modeler may schedule disruptions here that could be considered "acts of God," such as a lightning strike or a major supply chain disruption. The events disrupt by reducing the capacity of each trade by 3 units, effectively slowing throughput. The capacity reduced is not restored fully until reset by the Type B disruption discussed below.

Type B disruptions are modeled by statecharts as shown in Figure 5.10. These can be thought of as disruptions caused by an equipment breakdown, or perhaps routine maintenance of equipment that shuts down work for a day, effectively halting production. The state of this disruption is either "Working" or "Out of Order." The states transition from one to the other via the "Maintenance" or "Repair" transition link. The parameters "MTTF" and "MTTR" refer to "Mean Time To Fail" and "Mean Time To Repair," respectively. Each resource is given an estimate of when it may be knocked out of commission, or fail. Each resource is assigned a different MTTF to add to the stochastic nature of the program and is not synced to the resource schedule. That is, a disruption could occur in non-working hours and not affect production; however, these disruptions could also occur during working hours and be disruptive, adding to the randomness of the program. Each time the Maintenance transition fires, it sets the resource capacity to zero for that particular trade for one working day. After that, the Repair transition fires and sets the resource capacity back to full capacity (5 resource units). The statecharts modeled for this simulation represent specifically a "Disruption of Unreliability" as described in the literature review.

Figure 5.10: Experiment 2. Type B disruptions (those internal and systematic) are modeled as statecharts (statechart, statechart1, statechart2, statechart3, statechart4). The parameters MTTF and MTTR contain the time between failures and the time to recovery, respectively.
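The two disruption types can be illustrated with a short, hedged sketch of how a trade's capacity might be reduced and restored. This is not the AnyLogic implementation; the class and method names (DisruptionSketch, applyTypeA, and so on) are hypothetical, and the exponential draw around MTTF stands in for the statechart timing, whose distribution the text does not specify.

    import java.util.Random;

    // Hedged sketch of the two disruption types described above, reduced to
    // capacity bookkeeping for a single trade (full capacity = 5 resource units).
    public class DisruptionSketch {

        static final int FULL_CAPACITY = 5;
        int capacity = FULL_CAPACITY;
        final Random rng = new Random();

        // Type A ("act of God") event: reduces the trade's capacity by 3 units.
        // In the model this loss persists until the next Type B repair resets capacity.
        void applyTypeA() {
            capacity = Math.max(0, capacity - 3);
        }

        // Type B ("Disruption of Unreliability"), two statechart transitions:
        void maintenanceFires() {
            capacity = 0;              // Maintenance fires: out of order for one working day
        }

        void repairFires() {
            capacity = FULL_CAPACITY;  // Repair fires: capacity returns to 5 units
        }

        // Assumed stand-in for the statechart timing: an exponential draw around a
        // mean time to fail (MTTF, in days). The actual distribution used in the
        // model is not stated, so this choice is illustrative only.
        double nextFailureInDays(double mttfDays) {
            return -mttfDays * Math.log(1 - rng.nextDouble());
        }
    }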
5.2.5.3 Experiment 3 Simulation
Experiment 3 expands Experiment 2 by adding agents to the simulation model. These agents can sense their environment, specifically the amount of work entities moving through the system in the vicinity of their trade, and adjust the speed with which they do work accordingly. Specifically, they sense whether entities are residing in the queues waiting to be processed. If the queue preceding their trade becomes too long, they work faster; however, if the queue succeeding their work becomes too long, they slow their pace. The acceptable queue lengths are described as ranges and are illustrated below. The agents' behavior seeks to counteract the increased production pressure and to provide the thoroughness side of the ETTO. The time to work on each entity is the "delay" in the service object mentioned above. Each trade has one agent, which represents a crew. The crew has three working speeds: normal, faster, and slower. Every crew has the same speed distribution. The normal speed is, as described in Experiment 1, a triangular distribution with the arguments (2.5, 3, 3.5), which correspond to the days needed to work on an entity. The faster and slower speeds are distributed as (1.5, 2, 2.5) and (4.5, 5, 5.5), respectively. As an example, the workers in Trade B will work at normal speed as long as the queue preceding it holds between zero and two entities (including the endpoints) and the Trade C queue is less than or equal to 2. They will work at an increased speed if the Trade B queue is less than two and the Trade C queue is less than or equal to 5. Finally, they will work at a slower pace, so that work is not "piled on" the succeeding crew, when the Trade C queue is greater than or equal to 15 units. These conditions were coded into the AnyLogic program and a sample is shown below in Figure 5.11.

Figure 5.11: Experiment 3 Java coding for Trade B.

5.2.5.4 Experiment 4 Simulation
In Experiment 4 the agents' behavior was changed. Instead of looking to the queue immediately succeeding it and adjusting the installation time, each crew only sought to maximize its own effort. If the queue preceding it, its own queue, reached a threshold level, it sped up; otherwise it worked at normal speed. This behavior mirrors a more efficient (as compared to thorough) attitude in the ETTO Principle outlook. The Java coding for this behavior is shown in Figure 5.12.

Figure 5.12: Experiment 4 Java coding for Trade B.
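As a hedged companion to Figures 5.11 and 5.12, the two crew behaviors can be summarized as plain Java decision rules. The class, method, and parameter names are hypothetical; the slow-down trigger (a succeeding queue of 15 or more units) comes from the text, while the remaining boundaries are illustrative assumptions, and the triangular arguments are the (min, mode, max) day values stated above.

    // Illustrative decision rules for one crew (e.g., Trade B), contrasting the
    // "thorough" behavior of Experiment 3 with the "efficient" behavior of Experiment 4.
    public class CrewBehaviorSketch {

        // Working speeds expressed as triangular (min, mode, max) delay arguments, in days.
        static final double[] NORMAL = {2.5, 3.0, 3.5};
        static final double[] FASTER = {1.5, 2.0, 2.5};
        static final double[] SLOWER = {4.5, 5.0, 5.5};

        // Experiment 3 ("thorough"): watch both the preceding queue and the succeeding
        // trade's queue, slowing down rather than piling work on the next crew.
        static double[] thoroughSpeed(int precedingQueue, int succeedingQueue) {
            if (succeedingQueue >= 15) {                         // downstream backlog: slow down (value from the text)
                return SLOWER;
            }
            if (precedingQueue > 2 && succeedingQueue <= 5) {    // assumed speed-up boundary
                return FASTER;
            }
            return NORMAL;                                       // otherwise work at the normal pace
        }

        // Experiment 4 ("efficient"): look only at the crew's own queue and maximize
        // its own effort once a threshold is reached; the downstream queue is ignored.
        static double[] efficientSpeed(int ownQueue, int ownQueueThreshold) {
            return (ownQueue >= ownQueueThreshold) ? FASTER : NORMAL;
        }
    }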
5.2.6 Understanding of the Experiments
The purpose of developing the conceptual model for this work is to show an exemplar of how hybrid computer simulation could be used to illustrate RE principles under simulated field conditions. It would be difficult and overly complex to represent all of the RE principles in one model. Specifically, the ETTO Principle was illustrated with conditions of a typical construction production process subject to internal and external disruptions and increasing production pressure. The agents contained the ETTO behaviors of either efficiency or thoroughness. The elements of a production line and disruptions were used to correspond to the RE premise that production and safety are inseparable and that disruptions set the stage for accidents. Stochasticity was added to the system in the form of triangular probability distributions for the delay times and internal disruptions. Simulations were run over an approximately six-month period, which corresponded to the project length. With the exception of the increasing production pressure as entered by the modeler, all variables were kept constant for each Experiment and simulation run. Twenty simulation runs were made for each delivery scheme. The average throughput of each delivery scheme was compiled and compared; an example is given in Table 5.5. Additionally, the results for each delivery scheme were graphed; an example is shown in Figure 5.13.

Table 5.5: Compiled Average Throughput of the system for each experiment.

Figure 5.13: Graph showing the behavior of the production system subject to disruptions and increasing production pressure. The production pressure is manifest as a delivery scheme that progressively increases deliveries each day of the week. For instance, one unit is delivered on Monday only, and then 1 unit is delivered on Monday and Tuesday, and so on. The data for this graph is shown in Table 5.8.

Throughput was used as a proxy for performance and evaluation of the system. In general, most industries, including construction, aim for high throughput to maximize profits. The highest average throughput, 240.5 units completed over the length of the project, occurred during Experiment 3 on the Monday, Tuesday, Wednesday only delivery schedule with five deliveries per day. This scheme had a high standard deviation of 26.8 with a range of 201 to 292 units over the length of the project. Standard deviation may be a proxy for variation in the system.

Experiment 1: highest throughput 222.25; occurs at delivery scheme MTWTh only, 3 deliveries per day; standard deviation 12.3
Experiment 2: highest throughput 189.5; occurs at delivery scheme MTWTh only, 5 deliveries per day; standard deviation 21.5
Experiment 3: highest throughput 240.5; occurs at delivery scheme MTW only, 5 deliveries per day; standard deviation 26.8
Experiment 4: highest throughput 234; occurs at delivery scheme MTWTh only, 5 deliveries per day; standard deviation 22.1

Table 5.6: Compiled performance statistics of the model.

Based on trends in the raw data and the compiled data in Table 5.6, the project would benefit from crews that take a more thorough approach (in terms of the ETTO Principle) and from choosing a Monday, Tuesday, Wednesday delivery schedule that delivers five units per day. One reason that the performance is superior in this Experiment may involve the use of agents that seek to be thorough rather than strictly efficient. These crews may be able to mitigate the disruptions in a manner that benefits the system.

The data appear to mimic Woods and Wreathall's (2008) Stress-Strain Analogy for RE shown in Figure 2-6. This is most clearly shown in Figure 5.13 above. Increasing demands placed on the system, with the result of increased throughput, are plotted on the y-axis (i.e., "Average Throughput"). This is what Woods posits corresponds to the stress. The strain, or how the system stretches to accommodate increasing production pressure, is plotted on the x-axis (i.e., "Number of Deliveries per Day of the Week"). The linear portion of the curves represents the "on-plan" or "uniform" area. Here the project can easily handle the production schedule (or deliveries) even in the face of disruption. At the inflection point of about 213 units, using Woods and Wreathall's analogy, the project begins to adapt to handle the increased production and disruptions. In material science this is called the plastic region; Woods and Wreathall call it the "extra region," and it represents first-order adaptive capacity to deal with the increased demand. In this region the data indicate that the variability of the system increases, as the standard deviation jumps sharply. Comparing Woods and Wreathall's idealized graph with Figure 5.13, a noticeable difference is that the former is decidedly parabolic in the extra region, indicating continually changing conditions, whereas Figure 5.13 is linear in the extra region. This may be accounted for by the fact that the variables for all of the trades are equal across the experiments. At around 230 units, Experiments 3 and 4 show a slight downward trajectory as the demands increase. This may correspond to Woods and Wreathall's area of second-order adaptive capacity. If resources are not increased in this region, it is predicted that the system will fail.

5.3 Verification and Validation
Verification is the process of model debugging (Martinez 2010). North and Macal (2007) define a verified model as one that works as designed. This model was verified by code debugging, logic examination, and comparing several runs of the simulation with hand calculations. Code debugging is embedded in the Java/Eclipse platform of AnyLogic.
Code errors were corrected as they were found. The logic examination and a hypothetical test case were completed hand-in-hand to ensure the model was running as envisioned and designed by the author. A simple test case of the delivery schedules was conducted that tested the agents' intervals and tolerances for anticipated and completed work. All of the data were examined for reasonableness as the experiments were conducted.

In the modeling world there is great debate over the need for, and the process of, validation. This work was verified but not validated. Opinions run the gamut from the need for absolute validation (Heath 2010) to the impossibility of validation (Sterman 2000). Martinez (2010) posits that "A model is considered valid only for the purpose for which it is built, and not in absolute terms." In terms of Epstein's 16 ways models are useful (as presented in Chapter 3), this model was built to explain RE principles and the ETTO, especially the interplay among production, safety, and disruptions; to illuminate the core dynamics of RE; to demonstrate tradeoffs; and to challenge the robustness of prevailing theory through perturbations. The computer simulation was designed to be as abstract as possible while still representing a typical construction process. The simulation was inspired by the "Parade of Trades" dice game used by the Lean Construction Institute (LCI) to illustrate variability in construction operations.

The author did not find any validation procedures for hybrid modeling. Validation schemes do exist for agent-based and discrete event systems individually, but there does not appear to be a consensus in the literature with regard to best practices for validating either approach. Speaking of DES and the relationship among the purpose of a study, the rigor it requires, and its validation, Martinez (2010) states:

"When DES is used to demonstrate some principle or the effectiveness of some technique, the emphasis should be on capturing the principle faithfully. Input distributions need to make sense and be reasonable to individuals knowledgeable in the area of application, but they do not need to be based on collected data."

and

"Many DES models exist solely to demonstrate a research product that in some ways claims to advance the state-of-the-art in simulation modeling. This is another case where the specific inputs and details that are used are of no significant consequence unless the research is related to data synthesis—they must simply be reasonable in order to give credibility to the work."

No data were collected for this work; the inputs were based on the author's experience in the construction industry and are reasonable for this work. In his dissertation, Son (2011) presents the argument that validating ABMs is more difficult than validating other models due to path dependencies, multiple equilibria, and even the absence of equilibrium altogether. He cites Miller and Page (2007) and Banks et al. (2004) in observing that "The validity of ABM should be evaluated not only by predictive capability but also explanation accuracy of formal models unlike traditional simulation approaches where validation is the overall process of comparing the model and its behavior to the real systems and its behavior." Son posits that in ABMs, "If we examined actual test data to test simulation results, it would be like comparing points to clouds of points.
Any given result of an ABM would differ from others depending upon the random features inherent in the rules." Son also cites Yilmaz (2006) in taking the position that an ABM should be considered valid on the basis of qualitative and subjective evaluations of its contextual adequacy rather than as an objective representation of the system under study. Therefore, Son suggests using North and Macal's (2007) guidance for ABM validation, consisting of agent behavior validation, interaction validation, and emergent structure validation in addition to the validation of model inputs, model outputs, and processes in the model:
• Agent behavior validation: Do agent behaviors correspond to agents in the real world? What theory is included in the model about agents?
• Interaction validation: Do interaction mechanisms correspond to agents in the real world? What theory is included in the model about processes?
• Emergent structure validation: Does the model look right?

The agent behavior and interaction in this work closely match the real world. They are grounded in RE premises and theory, especially the notion that "Safety and field operations management are inseparable and do not operate independently." Additionally, the model explores the premise that performance conditions are always underspecified in the face of disruptions and that adjustments must be made. Finally, it examines performance variability in the form of the ETTO. With respect to interaction validation, I argue that the mechanisms are grounded in real-world observations and that the inspiration for the process, the "Parade of Trades" dice game, is a well-tested physical simulation that has been used by industry and academia for many years to illustrate performance variability. Finally, the emergent structure validation is found in the similarity between Woods and Wreathall's (2008) Stress-Strain Analogy for RE shown in Figure 2-13 and the output graph shown in Figure 5.13; this is further explained in section 5.2.6. The model "looks right" and acts as expected over a broad range of inputs.

In summary, the rigor of both DES and ABM validation depends upon the question being addressed and the purpose of the model. Son (2011) notes that:

"When the purpose of a computational [model] is explanations of behavior and hypothesis testing, it is important to abstract the complexity of the real world to obtain insight and a parsimonious explanation. When the purpose is exploration and theory generation, validity requires a confirmation that the model makes sense in larger world of possibility. Then, when the purpose is prediction and advice, the validity should take more practical dimensions into consideration."

The purpose of this research is to understand disruptions and how they might affect safety, using the emerging paradigm of RE, by developing a conceptual framework. That is, to use Son's description, the aim is to explore whether RE might be useful in a larger sense in the construction industry; for that purpose, the level of validation is sufficient in the author's opinion, and no outside data collection is necessary at this stage of the research.

Chapter 6: Conclusions
This chapter summarizes how the Goals, Research Questions, and Objectives of this research were met. It also states the expected contributions of this research, addresses the limitations of the research, and identifies possible future research agendas.
6.1 Review of the Research Goals, Questions, and Objectives
This research was motivated by a desire to gain a better understanding of disruptions on construction projects. It is suggested that the emerging paradigm of RE provides a lens through which disruptions may be better understood, harnessed, and foreseen on construction projects. One reason this is of interest is the disproportionately high accident and fatality rate in the construction industry as compared to other industries. In general, the phenomena of disruptions are not well understood, categorized, or researched in the construction industry.

The goal of this study, to explore schemes and methods to understand, harness, and foresee disturbances that arise from demands placed on the construction operations of project-based organizations that deliver the built environment, was partially met by an examination and analysis of the literature, the creation of a conceptual framework for RE implementation on construction projects, and a computer simulation that mimicked a production process subject to disruptions and populated by agents. The literature review chronicled previous schemes and methods over the history of industrialized society to deal with safety issues in all industries. The focus of this work was on exploring the concept of RE as one way to meet this goal. Hybrid modeling was used to explore its utility in simulating internal and external disruptions with an RE principle as a backdrop. Specifically, the notion of the ETTO was used as inspiration for the hybrid computer model.

The Research Questions were approached and answered in various parts of the dissertation. The first question, "How does RE differ from traditional ways of thinking about how to deal with disruptions?" was answered in the literature review. A distinct trajectory was traced that led to the current thinking about RE. Many traditional approaches to safety and risk management are encapsulated in the concept of RE, but are viewed anew through a "lens" of resilience. Some traditional safety approaches, such as the idea of placing blame due to "human error," are rejected because many systems have become so complex that the blame is more likely systemic than due to a singular person or event. RE also appears to be closely aligned with reliability concepts. Indeed, RE researchers point out that reliability may be necessary to handle disruptions that may endanger safety, but that reliability is not enough; a system also needs to be resilient to effectively deal with disruptions. The second question, "What are the current principles and practices of RE?" was addressed in the literature review. The third question, "What elements of RE may help a construction project avoid, survive, and recover from disruptions?" was answered in the Framework. Finally, the fourth question, "How can we begin to simulate disruptions in construction operations and use RE principles?" was explored in the computer simulation.

Objective 1, "Abstract the concept and underlying theories of RE and explore RE deployment in non-construction industries for use in formulating Objectives 2 and 3," was met via the literature review. The conceptual framework mentioned in Objective 2 is presented in Chapter 4. Finally, Objective 3 is explored in the simulation presented in Chapter 5.

6.2 Contributions to Knowledge of this Research
This research provides several contributions to the advancement of the body of knowledge in the construction industry.
In general, it introduces the idea that the emerging paradigm of RE may be a way to deal conceptually and practically with disruptions to the construction process. Specifically, it posits that RE is worthy of further investigation to alleviate the chronic safety problem in the construction industry. Other contributions include the extensive literature review, a conceptual framework that provides a scaffold that other researchers and practitioners may build upon, and a process and example of hybrid simulation modeling that incorporates disruptions and worker behavior in the context of a production process. These are discussed below.

The literature review is a contribution in itself for two reasons. First, RE is an emerging paradigm, and the entire "story" of how it came into being and how it relates to the approaches to risk management and safety that came before it is neither clear nor chronicled in the literature. The literature review in this work expanded upon the work of Hale and Hovden (1998) and Borys et al. (2009) to provide a complete picture of the trajectory of RE by elucidating the "Five Ages of Safety." This clarity is not found elsewhere in the literature. The second contribution of the literature review was to expose the gap that exists in disruptions research in construction; in the literature, disruption analysis appears to be primarily confined to delay claims for litigation.

A second contribution of this work is the development of the conceptual framework to provide a starting point for the discussion of the use of RE in construction. To the author's knowledge, no other research has attempted to abstract and formalize RE principles and translate them to the construction industry. The conceptual framework in this work follows Smyth's (2004) definition and is intended to help form the "…agenda for negotiation to be scrutinised and tested, reviewed and reformed as a result of investigation." The Framework provides scaffolding for other researchers to test, build upon, and debate.

Finally, this work provides a conceptual simulation model that may be adapted to explore other areas and facets of production systems subject to internal and external disruptions of varying magnitude. The RE-related ETTO Principle was used as an exemplar to test the production and disruption scheme. The model appears to offer greater insight into and understanding of RE and can be adapted to include richer agent behaviors and to test other RE scenarios.

6.3 Limitations of this Research
The main limitation of this study is that it is abstract and conceptual and may or may not represent actual conditions. In other words, it needs to be tested against a concrete construction project scenario. This applies to both the Framework and the Simulation. Another limitation lies in the fact that not all aspects of RE have been explored; including all of the various aspects of RE in a single simulation may prove unwieldy and overly complicated. Finally, the literature review revealed a gap in research on and understanding of the phenomena of disruptions and their relation to the production process. More research needs to be conducted in this area.

6.4 Future Agenda
The area of RE is ripe for future research. The limitations of the study point the way to future research in the areas of disruptions, RE, production, and the interrelations among the three. Further studies need to be completed on the nature and frequency of occurrence of disruptions on construction projects.
Some future research ideas include measuring the baseline resilience of construction companies and identifying their strengths and weaknesses in terms of response, anticipation, monitoring, and learning. This work could be used to calibrate the Framework and Simulation. Other work could include simulating further areas of RE, such as a Just Culture and organizational behaviors, the so-called "soft" factors, using the computational technique of system dynamics. This would enhance and triangulate the analysis by providing a richer and fuller understanding of RE. System Dynamics (SD) was developed at MIT in the 1950s by Jay Forrester. The SD method helps users better understand how complex systems function over time. SD computes interactions among abstracted system elements and allows the programmer to model non-linear feedback loops. In short, SD focuses on cause and effect in systems and the related feedbacks (Sterman, 2000). System Dynamics uses a high level of aggregation and continuous flows, and focuses on the pattern of behavior produced by a system, such as increasing costs, decreasing quality, and stagnating waiting times (Sadsad and McDonnell 2007). SD has been used in construction to examine how construction managers learn and to develop curriculum based on the SD model (Mukherjee et al. 2005). SD is considered a "top-down" approach, as previously discussed.

BIBLIOGRAPHY

Adamski, A. J. & Westrum, R. (2003). Requisite Imagination: The Fine Art of Anticipating What Might Go Wrong. In E. Hollnagel (Ed.), Handbook of cognitive task design. Mahwah, NJ: Lawrence Erlbaum Associates.

Aldrich, M. (1997). Safety first: technology, labor, and business in the building of American work safety, 1870-1939. Baltimore, MD: Johns Hopkins University Press.

Association for the Advancement of Cost Engineering (AACE) International, Inc. (2004). "Estimating lost labor productivity in construction claims," AACE International Recommended Practice No. 25R-03.

Axelrod, R. and Tesfatsion, L. (2005). "On-Line Guide for Newcomers to Agent-Based Modeling in the Social Sciences." (April 21, 2009).

Barroso, M. P., and Wilson, J. R. (1999). "HEDOMS - Human Error and Disturbance Occurrence in Manufacturing Systems: Toward the Development of an Analytical Framework". Human Factors and Ergonomics in Manufacturing, 9(1), 87-104.

Bertelsen, S. (2003). "Complexity – Construction in a New Perspective." Proc., 11th Conf. of the Int. Group for Lean Construction, Blacksburg, VA, USA.

Boin, A., and Schulman, P. (2008). "Assessing NASA's Safety Culture: The Limits and Possibilities of High-Reliability Theory". Public Administration Review, 68(6), 1050-1062.

Borshchev, A. and Filippov, A. (2004). "From System Dynamics and Discrete Event to Practical Agent Based Modeling: Reasons, Techniques, Tools." The 22nd International Conference of the System Dynamics Society, July 25-29, 2004, Oxford, England.

Borys, D., Else, D., and Leggett, S. (2009). "The fifth age of safety: the adaptive age." J. Health Saf. Res. Pract., 1(1), 19-27.

CPWR – The Center for Construction Research and Training. (2008). The construction chart book: the U.S. construction industry and its workers. Silver Spring, MD.

Chenhall, Everon C. (2010). "Assessing safety culture, values, practices, and outcomes." PhD thesis, Colorado State University.

Choudhry, R. M., Fang, D., & Mohamed, S. (2007). "The nature of safety culture: A survey of the state-of-the-art." Safety Science, 45(10), 993-1012.
The Construction Industry Institute (CII), University of Texas at Austin. (1989). Management of project risks and uncertainties. Austin, TX: The Construction Industry Institute, University of Texas.

Díaz-Cabrera, D., Hernández-Fernaud, E., & Isla-Díaz, R. (2007). "An evaluation of a new instrument to measure organisational safety culture values and practices." Accident Analysis & Prevention, 39(6), 1202-1211.

Dekker, S. (2005). Ten questions about human error: a new view of human factors and system safety. New York: Lawrence Erlbaum.

Dekker, S. (2007). Just culture: balancing safety and accountability. Aldershot, England: Ashgate.

Epstein, J. M., & Axtell, R. (1996). Growing artificial societies: social science from the bottom up. Complex adaptive systems. Washington, D.C.: Brookings Institution Press.

Epstein, J. M. (2008). "Why model?" JASSS, 11(4), 28-47.

Gehbauer, F., Zülch, G., Ott, M., and Börkircher, M. (2007). "Simulation-based analysis of disturbances in construction operations." Proc., 6th Conf. of the Int. Group for Lean Construction, East Lansing, Mich., 571-579.

Glendon, I. A., Clarke, S., and McKenna, E. F. (2006). Human safety and risk management. Boca Raton: CRC/Taylor & Francis.

Groeneweg, J. (1998). Controlling the controllable: the management of safety. Leiden, Netherlands: DSWO Press.

Grote, G., and Kunzler, C. (2000). "Diagnosis of safety culture in safety management audits". Safety Science, 34(1), 131-150.

Guldenmund, F. W. (2000). "The nature of safety culture: a review of theory and research". Safety Science, 34(1-3), 215-257.

Hale, A. R. & Hovden, J. (1998). "Management and culture: the third age of safety. A review of approaches to organizational aspects of safety, health and environment." In A.-M. Feyer & A. Williamson (Eds.), Occupational Injury: Risk, Prevention and Intervention. London: Taylor & Francis Ltd., 129-165.

Halpin, D. W., & Woodhead, R. W. (1976). Design of construction and process operations. New York: Wiley.

Heath, B. L. (2010). "The history, philosophy, and practice of agent-based modeling and the development of the conceptual model for simulation diagram." PhD dissertation, Wright State University, Dayton, OH.

Heinrich, H. W., Petersen, D., Roos, N. R., Brown, J., and Hazlett, S. (1980). Industrial accident prevention: a safety management approach. New York: McGraw-Hill.

Herrera, I. A., Hollnagel, E., Macchi, L., & Woltjer, R. (2010). "Exploring Resilience Engineering Contribution to Risk Analysis in Air Traffic Management." October. EUROCONTROL.

Hollnagel, E. (2004). Barriers and accident prevention. Aldershot, Hampshire, England: Ashgate.

Hollnagel, E., and Woods, D. D. (2005). Joint cognitive systems: foundations of cognitive systems engineering. Boca Raton, FL: Taylor & Francis.

Hollnagel, E., Woods, D. D., & Leveson, N. (2006). Resilience engineering: concepts and precepts. Aldershot, England: Ashgate.

Hollnagel, E. (2007). "Resilience Engineering Demystified". Chair's Newsletter. <http://www.crc.mines-paristech.fr/isc/newsletter/news1_1.pdf>. (May 23, 2008).

Hollnagel, E., Nemeth, C. P., & Dekker, S. (2008). Resilience engineering perspectives. Volume 1, Remaining sensitive to the possibility of failure. Aldershot, England: Ashgate.

Hollnagel, E. (2009). The ETTO principle: efficiency-thoroughness trade-off: why things that go right sometimes go wrong. Farnham, England: Ashgate.

Hollnagel, E. (2011). "Epilogue: RAG – the resilience analysis grid." In Hollnagel, E., Pariès, J., Woods, D. D., & Wreathall, J. (Eds.), Resilience engineering in practice: a guidebook. Farnham, England: Ashgate.
Hollnagel, E., Pariès, J., Woods, D., & Wreathall, J. (2011). Resilience engineering in practice: a guidebook. Farnham, England: Ashgate.

Hovden, J., Albrechtsen, E., and Herrera, I. A. (2010). "Is there a need for new theories, models and approaches to occupational accident prevention?" Safety Science, 48(8), 950-956.

Ibbs, W., Nguyen, L. D., and Lee, S. (2007). "Quantified Impacts of Project Change," J. Prof. Issues Eng. Educ. Pract., 133(1), 45-52.

International Ergonomics Association (IEA) (2011). "Definition of Ergonomics". (Sep. 3, 2011).

Jackson, S. (2010). Architecting resilient systems: Accident avoidance and survival and recovery from disruptions. Hoboken, NJ: Wiley.

Kuenzi, M., and Schminke, M. (2009). "Assembling Fragments Into a Lens: A Review, Critique, and Proposed Research Agenda for the Organizational Work Climate Literature". Journal of Management, 35(3), 634-717.

Kuhn, Thomas S. (1996). The structure of scientific revolutions. Chicago, IL: University of Chicago Press.

Kuivanen, R. (1996). Disturbance control in flexible manufacturing. Int. J. Hum. Factors Manuf., 6(1), 41-56.

Lay, E. (2011). "Practices for noticing and dealing with the critical. A case study from maintenance of power plants."

Leveson, N. (2004). "A new accident model for engineering safer systems." Safety Science, 42(4), 237-270.

Lindau, R. A., & Lumsden, K. R. (1995). "Actions taken to prevent the propagation of disturbances in manufacturing systems." Int. J. Prod. Econ., 41(1), 241-248.

Lingard, H., and Rowlinson, S. (2005). Occupational health and safety in construction project management. London: Spon Press.

Macchi, L., and Hollnagel, E. (2011). "A Resilience Engineering approach for the evaluation of performance variability: development and application of the Functional Resonance Analysis Method for air traffic management safety assessment." Ph.D. Thesis. Paris: MINES ParisTech.

Madni, A. M., and Jackson, S. (2009). "Towards a conceptual framework for resilience engineering". IEEE Systems Journal, 3(2), 181-191.

Manuele, F. A. (2008). Advanced safety management focusing on Z10 and serious injury prevention. Hoboken, NJ: Wiley-Interscience.

Martínez, J. C. (1996). "Stroboscope: state and resource based simulation of construction processes". Ph.D. Thesis, University of Michigan.

Martinez, J. C. (2010). "Methodology for Conducting Discrete-Event Simulation Studies in Construction Engineering and Management." Journal of Construction Engineering and Management, 136(1), 3-16.

McDonald, N. (2006). "Organizational resilience and industrial risk." In E. Hollnagel, D. D. Woods & N. Leveson (Eds.), Resilience engineering: concepts and precepts. Aldershot, England: Ashgate, 155-180.

Mendonca, D. (2008). Measures of Resilient Performance. In: Hollnagel, E., Nemeth, C. P., & Dekker, S. (Eds.), Resilience engineering perspectives. Volume 1, Remaining sensitive to the possibility of failure. Ashgate studies in resilience engineering. Aldershot: Ashgate.

Miller, J. H., & Page, S. E. (2007). Complex adaptive systems: an introduction to computational models of social life. Princeton, NJ: Princeton University Press.

Mitropoulos, P., Abdelhamid, T. S., & Howell, G. A. (2005). Systems Model of Construction Accident Causation. Journal of Construction Engineering and Management, 131, 816-825.

Mukherjee, A., Rojas, E., & Winn, W. (2005). "Exploring Mental Models of Construction Managers." ASCE Construction Congress 2005, San Diego.

Nemeth, C. P., Hollnagel, E., & Dekker, S. (2009). Resilience engineering perspectives. Vol. 2, Preparation and restoration. Farnham, England: Ashgate.
North, M. J., & Macal, C. M. (2007). Managing business complexity: discovering strategic solutions with agent-based modeling and simulation. Oxford: Oxford University Press.

Oglesby, C. H., Parker, H. W., & Howell, G. A. (1989). Productivity improvement in construction. New York: McGraw-Hill.

Paries, J. (2011). "Lessons from the Hudson." In Hollnagel, E., Pariès, J., Woods, D. D., & Wreathall, J. (Eds.), Resilience engineering in practice: a guidebook. Farnham, England: Ashgate.

Pariès, J. (2012). "Resilience and the ability to respond." In Pariès, J., Hollnagel, E., Wreathall, J., & Woods, D. D. (Eds.), Resilience engineering in practice: A guidebook. Ashgate Publishing, Ltd., 3-8.

Perrow, Charles. (1984). Normal accidents: living with high-risk technologies. New York: Basic Books.

Qureshi, Z. H., Ashraf, M. A., and Amer, Y. (2007). "Modeling industrial safety: A sociotechnical systems perspective," Industrial Engineering and Engineering Management, 2007 IEEE International Conference on, pp. 1883-1887, 2-4 Dec. 2007.

Rasmussen, Jens. (1997). "Risk management in a dynamic society: a modelling problem". Safety Science, 27(2), 183-213.

Rasmussen, J., & Svedung, I. (2000). Proactive risk management in a dynamic society. [S.l.]: Swedish Rescue Services Agency.

Re, A., and Macchi, L. (2010). "From cognitive reliability to competence? An evolving approach to human factors and safety". Cognition, Technology & Work, 12(2), 79-85.

Reason, J. T. (1990). Human error. Cambridge, England: Cambridge University Press.

Reason, J. T. (2000). "Human error: models and management." NCBI Resources, (Mar. 22, 2009).

Reason, J. T. (2008). Managing the risks of organizational accidents. Aldershot, Hants: Ashgate.

Roberts, K. H. (1993). New challenges to understanding organizations. New York: Macmillan.

Roberts, K. H. (2003). "HRO Has Prominent History." (Nov. 5, 2010).

Robinson, S. (2007a). "Conceptual modelling for simulation Part I: definition and requirements." Journal of the Operational Research Society, 59(3), 278-290.

Robinson, S. (2007b). "Conceptual modelling for simulation Part II: a framework for conceptual modelling." Journal of the Operational Research Society, 59(3), 291-304.

Sadsad, R. and McDonnell, G. (2007). "Using multi-scale systems simulation to evaluate health records solutions to improve medication use by the elderly in the community". 8th PhD Colloquium of the Student Chapter of the System Dynamics Society.

Sagan, S. D. (1993). The limits of safety: organizations, accidents, and nuclear weapons. Princeton, NJ: Princeton University Press.

Sarter, N. B., Woods, D. D., & Billings, C. E. (1997). "Automation surprises." Handbook of human factors and ergonomics, 2nd Ed., G. Salvendy (Ed.), Wiley, 1926-1943.

Saurin, T. A., Formoso, C. T., and Cambraia, F. B. (2004). "A Human Error Perspective of Safety Planning and Control." Proc. of 12th Annual Conf. of the International Group for Lean Construction (IGLC-12), Elsinore, Denmark.

Schein, E. H. (1999). The corporate culture survival guide: sense and nonsense about culture change. San Francisco, CA: Jossey-Bass.

Schein, E. H. (2010). Organizational culture and leadership. San Francisco, CA: Jossey-Bass.

Senge, P. M. (2006). The fifth discipline. London: Random House Business.

Sheard, S., & Mostashari, A. (2008). A Framework for System Resilience Discussions. In 18th Annual International Symposium of INCOSE, Utrecht, Netherlands.
Shields, P. M. and Tajalli, H. (2006). "Intermediate Theory: The Missing Link to Successful Student Scholarship." Journal of Public Affairs Education, 12(3), 313-334.

Simon, H. A. (1997). Models of bounded rationality. Cambridge, MA: MIT Press.

Smyth, R. (2004). "Exploring the Usefulness of a Conceptual Framework as a Research Tool: A Researcher's Reflections". Issues in Educational Research, 14(2), 167-180.

Son, JeongWook. (2011). "An integrated model of evolution of project teams in large-scale construction projects." PhD Dissertation, University of Washington, Seattle, WA.

Sterman, John. (2000). Business dynamics: systems thinking and modeling for a complex world. Boston: McGraw-Hill.

Syal, M. (1998). "Construction Research Agenda: Focus Areas and Topics." American Professional Constructor, 22(2), 8-12.

Tjorhom, B. and Aase, K. (2011). "The art of balance: using upward resilience traits to deal with conflicting goals." In Hollnagel, E., Pariès, J., Woods, D. D., & Wreathall, J. (Eds.), Resilience engineering in practice: a guidebook. Farnham, England: Ashgate.

Toulouse, G. (2002). "Accident risks in disturbance recovery in an automated batch-production system". Human Factors and Ergonomics in Manufacturing & Service Industries, 12(4), 383-406.

Trist, E., and Bamforth, K. (1951). "Some Social and Psychological Consequences of the Longwall Method of Coal-Getting". Human Relations, 4(1), 3-38.

Watkins, M., Mukherjee, A., Onder, N., and Matilla, K. (2009). "Using Agent-Based Modeling to Study Construction Labor Productivity as an Emergent Property of Individual and Crew Interactions." Journal of Construction Engineering and Management, 135(7), 657-667.

Weick, K. E. (1996). "The collapse of sensemaking in organizations: The Mann Gulch disaster". Wildfire.

Weick, K. E., Sutcliffe, K. M., and Obstfeld, D. (1999). Organizing for high reliability: processes of collective mindfulness. Stamford: JAI Press.

Weick, K. E., and Sutcliffe, K. M. (2001). Managing the unexpected: assuring high performance in an age of complexity. San Francisco: Jossey-Bass.

Weick, K. E., Sutcliffe, K. M., and Obstfeld, D. (2005). "Organizing and the Process of Sensemaking". Organization Science, 16(4), 409-421.

Westrum, R. & Adamski, A. J. (1999). "Organizational Factors Associated with Safety and Mission Success in Aviation Environments." In D. J. Garland, J. A. Wise & V. D. Hopkin (Eds.), Handbook of Aviation Human Factors. Lawrence Erlbaum, Mahwah, NJ.

Westrum, R. (2006). "All coherence gone, New Orleans as a resilience failure." In E. Hollnagel & E. Rigaud (Eds.), Proceedings of the 2nd Resilience Engineering Symposium. Paris: Mines Paris Les Presses.

Wild, A. (2005). "Uncertainty and Information in Construction: From the Socio-Technical Perspective 1962-1966 to Knowledge Management - What Have We Learned?" In Knowledge Management in the Construction Industry: A Socio-Technical Perspective, ed. Abdul Samad Kazi, 203-224.

Woltjer, R., & Hollnagel, E. (2007). "The Alaska Airlines Flight 261 accident: A systemic analysis of functional resonance." In International Symposium on Aviation Psychology (ISAP), Wright State University, 763-768.

Woods, D. D., & Cook, R. I. (2002). Nine Steps to Move Forward from Error. Cognition, Technology and Work, 4, 137-144.

Woods, D. D. and Hollnagel, E. (2006). "Prologue: Resilience Engineering Precepts." In: Hollnagel, E., Woods, D. D., and Leveson, N. (Eds.), Resilience engineering: concepts and precepts. Aldershot, England: Ashgate.
Woods, D. D., Patterson, E. S., & Cook, R. I. (2007). "Behind Human Error: Taming Complexity to Improve Patient Safety." In P. Carayon (Ed.), Handbook of Human Factors and Ergonomics in Health Care and Patient Safety. Lawrence Erlbaum Associates, Mahwah, New Jersey.

Woods, D. D. and Wreathall, J. (2008). "Stress-strain plots as a basis for assessing system resilience." In: Hollnagel, E., Nemeth, C., and Dekker, S. (Eds.), Remaining sensitive to the possibility of failure. Ashgate Publishing Company, Aldershot, pp. 143-158.

Woods, D. D., Dekker, S., & Cook, R. (2010). Behind human error. Farnham: Ashgate.

Wreathall, J. (2009). "Leading? Lagging? Whatever!" Safety Science, 47(4), 493-494.

Yilmaz, L. (2007). Modelling Software Processes as Human-Centered Adaptive Work Systems. Lecture Notes in Computer Science, 4764, 148-159.

Yilmaz, L. (2009). "Toward Systems Engineering for Agent-directed Simulation," Agent-Directed Simulation and Systems Engineering, L. Yilmaz and T. Oren, eds., Wiley Series in Systems Engineering and Management, Wiley, 219-236.