This is to certify that the dissertation entitled

INNOVATION IN PUBLIC SECTOR ORGANIZATIONS: A TEST OF THE MODIFIED RD&D APPROACH

presented by

David B. Roitman

has been accepted towards fulfillment of the requirements for the Ph.D. degree in Psychology.

Major professor

Date: 2/10/84

MSU is an Affirmative Action/Equal Opportunity Institution
INNOVATION IN PUBLIC SECTOR ORGANIZATIONS: A TEST OF THE MODIFIED RD&D APPROACH

By

David B. Roitman

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Psychology

1984

ABSTRACT

INNOVATION IN PUBLIC SECTOR ORGANIZATIONS: A TEST OF THE MODIFIED RD&D APPROACH

By David B. Roitman

The present research was designed to test the viability of the modified Research, Development, and Diffusion (RD&D) approach to the dissemination of innovative programs. This approach emphasizes systematic development and evaluation involving practitioner input, and encourages interpersonal contact during dissemination from developer-based operations. A current effort to abandon this approach lacks empirical support. The following two major research questions were therefore addressed: (a) To what extent are modified RD&D programs implemented with fidelity (correspondence to original program models) at adopting sites? and (b) To what extent is the fidelity of implementation related to program effectiveness? Other research questions involved the development of an empirically-based definition and typology for the concept of reinvention, and an examination of the relationships among reinvention, fidelity, and effectiveness.

Seven social programs developed and disseminated nationwide with federal funding were studied. In general, results supported the modified RD&D approach. The programs were implemented with acceptable fidelity, and a significant correlation between fidelity and effectiveness was obtained. Agreement between telephone and site-visit results attested to the fidelity instrument's construct validity. However, these results were qualified by evidence of variance between programs. Based on this research, reinvention was defined as the use of materials, activities, procedures, or organizational structures that cannot be explained using the framework provided by developers to measure fidelity. Instances of reinvention were categorized as additions or modifications; as proactive or reactive; and, if reactive, as either externally or internally induced. The extent of reinvention was found to be positively related to both fidelity and effectiveness. Partial correlation analyses did not disconfirm a model which posited high levels of fidelity leading to high levels of reinvention and effectiveness. The positive relationship between reinvention and effectiveness was shown to be due to additions, rather than modifications.

ACKNOWLEDGEMENTS

This research was conducted as a team project. The intellectual environment created by the members of this team was one of the most stimulating and enjoyable I have experienced. After being in the arena with these individuals, I found that defending my ideas in other settings felt like child's play. The core of this group included Craig Blakely, Bill Davidson, Jim Emshoff, Rand Gottschalk, Jeff Mayer, and Neal Schmitt. Because of their good sense, good humor, and dedication, the project has been great fun as well as great learning. I hope that I will have the good fortune to work on equally balanced and spirited teams in the future--but I doubt it. I will certainly miss this crew of friends. Another major influence on this research was Lou Tornatzky. In fact, he was a team member in absentia--most of us formed our ideas about innovation research largely or in part through conversations with Lou, and his vision inspired this project from beginning to end.
In addition to his role as invisible guru for this project, Lou has been a great social scientist/activist role model for me. No one says more with fewer words of wisdom and integrity. Thanks, Lou.

In addition to their roles as team members, I'd like to address each of the core team members as individuals. I'll pick one thing to thank each for, although of course there are many. Bill's friendship and words of support have really been something to lean on during seven long years. Craig is "what it is"--solid, solid, solid. Rand managed to put up with me for four months on the road--and has been a great partner. Neal's calm and considerate brilliance is always a treat--one of the best teachers I've ever known. Jeff, I'm sorry you signed on so late, and Jim, I'm sorry you left early--two of my deepest pals up here. Thanks again, team.

Others who served in the trenches on this project included Phil Nickel, Jeanna Chodakowski, Dave Thompson, Theresa Narzniak, and Devi Smith (later for you, Joe)--all of whom I thank for their contributions. Becky Mulholland was the classic secretary doing it all, and Karen Garlock and Kelly Campbell also helped a lot. Also thanks to Suzy Pavick for typing this monster. And the cast of thousands--teachers, aides, jury clerks, youth workers, cops, counselors, administrators and service providers of all stripes, kids, jurors, neighbors, pre-release residents--the stuff of which these data are made--I hope our research does some justice.

Now to the individuals not directly involved with this project who have helped me make it through. I'll single out three from many more: Charlie Johnson, who has been there with just what I needed, whenever asked; Bill Fairweather, who invented this strange and marvelous enterprise of Ecological Psychology, which has been the dominant shaping force throughout seven years of my life--and I'm proud of it; and Don Davis--a great buddy, real smart guy, raconteur, connoisseur, elegant hobo--Don's into a lot of good stuff. For those reading this who've never struggled through the Ecological Program, let me tell you, it's tough. I couldn't have made it through without the constant support of my wife, Susan, who's made some incredible sacrifices putting up with all of this; and my parents, who have been parents in the best sense of the word. Thanks also to Joe Bornstein for his intellectual and culinary companionship, and to our good friends Joe and Linda. Thank you, all.

TABLE OF CONTENTS

LIST OF TABLES ............................. ix
LIST OF FIGURES ............................ x
INTRODUCTION .............................. 1
  Modifying the Classical RD&D Approach ............. 1
  Fidelity and Adaptation ...................... 5
  Middle Ground Positions ...................... 7
  The Need for Measurement Development .............. 10
  Precision and Accuracy in Measuring Fidelity ......... 11
  Are Modified RD&D Programs Implemented with Fidelity? .... 13
  Are Different Programs Implemented with Equivalent Levels of Fidelity? ........................ 14
  Are Programs Implemented with Fidelity Across Social Policy Areas? ......................... 15
  Is Fidelity Related to Program Effectiveness? ......... 15
  Reinvention ............................. 16
  Relevance of Reinvention to the RD&D Approach ......... 18
  Summary of the Research Questions ................ 18
METHOD .................................. 20
  Overview ............................... 20
  Innovative Social Programs and Vehicles for Their Dissemination .......................... 20
    Selection Criteria ....................... 23
  Description of Programs ..................... 24
  Sampling ............................... 24
    Sampling Strategy ....................... 24
    Unit of Analysis ........................ 26
    Respondents ........................... 27
  Measurement of Implementation Fidelity ............. 27
    Measurement Development for the Telephone Interview Instrument ......................... 27
      Preliminary Identification of Innovation Components .. 28
      Preliminary Identification and Scaling of Variations .. 30
      Feedback Interviews with Developers ............ 31
      Pilot Tests .......................... 32
    Data Collection for the Telephone Interview Instrument .. 33
    Reliability and Validity of the Telephone Interview Instrument ......................... 34
      Reliability .......................... 34
      Validity ............................ 34
    Measurement Development for the Site-Visit Instrument ... 35
    Data Collection for the Site-Visit Instrument ........ 35
    Reliability and Validity for the Site-Visit Instrument ...
      Reliability ..........................
      Validity ............................
  Measurement of Reinvention ....................
    Measurement Development ....................
    Site-Visit Data Collection ..................
    Content Analysis ........................
      Development of Criteria and Procedures ..........
      Content Analysis: Coding Procedures ............
    Empirically-Based Definition of Reinvention ........
    Typology of Reinvention ....................
      Addition-Modification ....................
      Proactive-Reactive ......................
    Extent of Reinvention .....................
    Summary of Typology and Extent of Reinvention Methods ...
  Measurement of Effectiveness ...................
    Effectiveness Criteria .....................
    Data Collection .........................
    Data Transformation ......................
  Summary of Methods .........................
RESULTS .................................
  Fidelity Per Se ...........................
    Overview .............................
    Are Modified RD&D Social Programs Implemented with Fidelity? ..........................
      Telephone Interview Results ................
      Site-Visit Results ......................
      Comparing Telephone and Site-Visit Results: Visual Comparison .......................
      Comparisons Between Telephone Interview and Site-Visit Results: Correlational and Percentage Agreement Analyses ....................
      Correlations Between Data Sets ...............
      Percentage Agreement Comparisons ..............
    Are There Differences in Fidelity of Implementation Between Programs? ......................
      Telephone Interview Results ................
      Site-Visit Results ......................
      Comparing Telephone and Site-Visit Results ........
    Are Programs Implemented with Fidelity Across Social Policy Areas? .....................
      Telephone Interview Results ................
      Site-Visit Results ......................
      Comparing Telephone and Site-Visit Results ........
  Reinvention Per Se .........................
    Overview .............................
    Descriptive Analyses ......................
    Differences Between Programs .................
      Sum of Instances .......................
      Weighted Instances ......................
      Unweighted Category Sums ..................
      Weighted Category Sums ...................
    Differences Between Policy Areas ...............
      Overall Reinvention Scores .................
      Reinvention Categories ...................
  Effectiveness Per Se ........................
    Overview .............................
    Types of Data ..........................
    Comparisons of Sample Sites with Demonstration Sites ....
  Relationships Among Fidelity, Reinvention, and Effectiveness ..........................
    Use of Site Visit Data .....................
    Tests for Non-Linearity ....................
    Data Transformations ......................
      Standardized Fidelity Scores ................
      Standardized Reinvention Scores ..............
      Normalized Effectiveness Scores ..............
    Simple Correlations: Relationships Among the Major Variables ..........................
    Partial Correlations ......................
DISCUSSION ................................
  Question 1: Are Modified RD&D Programs Implemented with Fidelity at Adopting Sites? ..................
  Question 1a: Are There Differences Between Sample Programs on Fidelity? .........................
  Question 1b: Are There Differences Between Two Policy Areas (Education vs. Criminal Justice) on Fidelity? .....
  Questions 1, 1a, 1b: Implications ................
  Question 2: To What Extent is the Fidelity of Program Implementation Related to Program Effectiveness? ......
  Question 2: Implications .....................
  Question 3: What is a Useful Definition, and What is a Useful Typology, for the Concept of Reinvention? ......
  Question 3a: Are There Differences in Reinvention Among the Sample Programs? ................... 102
  Question 3b: Are There Differences Between the Educational and Criminal Justice Policy Areas on Reinvention? ..... 102
  Questions 3, 3a, and 3b: Implications ............. 102
  Question 4: What are the Relationships Among Fidelity, Reinvention, and Effectiveness? ............... 103
  Question 4: Implications ..................... 104
  Future Research .......................... 107
APPENDICES
  A. Examples of Components and Variations ............ 110
  B. Descriptive Analyses of Reinvention Data .......... 112
  C. Descriptive Analyses of Effectiveness Data: Types of Data, Summary Statistics, and Comparisons to Demonstration Sites ...................... 116
REFERENCES ................................ 130

LIST OF TABLES

Table
1. Summary of Methods ......................... 60
2. Descriptive Statistics: Fidelity ................ 65
3. Descriptive Statistics: Reinvention ............... 77
4. Modification and Addition by Program: Descriptive Statistics .............................. 79
5. Rank-Ordering of Partial Correlations with Effectiveness ... 93

LIST OF FIGURES

Figure
1. Innovative Social Programs Selected for Study ......... 25
2. Typology and Extent of Reinvention ................ 47
3. Program Effectiveness Criteria .................. 56
4. Mean Average-Item Fidelity Scores ................ 68

INTRODUCTION

Modifying the Classical RD&D Approach

The present research explores the viability of the "modified" Research, Development, and Diffusion (RD&D) approach as a vehicle for social innovation. The classical RD&D model (Havelock, 1976; House, Kerins, & Steele, 1972) was used as both a research paradigm and a model for policy making. Those who used the model assumed that social innovations should be developed through systematic research performed at laboratories which specialized in RD&D, and should be evaluated using both formative and summative methods prior to dissemination to user sites. This classical model was popular among federal policy makers in the 1960's, and was partly inspired by the success of federally-sponsored RD&D related to space exploration (House, 1981).
In transferring the classical model to social programming, potential innovation adoption sites such as school districts, municipal departments, and local-level social agencies were assumed by researchers and policy makers to value evaluation results highly, to make decisions according to specified goals, and to act as relatively passive consumers in the dissemination process. It was reasoned that if an innovation was demonstrated to be effective through research, disseminating information through printed media would be sufficient to encourage adoption. Implementation of the innovation was assumed to proceed automatically from adoption (Tornatzky, Fergus, Avellar, Fairweather, & Fleischer, 1980).

These latter assumptions were called into question by several studies of educational innovation conducted in the late 1960's and throughout the 1970's (Berman & McLaughlin, 1978; Farrar, DeSanctis, & Cohen, 1979; Fullan & Pomfret, 1977; House, 1975; House, 1981). This body of research presented evidence that sites were not at all passive receivers of innovations; instead, a myriad of organizational factors were uncovered as potent influences on the extent of program implementation. These included the extent to which local decision makers mobilized broad-based support, used a "problem-solving" rather than "opportunistic" mode of decision making, and planned ahead for implementation (Fullan & Pomfret, 1977).

While these educational implementation studies were being conducted, Rogers and his colleagues (Eveland, Klepper, & Rogers, 1977; Rogers, 1978) were studying the implementation of several other federally sponsored social innovations, including the GBF-DIME computer-based information system and the Dial-A-Ride transportation program. These researchers were struck by the degree to which the innovations were "reinvented" by sites to fit their specific needs and to provide a sense of innovation "ownership."

In addition to the studies of educational innovation and the work of Rogers and his associates, a third source of influence leading to modification of the classical RD&D model was the theoretical work of March, Simon, and their colleagues (March & Simon, 1958; Cyert & March, 1963). These researchers argued that organizational decision criteria were usually "satisficing," rather than "maximizing."
The Social Interaction approach, developed primarily by rural sociologists (Ryan & Gross, 1943; Rogers & Shoemaker, 1971), emphasized the social relationships between disseminator and adopter, and the importance of reference group identification (i.e., that the disseminator be respected as an "opinion leader"). The Problem-Solving approach, utilized primarily by organizational development theories and practitioners (e.g., Lippit, Watson, & Westley, 1958) stressed diagnosis of the adopter's needs and maximizing the use of the adopter's resources (both personal and organizational) in the innovation process. A hallmark of this approach was that change initiated by the adopters themselves has the best prospects for long-term maintenance. Havelock and his associates synthesized these three perspectives to produce the 4 "linkage" model of innovation dissemination. This approach attempted to retain the best features of each perspective, keeping the systematic development and evaluation aspects of the R080 model, while adding the interpersonal and interorganizational emphasis of the Social Interaction approach, and the attention to adopter needs and organizational processes stressed by the Problem-Solving approach. Thus the linkage model can be viewed as a modification of the classical RDBD perspective. Although these four bodies of research (the empirical educational innovation and federal-program reinvention studies, and the empirically-based theoretical works of March and Simon and Havelock) can be viewed independently, their historical interrelationship is clear. Havelock's work was widely read among social scientists, and contributed to the conceptual frameworks utilized in the educational innovation and reinvention studies, while Havelock's thinking was shaped somewhat by the ideas of March and Simon. These various bodies of research were influential at the federal policy level in the modification of the classical RD&D model (Datta, 1981). For example, the Office of Education's National Diffusion Network, established in 1974, was designed to utilize "state facilitators" as active change agents to assess the need of local districts, to tailor change strategies to the district's political climate, and to foster local support for the disseminated innovations (Emrick, Peterson, & Agarwala-Rogers, 1977). In addition, local- districts were encouraged to work cooperatively with research groups in developing and evaluating their own site-generated innovations. 5 These would be eligible for dissemination under NDN auspices if demonstrated to be effective. The NDN modifications of the classical RD&D approach were directly inspired by Havelock's work (Raizen, 1979). Also during the early 1970's, a similar program for encouraging site-generated social innovations (the Exemplary Projects Program) was established by the Department of Justice's Law Enforcement Assistance Administration (LEAA). Although also stressing the importance of interpersonal contact through site-visits, the LEAA did not establish as extensive a network of linkage agents when compared to NDN. ‘ In sum, by recognizing that practitioners and local decision makers were likely to have greater understanding of the political realities and organizational nuances involved in successful implementation when compared to researchers or centrally-located bureaucrats, federal policy makers thus modified the classical RDBD model. 
The attempt by the Office of Education to establish a "Diffusion Network" with strong, relatively long-lived interorganizational and interpersonal ties, and the emphasis of the Exemplary Projects Program on site-visits to innovation developers showed further recognition of the limitations of the classical model. However, other elements of the RD&D model were retained, such as the development of programs using scientific research and evaluation, with funding initially channeled to specific development sites. Fidelity and Adaptation y Although these modifications were grounded in research, the modified RDBD model has not gained the same level of acceptance as the classical approach of the 1960's. Indeed, a number of writers 6 in the innovation area have argued for abandoning the RD&D model altogether, in favor of a more decentralized, local problem-solving approach (Berman, 1981; House, 1974). The field of social innovation policy research can thus be seen as divided into two opposing camps: "pro-fidelity" vs. "pro-adaptation" researchers (Fullan & Pomfret, 1977). The former conceptualize innovations as consisting of a number of relatively well specified components. Those championing fidelity argue that rigorously developed and evaluated programs should be implemented with close correspondence to the validated models or else suffer the consequences of "dilution" at adoption sites (Borden & Gomez, 1977; Calsyn, Tornatzky, & Dittmar, 1977). Dilution is expected to lead in most cases to reductions in outcome effectiveness. On the other hand, "pro-adaptation" researchers and practitioners argue that differing organizational contexts and practitioners needs demand on-site modification, virtually without exception (Berman & McLaughlin, 1978; House et a1., 1972). For example, according to Gephart, A specific product or procedure is developed for a particular purpose or function...(but)...typically, purposes or functions differ from setting to setting... (and)...although the ideal system would be one which had the needed number and types of components universally required...we seldom know enough in a design effort to create all the component parts. (1976, pp. 5-6) The roots of the "pro-fidelity" and "pro-adaptation" positions in the previous literature and practice are clear. The fidelity orientation corresponds to the RDBD approach, while the adaptation position is rooted primarily in the Problem-Solving tradition. It 7 would seem that the Social Interaction viewpoint is neither inherently pro-fidelity nor pro-adaptation, but does question two pro-fidelity assumptions; specifically, that the adopter is a "passive consumer," and that the innovation product, rather than the innovation process, is the most useful focus of attention. Although several additional frameworks for organizing the innovation literature have been devised (e.g., House, 1981; Yin, 1978), they follow a remarkably similar pattern. Each identified an RDBD-type approach as one perspective, and outlines additional perspectives which are either directly antagonistic to this model or offer alternatives which may co-exist with the RD&D approach, while questioning several of its basic assumptions. Although these approaches differ in the extent to which the RDBD assumptions are questioned, all hold similar implications for public policy. For example, one implication is that the freer users are to adapt programs to their local needs, the more likely they are to adopt programs which last. 
A second implication is that the more the program is modified to suit the site, the more likely it is to achieve the outcomes desired by users. An even more radical implication of these perspectives is that instead of channeling initial program development funds to specific developer sites, funding should instead be devoted to building the capacities of the local sites to develop innovations independently. Middle Ground Positions In a recent article Berman (1980) has considerably advanced the fidelity-adaptation debate by proposing a normative contingency model for implementation strategy. Although the model was developed for 8 policy implementation, it applies equally well to program implementation. This contingency model implies that different strategies for implementation are most appropriate for different situations. According to Berman, There is no universally best wa to implement policy. Either programmed (pro-fidelity) or adaptive implementation can be effective if applied to the appropriate policy situation...Policy situations are often so complex that a mix of programmed and adaptive strategies might be more effective than a simple choice between the two. (p. 206) Berman suggested five situational parameters to be considered when designing an implementation strategy: (a) scope of change (incremental or major); (b) certainty of technology or theory; (c) amount of conflict over policy goals and means; (d) structure of the institutional setting (tightly vs. loosely coupled); and (e) the environment's stability. He argued that relatively structured conditions support the use of programmed (pro-fidelity) approaches, while unstructured situations imply the use of adaptive strategies. Thus, rather than taking a dogmatic pro-adaptation position, Berman has outlined a sensible middle ground. Another recent shift from a pro-adaption to a middle ground position has been taken by House (1981), who argued that "a truly comprehensive (innovation) strategy would view the (innovation) situation from all three perspectives" (p. 39). The three perspectives identified by House are the technological (RD&D), the cultural (similar to Havelock's Problem-Solving perspective), and the political. (The latter approach has some similarity to the Social Interaction approach. However, by focusing on the conflict and negotiation aspects of the innovation situation in his description of the political perspective, 9 House distinguished his categorization from the previous authors.) House noted that, These different frameworks...set limits as to what is considered useful inquiry...They limit the very language and concepts employed in the discussions and thereby give a certain value slant. (pp. l9-20) However, House goes on to argue that the three positions will continue to coexist in research and practice, since each has a real constituency: ...the technological perspective represents the interests of those who sponsor innovation; the cultural perspective, the interests of those who are "being innovated"; and the political perspective, the negotiation of these interests... It is significant that the three perspectives reflect the viewpoints of dominant societal institutions. These viewpoints have already been institutionalized within the academic disciplines such as economics, engineering (technological), political science and sociology (political) and anthropology (cultural)...Can one of these perspectives be "proved correct"?...It would not seem so. 
Each perspective focuses on different aspects of reality, and in fact values the same aspects differently. (p. 40) Continuing with House's line of reasoning, both the political and cultural perspectives imply the inevitability of adaptation. If organizations are truly composed of various factions and sub-cultures, each with different political interests and values, some adaptation is necessary to resolve conflicts. Yet recognition of the complexity of innovation processes need not obviate the modified RD&D model. The question remains: Must adaptation prevent the implementation of a social program with reasonably high fidelity? The path to an empirical solution to this question was outlined by Hall and his colleagues (Hall & Loucks, 1978). Taking yet a third middle-ground position, Hall and Loucks argued that adaptation was acceptable up to a measurable "zone of drastic mutation", beyond 10 which the innovation lost its integrity. Therefore, the issue became empirically focused on measuring how much of what kinds of adaptation have taken place. Following this line of thinking, adapting a program to better fit its organizational context need not be anathematic to the modified RD&D approach. However, despite the clear good sense of the middle-ground positions taken by Berman, House, and Hall, few decision-makers have heeded their advice. Instead, as Berman notes, "advocates on both sides seem to be throwing down the gauntlet" (p. 206). The Need for Measurement Development Although the pro-adaptation position has attracted an increasing number of adherents in recent years (Datta, 1981), a close examination of the principal studies used in its support shows that its foundations are somewhat tenuous. For example, the widely-cited RAND report on federal programs supporting educational change (Berman & McLaughlin, 1978) found three dominant patterns for implementation: mutual adaptation (when both project and setting were changed); cooperation (when "the staff adapted the project...without any corresponding changes in traditional institutional behavior or practices"); and non-implementation. The RAND researchers reported that "mutual adaptation was the only process leading to teachers change," and "had a better chance of being effectively implemented" than coopted projects. In addition, they reported a striking absence of high fidelity adoption. In their words, ‘ A fourth process, which we call technological learning, represents a situation in which the staff would acquire skills in using a new educational method without adapting the method to the reality of the user's setting. At the 11 extreme, "teacher-proof" packaged materials assume such implementation. However, we did not observe any real instance of technological learning. Instead, we found that even highly technological or prescriptive projects were either modified to suit local needs and interests or were implemented in a superficial manner that destined project materials for the schoolroom storage closets. However, a closer look at the RAND methodology reveals the absence of any bona fide measure of program fidelity. The RAND researchers used as their implementation outcome measure "the extent to which projects met their own goals, different as they might be for each project" (Berman & McLaughlin, 1977, Vol. VII, p. 50). Therefore, their implementation measure was quite imprecise, and was biased to reflect adaptation, rather than fidelity. 
There is no way to determine from these results to what actual extent programs were modified or what components were changed. Additional doubts concerning the RAND conclusions were raised by Datta (1981), who noted that the "programs" examined were for the most part loosely defined policy statements, rather than highly specified social programs. Precision and Accuracy in Measuring Fidelity In short, reviews of the implementation literature (e.g., Scheirer & Rezmovic, 1982) have suggested that considerable attention to measurement development is presently required to advance the state of implementation research. Refinements in both the precision and accuracy of implementation measures are needed. The present research thus focused a good deal of effort on measurement development. With regard to obtaining precise specifications of innovation parameters in order to adequately operationalize fidelity of implementation, the present study followed the pioneering efforts of Hall and his associates (Hall & Loucks, 1978; Hall & Loucks, 1981; Heck, Stiegelbauer, 12 Hall, & Loucks, 1981). Their basic methodology involves identifying program components through extensive interviews with developers and users, and reviewing written materials concerning the innovation. Variations (different ways of implementing components) are then identified and scaled as "ideal," "acceptable," or "unacceptable." Interviews, observations, and examinations of documents may then be used to determine which variations are implemented at sites. Specific patterns of variations (configurations) may also emerge from this process. This methodological approach offered much promise and had begun to be utilized by contemporary researchers in the field of educational innovation (Crandall, 1979; Owens & Haenn, 1977). The basic approach was thus employed in the present study, with some modification to accommodate the scope and purpose of the research. The development of measurement accuracy, as well as precision is of critical importance to progress in this research area. Although both interviews and observations have been utilized by Hall and his colleagues to measure implementation, they recently reported that "to date no formal study of the reliability between checklist data (concerning which variations are implemented) obtained through interviewing and checklist data obtained through observations has been conducted" (Heck et al., 1981). In Scheirer & Rezmovic's more recent review covering 74 studies of program implementation, 55 studies were identified which used multiple measures of implementation, with interviews and questionnaires being the predominant types of measures. 13 However, Scheirer and Rezmovic found that, Of the 55 studies in which multiple implementation measures were taken, 34 studies (62%) did not present any comparative information on the extent to which data obtained by different methods were in agreement...Twenty-one of 55 studies (38%)...did compare findings from the different measures. However, the comparisons were as often qualitative and judgemental as they were quantitative. Further, biases were not necessarily reduced by the use of multiple measurement techniques...Based on the available data, we cannot make conclusions about the relative usefulness] meaningfulness/validity of data obtained by different measurement techniques. (pp. 
The present study therefore devoted attention to this issue by constructing two forms of each measure: a telephone interview form, administered in telephone interviews; and a site-visit form, which involved site-visit interviews, observations, and examinations of archival data. Accuracy in the measurement of fidelity was also assessed by (a) checking inter-rater reliability periodically during both the telephone interview and site-visit data collections; and (b) checking agreement among multiple respondents (telephone interview) and comparing data from multiple information sources with the consensus ratings of the research team at the site, as sketched below.
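To make these accuracy checks concrete, the following minimal sketch (in Python) computes two agreement statistics of the kind involved: percentage agreement on component codes, and the Pearson correlation between sets of scores. The coding scheme and the sample values are purely illustrative, not the study's data.

    import statistics

    def percent_agreement(codes_a, codes_b):
        # Proportion of components on which two raters (or two data
        # collection modes) assign the identical variation code.
        matches = sum(1 for a, b in zip(codes_a, codes_b) if a == b)
        return matches / len(codes_a)

    def pearson_r(x, y):
        # Pearson product-moment correlation between two score lists.
        mx, my = statistics.mean(x), statistics.mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    # Hypothetical variation codes for eight components at one site
    # (2 = ideal, 1 = acceptable, 0 = unacceptable).
    telephone = [2, 1, 2, 0, 1, 2, 2, 1]
    site_visit = [2, 1, 1, 0, 1, 2, 2, 2]

    print(percent_agreement(telephone, site_visit))    # 0.75
    print(round(pearson_r(telephone, site_visit), 2))  # 0.74

High values on both statistics would support treating the less expensive telephone instrument as a reasonable stand-in for site-visit observation.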
Is Fidelity Related to Program Effectiveness? A second important assumption of the modified RD&D model is that program fidelity is related to program effectiveness. It is assumed that the more an implemented program resembles the original "validated" implementation, the greater the likelihood that effectiveness outcomes achieved at the original site will also be achieved at the user site. Although Scheirer and Rezmovic's review (1982) reported results supporting this assumption, the sample of studies was too small to 16 permit generalization (only 11 studies measured both extent of implementation and program effectiveness). Also, Scheirer and Rezmovic did not attempt to isolate programs disseminated using the modified RDBD approach. Therefore, the verification of this assumption has yet to be demonstrated empirically. The examination of this assumption was addressed by the second research question: 2. To what extent is the fidelity of program implementation related to program effectiveness? Reinvention As noted above, "reinvention" was introduced by Rogers and colleagues (Eveland et al., 1977; Rice & Rogers, 1980) to capture the flavor of an active process of change at user sites. The term "reinvention" brings to mind the phrase "Not Invented Here," a common phrase used in both public and private sector organizations to describe the rejection of outsiders' ideas simply because they originated outside the organization. Such ideas must be "reinvented" to counter the "Not Invented Here" syndrome. However, despite the potential usefulness of the term "reinvention," the research by Rogers and his associates may not be generalizable to modified RD&D innovations, since the programs examined by Rogers and his colleagues were disseminated with low component specificity and explicitness (that is, the specifications of components were relatively sketchy and incomplete, and the components were not disseminated in explicit, "concrete" terms). Such programs may behave quite differently from programs which are more "well-in-hand" (Gephart, 1976). It is therefore fruitful to consider what the concept of 17 reinvention may add to the conceptualization of RD&D innovations. Perhaps the concept of fidelity alone more parsimoniously accounts for the salient phenomena (Taylor, 1980), and reinvention is simply an unnecessary synonym for low-fidelity implementation. Alternatively, there may be a need for a concept in addition to "fidelity" to accurately describe implementation. Rather than attempting to define the concept a priori, the strategy used in the present study was to collect case study notes on every variation that differed in any way from the variations listed in the fidelity instrument. These qualitative data were later content-analyzed to determine the most comprehensive and meaningful definition of reinvention. Content analysis was also used to categorize instances of reinvention and determine the frequency of occurrence of different types of reinvention. This empirically based examination of the "reinvention" concept can be summarized as an attempt to answer the following research question: 3. Given the present innovation literature and the data base from the present study, what is the most useful definition of reinvention, and what is the most useful and accurate typology of reinvention as it is practiced by adopters of modified RD&D programs? In parallel to the research strategy outlined for fidelity, secondary questions are: 3a. 
3a. Are there differences in extent of reinvention among the sample programs?

3b. Are there differences between the educational and criminal justice policy areas on extent of reinvention?

Relevance of Reinvention to the RD&D Approach

Once reinvention has been defined in useful terms as a distinct concept, the relationship between reinvention and other concepts becomes a meaningful issue. As stated above, one assumption of the RD&D model is that program fidelity is positively related to program effectiveness. Understanding the relationships among reinvention, fidelity, and effectiveness also has bearing on the modified RD&D model. The specific relevance will depend on the definition of reinvention selected as a result of content analysis; however, there should be implications for the model no matter what definition is selected. For example, the empirical relationship between "non-component-based" changes and program effectiveness would have implications for dissemination policies, concerning whether or not such changes should be encouraged. The possibility that reinvention functions as a mediating variable in a causal model is also worthy of consideration. In other words, the relationship between fidelity and effectiveness may depend on the extent of reinvention (however defined). In that case, administrators and practitioners might well consider policies towards reinvention based on this relationship. In light of these considerations, the fourth research question was:

4. What are the relationships among fidelity, reinvention, and effectiveness?
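One way to probe such a mediating role is with first-order partial correlations: if reinvention (R) carries part of the relationship between fidelity (F) and effectiveness (E), the F-E correlation should shrink when R is partialled out. The standard formula, a textbook identity rather than anything specific to this study, is

$$ r_{FE \cdot R} = \frac{r_{FE} - r_{FR}\, r_{ER}}{\sqrt{(1 - r_{FR}^{2})(1 - r_{ER}^{2})}} $$

where each correlation on the right is a simple Pearson correlation among the three sets of scores.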
Summary of the Research Questions

In summary, this research is an empirical examination of the modified RD&D approach to the dissemination of innovative social programs. The research questions which guided the study are the following:

1. Given a relatively precise and accurate operationalization of program fidelity, to what extent are social program innovations which are developed and disseminated according to the modified RD&D approach actually implemented with fidelity at adopting sites?

1a. Are there differences between the fidelity of the specific programs chosen for study in the present research?

1b. Are there differences in fidelity between a sample of educational innovations vs. a sample of criminal justice innovations?

2. To what extent is the fidelity of program implementation related to program effectiveness?

3. Given the present innovation literature and the data base from the present study, what is the most useful definition of reinvention, and what is the most useful and accurate typology of reinvention as it is practiced by adopters of modified RD&D programs?

3a. Are there differences in reinvention among the sample programs?

3b. Are there differences between the educational and criminal justice policy areas on reinvention?

4. What are the relationships among fidelity, reinvention, and effectiveness?

METHOD

Overview

In order to provide an examination of the modified RD&D approach, the present research involved the development of fidelity and reinvention measures, and the use of these measures to collect data on innovation implementation. Data were collected in two phases: a telephone interview phase, and a site-visit phase which utilized observations, interviews, and reviews of archival data at the implementing sites. During the site-visit phase, data were also collected on program effectiveness. Existing instruments and records were the sources for these data. These included instruments with established reliability and validity (such as standardized achievement tests) and various archival measures (such as recidivism, and indices of organizational efficiency such as a juror usage index). The report of research methods begins with a description of the social programs which were studied and the sampling strategy which was utilized. Following these descriptions, the fidelity and reinvention measures are described, covering both measurement development and data collection procedures.

Innovative Social Programs and Vehicles for Their Dissemination

In general, the term "innovative social program" refers to new ways of doing things which are primarily intended to change people
Programs may be proposed for consideration by the operating agency, local government or criminal justice planning unit, State Planning Agency or Law Enforcement Assistance Administration (LEAA) office. For the period of May, 1976 to June, 1980, the major active dissemination mode for the Exemplary Projects Program was the site visit to the developer's program. Such site visits were managed by an auxiliary Justice Department program (the "Host Program"). This program arranged travel logistics and proVided per diem and travel expenses to site visitors. In addition to these active dissemination methods, both NDN and the Exemplary Projects 23 Program use printed material extensively. NDN publishes a catalog (listing over 150 programs. The catalog is cross-referenced and fully indexed, and contains a complete list of State Facilitators. The Exemplary Projects Program publishes a similar catalog listing 35 programs, as well as a detailed manual for each innovation. Selection Criteria. In order to select a subset of the many NDN and LEAA programs for study, the following criteria were used: 1. Potential availability of effectiveness data at sites (to enable data analyses involving program effectiveness). 2. Potential for at least 20 site adoptions per program (to provide sufficient statistical power to detect significant relationships). The programs were required to have been disseminated long enough to allow for implementations of at least two years, with a sufficiently extensive operation which could result in 20 adoptions. 3. "Organization-wide" quality of the program. This criterion was required since the research issues concern organizational rather than individual innovation implementation. Subcriteria included: (a) the program could not be implemented by only one staff member (teacher, caseworker, judge, etc.) or one organizational subunit (classroom, single courtroom in a multicourt system, single neighborhood group in an organization of neighborhood associations, etc.); and (b) the program should require some relationships between the implementing organization and its surrounding community. These criteria were applied by a team of seven researchers. Materials for each innovation disseminated by the NDN and the Exemplary Projects Program were read by two individuals and independently rated 24 on the selection criteria. Ratings were then discussed by the entire group. This procedure resulted in the selection of seven innovations, three from the NDN list and four from the Exemplary Projects Programs. Description of Programs Figure 1 contains descriptions of the seven innovative programs selected for study. M Sampling Strategy In order to maximize the external validity of the research results, an attempt was made to randomly sample from the population of organizations appropriate to each of the innovative programs. Lists were obtained from national research offices which contained the populations of the appropriate organizations (schools, courts, and police departments). Three percent samples were randomly generated from these population lists, and these samples were in turn randomly sampled to produce a test sample of 100 organizations. Telephoning these 100 organizations resulted in the identification of only two organizations which claimed to have adopted any of the programs; only eight of the organizations in the sample had even heard of the innovations. It thus became apparent that this sampling strategy would not efficiently yield a sufficient number of adopters. 
Random sampling was abandoned, and a purposive strategy was employed. Lists were obtained from the program developers and related agencies (e.g., state planning agencies, NIJ-Hosts Programs, Center for Jury Studies). These lists were randomly sampled until approximately 15-30 25 Education 1. HOSTS (Help One Student to Succeed)--A diagnostic, prescriptive, tutorial reading program for children in grades 2-6. Tutors are community volunteers and cross-age students. The program includes “pulling out" students from their regular classes at least B h0ur per day. 2. EBCE (Experience Based Career Education)--This program provides career experi- ence outside of school at volunteer field sites for the student. Each career site is systematically analyzed for its educational potential. Students' career and academic abilities and interests are systematically assessed. Individualized learning plans which integrate career experiences and academic learning are utilized. Programs typically take students from grades 11-12, although some also accept students from 9-10. 3. FOCUS (Focus Dissemination Project)--A “school within a school" for disaffected junior and senior high school students. All students are required to partici- pate in a support/problem solving group of 8-10 students and one teacher. Behavioral contracting and a governing board with student representatives are important features. Classes in the Focus program involve individualized, self- paced instruction. Criminal Justice 4. ODOT (One Day/One Trial)--A jury management system that calls in a certain number of potential jurors per day. Potential jurors come in for that day and if not selected to serve in a trial have completed their obligation. Jurors who are selected serve the length of the trial. 5. CAP (Community Arbitration Prgject)~Juvenile offenders are sent to a formal arbitration hearing run by the court intake division, rather than to courts. Juveniles have the specific consequences of their actions explained to them with parents and victims frequently present at hearings. Youths are then typically given a number of hours of informal supervision usually involving work in the community. Restitution is also frequently required. 6. SCCPP (Seattle Community Crime Prevention Program)—-This program is a three phase attack at residential burglary. It involves the setting up of a neighbor- hood block watch through proactive targeting of neighborhoods, property marking and inventory, and home security inspections. 7. MCPRC (Montgomery County Pre-Release Center)--Involves the setting up of a residential facility separate from the prison. This facility should be in the community from which most of the inmates are drawn. Inmates are encouraged to work so that they will have a job when they are released. Counseling, social awareness instruction, and behavioral contracting are also part of this pro- gram. Figure 1. Innovative Social Programs Selected for Study 26 adopters were identified for each program. (The original goal of 20 programs for each innovation could not be achieved since three programs had fewer than 20 total adopters. Consequently, the number of sites was increased for other innovations to maintain a total N of 140 adopting sites.) This was the sample used for telephone interview data collection. Following the telephone interviews, a subsample of the organizations interviewed were selected for the site visits. 
Ten organizations from each of the seven innovations were chosen to be site-visited, resulting in a site-visit sample size of 70 organizations. Two criteria influenced this subset selection process. The most important criterion required selecting organizations that exhibited a range of fidelity scores, which were calculated from the telephone data. Thus, for each innovation, three organizations were selected from more than one standard deviation above the mean within-innovation fidelity score, three from more than one standard deviation below it, and four from the mid-range. This resulted in ten sites that varied from high to low on fidelity. The second criterion was the location of the site. A broad geographic distribution was sought for the sample to maximize generalizability.

Unit of Analysis

The unit of analysis was the organization in which the program was housed. In some cases, this differed from the organization which made the adoption decision, since implementation in these cases entailed creating a new organization or subcontracting to another agency. For example, in one case a crime prevention program was adopted by a police department and later moved to the town's Bureau of Neighborhood Associations; in several cases, alternative schools were created to administer and house Experience-Based Career Education and FOCUS programs. Note that schools, rather than districts, were considered to be the units of analysis in education. Preliminary discussions with program disseminators revealed that programs were truly "implemented" at the school level; within-district differences between implementations were likely to exist.

Respondents

Respondents for the telephone interviews were persons who were identified as "most familiar with the day-to-day operations of the program" by organizational gatekeepers (e.g., secretaries and clerks), and who proved to be familiar with operations after preliminary interview responses. During the site-visit data collection phase, additional respondents were interviewed and observed. An attempt was made to include at least one respondent from each relevant class of actors at each site. For example, site visit data collection for the Experience-Based Career Education program involved interviews and observations with students, aides, secretaries, teachers, counselors, resource people ("employers"), and school administrators.

Measurement of Implementation Fidelity

Measurement Development for the Telephone Interview Instrument

The five-step approach for developing a fidelity instrument proposed by Hall and Loucks (1978) was utilized, with several modifications to suit the scope and purpose of the study.

Preliminary Identification of Innovation Components. Hall and Loucks (1978) found that innovation developers and users had differing opinions concerning an innovation's components. Further, Leithwood and Montgomery (1980) noted that the vested interests of different organizational roles influenced judgments concerning the innovation's components, and they suggested interviewing developers, administrators, change agents, and practitioners to get the most complete and accurate list of program components. However, the purpose of the present study involved testing the viability of the modified RD&D model, and thus required a comprehensive description of the innovation as disseminated, rather than a comprehensive description of the innovation in practice.
It was therefore decided to limit the sources for component identification to those individuals who were involved with the program before it had an opportunity to be modified or reinvented at adopting sites. Although this did not result in a complete description of the innovation as it is actually used at implementing sites, it did obtain the most accurate picture of the innovation as it was originally researched, developed, and disseminated, prior to modification and/or reinvention. Thus, the sample of respondents for component identification was limited to staff members of the developing organization and users and administrators at the original site or an "initial adopter" site.

Each developer organization and original or initial adopter site was visited by two members of the research team. Several staff members, users, and administrators were interviewed for each innovation. Interviews were tape-recorded and content-analyzed to identify components. This protocol represents an extension of the strategy proposed by Hall and Loucks (1978). The protocol had been pilot-tested prior to visiting the innovation developers, by interviewing a program developer at a local social service agency.

All written materials and tapes were independently content-analyzed by two researchers for each innovation to identify components. The components were selected to conform to the following criteria: (a) preferably, the component was an observable activity, material, or facility (if not observable, the implementation of the component had to be verifiable through interviews with staff members and clients of the implementing organization); (b) the component was logically discrete from other components and, wherever possible, did not depend on the implementation of other components; (c) the component was "innovation-specific"; practices which were common to other programs in the organization were not considered components; and (d) the list of components exhaustively described the innovation. Following identification of components, each researcher also attempted to group the components in the most heuristic categorization scheme possible.

Following the independent content analyses for each program, a third researcher joined each original pair of researchers to arbitrate disagreements. Thus, for each innovation, three researchers reviewed components to maximize conformity to the criteria. This procedure resulted in a list of components for each innovation, with components grouped in heuristic categories. (Examples of such categories include "Assessment and Planning," "Training," "Staff-Organization Relationships," "Community Involvement," "Staff Functions," "Materials," etc.)

Preliminary Identification and Scaling of Variations. The methodology pioneered by Hall and his associates for measuring implementation requires the identification of "variations" for each of the innovation's components. These variations are scaled as "ideal," "acceptable," or "unacceptable." Thus, fidelity is not measured simply by the number of components implemented at the user site, but instead can be represented by a "fidelity score" which reflects the dimension of component variation at the site. Hall and Loucks (1978) recommended interviewing approximately 10-20 individuals with different role positions at different user sites in order to identify variations of components. However, the scope of the present study (seven innovations and 15-25 sites per program) and consequent resource constraints prevented the use of this strategy.
Instead, it was decided to have those researchers who had visited the original innovation sites generate variations, with subsequent additions and modifications to be made based on pilot interviews and interviews with the innovation developers. In generating variations, the researchers attempted to list discrete, observable, and quantifiable alternatives. Variations which could not be observed were required to be verifiable through interviews with staff members and clients of the implementing organizations. Although generation of at least one midpoint ("acceptable") variation for each component was attempted, a number of components were dichotomous in nature, and creating a midpoint value would have been unrealistic. Thus, some components had only two variations: an ideal/acceptable variation and an unacceptable variation. For example, the HOSTS reading program disseminated use of a specific cross-referencing index. Use of any other index, or use of no index, was clearly unacceptable. As another example, the Seattle Community Crime Prevention Program required a highly proactive staff approach. Consequently, block watch meetings were scheduled by staff. Scheduling of these meetings by any other person (e.g., block residents or community leaders) was unacceptable. Following identification of variations by researchers, program developers were interviewed to verify that these were indeed realistic ways of doing the programs. The procedure for these interviews is discussed in the next section.

Feedback Interviews with Developers. In order to check the accuracy of the researchers' preliminary identification of components and variations, the staff members of developer organizations who had been interviewed previously were recontacted. This second contact involved sending two lists to each staff member: a list of components and a list of variations. The respondents were instructed to review the two lists independently. When reviewing components, respondents were instructed to consider whether each component was or was not "relevant for saying that the program has been implemented." The innovation variations generated by the research team were reviewed by developers with the following questions in mind (regarding each component-specific set of ideal-to-unacceptable variations): "Are these variations realistic? Do they describe the possible implementations of my program completely, or are there other important variations which should be included? Are the researchers correct in their labeling of variations (as ideal to unacceptable)?" These instructions and lists were sent to the individual primarily responsible for developing and/or evaluating the innovation at each developer organization. Four of these organizations (two each for education and criminal justice innovations) were sent duplicate lists to be reviewed by additional staff members.

After the lists were reviewed by developers, these individuals were interviewed by telephone. During these interviews, each component and each variation was reviewed by the interviewer, and responses were solicited. The feedback of developers concerning the preliminary identification of components and variations was thus obtained, and appropriate modifications and additions were made to the lists. In sum, the researchers' identification of components and variations, followed by the feedback interviews with developers, resulted in a list of components and scaled variations for each innovation. These lists comprised the telephone interview fidelity instrument.
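To make the structure of such an instrument concrete, the following sketch (Python) shows one way a component list with scaled variations might be represented. The category, component, and variation wordings here are invented for illustration and are not taken from the actual instruments.

```python
# A hypothetical fragment of a fidelity instrument: components grouped in
# heuristic categories, each with variations scaled as ideal (2),
# acceptable (1), or unacceptable (0). Dichotomous components have no
# midpoint variation. All wordings are invented.
instrument = {
    "Training": [
        {
            "component": "Tutor training workshop",
            "variations": {
                2: "All tutors complete the full developer workshop",
                1: "Tutors receive an abbreviated local workshop",
                0: "Tutors receive no formal training",
            },
        },
        {
            "component": "Cross-referencing index",  # dichotomous component
            "variations": {
                2: "The specific index disseminated by the developer is used",
                0: "Another index, or no index, is used",
            },
        },
    ],
}

for category, components in instrument.items():
    for c in components:
        print(category, "|", c["component"], "|", sorted(c["variations"]))
```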
Examples of components and variations appear in Appendix A.

Pilot Tests. The instrument was pilot-tested on thirteen adopter sites. One innovation had a total of only eleven adopters, and therefore only one pilot interview was attempted for it. The remaining six innovations had two pilot interviews per innovation. Sites selected for piloting were adopters which had implemented the innovations for less than two years, and thus had not had an opportunity to achieve full implementation.

The general procedure for pilot testing involved a pair of researchers. One researcher administered the telephone interview and coded responses, while the other listened, coded responses, and made notes concerning improvements which could be made in the interview process. Following the interview, the two researchers compared their coding results and discussed disagreements. Protocols for coding difficult items were developed during this period, and the interviewing team (consisting of five researchers) met periodically to review these protocols.

Data Collection for the Telephone Interview Instrument

The telephone interview fidelity instrument was administered by means of a semi-structured telephone interview. The rationale for using this method was the following: It was anticipated that respondents would be aware of the program developers' attitudes concerning the way the programs "ought" to be implemented, and that respondents would wish to appear to be high-fidelity implementers. Although interviewers intended to inform respondents that they were "not being evaluated," it was expected that some skepticism and mistrust would remain. Also, it was anticipated that a long series of closed-ended questions would lead to considerable fatigue on the part of both respondent and interviewer. Consequently, to minimize the effects of evaluation anxiety, social desirability, and fatigue, respondents were asked open-ended questions, first about a category (heuristic grouping) of components, then about the specific components. Responses to these questions were coded on the closed-ended instrument (Appendix A). For example, to obtain specific information concerning the selection and entry procedures used for the FOCUS program, the respondent would be asked to describe selection and entry procedures in his or her own terms. If sufficient information was not elicited concerning a particular component within the category entitled Student Selection and Entry, an open-ended question would be asked for that component (e.g., "Who refers students to your FOCUS program?" rather than "Which of the following refer students to your FOCUS program: Teachers? Administrators? Counselors? Parents?...").

Interviews were administered such that each interviewer collected data on approximately the same number of sites per innovation. Responses were machine-scored for computer analyses. The length of the interview ranged from 45 minutes to four hours. The final N for data analyses was 129 sites.

Reliability and Validity of the Telephone Interview Instrument

Reliability. The reliability of interviewers was measured by conducting ten percent of the interviews with a second researcher listening to the interview, with both researchers coding the data. Reliability was computed as the percentage of exact agreement between coders. The overall reliability figure was .86. Care was taken to counterbalance coder pairs such that 14 of the 20 possible coder pairs were utilized in reliability testing.
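A minimal sketch of this exact-agreement computation (Python; the two coders' item codes are invented):

```python
# Percentage of exact agreement between two coders listening to the same
# interview. Codes follow the fidelity scaling (2 = ideal, 1 = acceptable,
# 0 = unacceptable); the example data are invented.
coder_a = [2, 1, 0, 2, 2, 1, 0, 1, 2, 1]
coder_b = [2, 1, 1, 2, 2, 1, 0, 1, 2, 0]

agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
print(f"exact agreement = {agreement:.2f}")  # 0.80
```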
Validity. A major validity issue was the potential for disagreement between respondents at the same site concerning the fidelity of implementation. This was considered to be a validity issue since it concerned whether or not the respondent was conveying a true picture of the organizational phenomena rather than his or her own perceptions. Although agreement between sources of information may be considered to be a reliability issue, others (e.g., Withey, Daft, & Cooper, 1983) have taken agreement between organizational respondents as a reflection of the external validity of the measure, i.e., the extent to which one may generalize from the results of the particular study (Cook & Campbell, 1979). The extent of agreement was measured by interviewing both a "primary" and a "secondary" respondent at ten percent of the user sites. An attempt was made to select secondary respondents who were of the same job level as the primary respondent and equally familiar with the day-to-day operations of the program. These secondary respondents were usually nominated by the primary respondents. Unfortunately, the interviews revealed that in some cases secondary respondents were not as familiar as the primary respondents with the operations of the programs, and this tended to underestimate the validity figures for the instrument. Given this problem, the validity figures were considered to be acceptably high, with a mean percent agreement (between respondents) of .74.

Measurement Development for the Site-Visit Instrument

The site-visit instrument was intended to be a parallel form of the telephone interview instrument, with guides for obtaining additional data whenever possible. The instrument listed, for each component, the relevant actor(s), key words identifying the component, interviewing probes, observables (activities, actions, materials, and facilities), and item anchors.

Data Collection for the Site-Visit Instrument

This procedure involved two pairs of researchers traveling to the sites selected for the site-visit sample. Each pair visited 35 sites and spent two days at each site. Data collection consisted of interviews with respondents from several role positions at each site, observations of pertinent activities and facilities (e.g., block-watch meetings, arbitration hearings, juror orientations, interactions among teachers, aides, and students, etc.), and examinations of archival records.

Reliability and Validity for the Site-Visit Instrument

Reliability. A procedure similar to that used for the telephone interview instrument was employed. At 13 of the 70 visited sites (19%), both researchers at the site interviewed the same respondents, observed the same activities, and examined the same documents. Forms were coded independently, and results were compared. Using the percentage of exact agreement method, an overall reliability of .81 was achieved. At the sites which were not included in the reliability sample, each researcher interviewed, observed, and examined different data sources. At these sites, just as at the reliability sites, the researchers coded the data independently. However, at the non-reliability sites, researchers discussed their reasons for coding before making final decisions. Thus the best data available to both researchers were used in coding. At the reliability sites, these discussions took place after the two codings were compared, so that a "best" scoring for each component could be determined and recorded.

Validity.
Since site-visit data collection involved interviewing as many respondents as were available at the site (who were familiar with the program's implementation), the site-visit phase provided many more opportunities to check agreements between respondents than the telephone interview phase did. Site-visit data collection also provided numerous opportunities to check for agreement between informants' responses and researchers' observations. Consequently, a different strategy was employed for checking such agreements than in the telephone interview phase. Instead of recording responses from two respondents at 10% of the sites, data-source comparison for the site-visit phase was achieved by computing the percentage agreement between the researchers' consensus ratings and all the various data sources for each component, for all components on which multiple sources of data were available. (Seven thousand and sixty-six out of 9,214 total data sources, or 77%, were multiple sources; i.e., at least two different sources [respondents and/or observations and/or materials] were consulted for coding these components.) The overall percentage agreement between these data sources and the consensus ratings was .96.

Measurement of Reinvention

The intent of this research with regard to reinvention was twofold: (a) to develop a useful, meaningful, and empirically based definition of reinvention which distinguished the term to the greatest extent possible from such related terms as modification and lack-of-fidelity; and (b) to develop a typology of reinvention which could meaningfully and usefully categorize the data set. To this end, an inductive, exploratory methodology was utilized. This consisted of recording each instance of supposed reinvention in case study notes, and then subjecting the notes to content analysis.

Measurement Development

Information gathered during the telephone interviews was used to refine the conceptualization of reinvention to be employed during the site-visits. The following three questions were asked following the coding of each content category of program components: "(a) With regard to the issues we've just been discussing, have we missed anything, or are you doing anything in addition to these activities? (b) Is there anything you are doing in this area that you consider unique or different? (c) Have you changed anything?" These questions were intended to orient research team members to the types of changes which could be made for each program, and to determine to what extent respondents would be willing to discuss changes in the programs. It was felt that the level of detail needed to support a worthwhile content analysis was beyond the scope of the telephone interviews, especially given the considerable amount of time already devoted to fidelity in the interview. Also, the thinking of the research team concerning reinvention was quite primitive at this point, and changed with each discussion of the issues. Consequently, recording data on reinvention to be content-analyzed was reserved for the site-visit phase. During the telephone interview phase, responses to the three reinvention questions were discussed by the researchers, but were not recorded in detail nor content-analyzed.
Site-Visit Data Collection

On the basis of the discussions of responses to the telephone interview reinvention probes, it was decided to tentatively define reinvention, for the purpose of data collection, as "all instances of change in programs which cannot be coded using the fidelity instrument." Implicit in this definition was the intent to employ a variance, rather than a process, methodology (Mohr, 1978). In other words, "instances" of reinvention were identified and analyzed, rather than "events." This was necessary in order to provide consistency with the variance approach used to measure fidelity, and to enable correlational analyses relating reinvention to program fidelity and effectiveness. The two site-visit teams were instructed to probe, in an unstructured manner, all changes which were mentioned by respondents or observed at the sites which could not be coded as ideal, acceptable, or unacceptable variations on the fidelity instrument. Immediately following each site visit, the research teams tape-recorded descriptions of these changes. Following the site visits, these tapes were transcribed into several hundred pages of notes. These notes were then content analyzed.

Content Analysis

Development of Criteria and Procedures. An iterative process was utilized by the two site-visit teams to develop content analysis criteria and procedures. Concepts were tentatively defined; tested for meaningfulness, usefulness, and discreteness; redefined (i.e., by combining some concepts, refining others, and abandoning still others); and then tested again. The process was repeated until the classification system satisfied the criteria of exclusivity, inclusivity, and meaningfulness (Warwick & Lininger, 1975).

The first stage of this procedure involved examining the transcriptions for 12 cases (sites). These twelve cases were selected using two criteria: (a) the case contained many potential instances of reinvention; and (b) the case required interesting and/or difficult content analysis decisions that, through their resolution, would contribute to the development of the analysis scheme (definition and typology of reinvention). For example, a Community Crime Prevention Program site located in a city with very different demographic and geographic conditions from those of the original program model was selected; a Community Arbitration Program site with a complex intake system dependent on interorganizational relationships not found in the original model was also selected.

The first step in the content analysis involved independent reviews of the case notes by each member of the four-person research team. Each individual reviewed three cases (each of which he had visited, and each from a different innovation) and attempted to develop potential definitions and typologies. In attempting to define reinvention, researchers were instructed to first identify instances which they did and did not want to call "reinvention." They were then instructed to articulate the reasons for their decisions. At the same time, they were instructed to generate typologies which could accommodate various concepts already prevalent in the literature (for example, see Rice & Rogers, 1980; Larsen & Agarwala-Rogers, 1977).

Following a team discussion of these initial analyses, the cases were exchanged. During this second stage (and all subsequent stages) the researchers worked in pairs. Cases were assigned to pairs such that one and only one member of each pair had site-visited each case reviewed by that pair.
This enabled each pair to have first-hand knowledge of each case it reviewed. The tasks during this stage were identical to those of the first stage.

Following a discussion of the potential definitions, typologies, and criteria for decision-making which emerged from this stage, a tentative "best" scheme was identified for reliability testing. The third stage of analysis involved each pair using this scheme to analyze four cases. (The eight most difficult cases were used in this stage.) Following this analysis, the team again discussed their decisions and criteria, resulting in a final scheme to be used for data analysis.

Parallel to this four-stage process was a decision process aimed at identifying criteria for determining the boundaries of reinvention "instances." The following example illustrates the complexity of this issue: Several FOCUS (in-school support-group program for disaffected youth) sites were established as Special Education programs, and were required to develop an Individualized Educational Plan (IEP) for each student. This procedure involved a number of different steps (e.g., a team comprising teachers, the district coordinator, school administrators, and parents decides to accept a student into the program; the IEP is developed by this team; the IEP is approved by parents and student; progress towards achieving the goals set in the IEP is reviewed by the team). Each of these steps may relate to one or more components on the fidelity instrument, and, given various rationales, the steps can be separated or combined into various configurations of "instances of reinvention." Thus it can be seen that identifying discrete units of reinvention was a difficult task.

Several factors made agreement between independent judges concerning the boundaries of reinvention instances difficult to achieve. These included different degrees of recollection concerning the specific details of the case, and different biases concerning what was judged to be a meaningful, useful, and discrete unit. Thus, by the fourth stage of the content analysis, it was decided to achieve group consensus concerning the boundaries of all instances of reinvention in a particular case before content analyzing that case.

Content Analysis: Coding Procedures. The procedures which were used to code the reinvention transcripts were the following:

1. The four researchers formed two pairs for the first 35 cases; they then re-paired, to control for possible dyad biases. The two sets of pairs were both orthogonal to the site-visit pairings, so that each pair had only one individual who had actually visited the site being reviewed. This enabled all 70 cases to be content analyzed, with each case analyzed by two pairs, each of which had one member with first-hand knowledge of the site.

2. One individual from each pair initially reviewed the case and constructed boundaries for reinvention "instances." His decisions were then checked by the second researcher. Prior to analyzing the case, the other pair also checked the "instancing" decisions, and, if necessary, boundaries were redrawn according to the consensus decision of the entire team.

3. Each case, and each instance of reinvention, was coded by both pairs of researchers independently. Coding was performed according to the definition and typology of reinvention described in the following sections.

4. Following independent coding of a case, the two pairs' ratings were checked for reliability.
After agreement/disagreement for each instance was determined, a consensus decision was made as to the best coding. (Criteria for these decisions are described in the following sections.)

Empirically-Based Definition of Reinvention

For the purposes of this study, reinvention was defined as the use of materials, activities, procedures, or organizational structures, by organizations implementing modified RD&D-model programs, that cannot be adequately explained using the framework provided by the developer-defined program components and variations. Reinvention was treated in this research as a fidelity-based construct. "Instances," or units, of reinvention were identified by the use of materials, activities, procedures, or structures, rather than by the components they related to in the fidelity instrument. This provision was required since one instance of reinvention could be related to one, several, or many components, depending on interpretation. Thus reinvention was defined as "instance-based" rather than "component-based."

Instances of reinvention were required to remain within the confines of the program and the implementing site. In other words, materials, activities, procedures, and structures which were implemented outside the organization that housed the program, and/or which were not part of the program's implementation at the site, were not considered to be instances of reinvention.

In addition, a single instance of reinvention was differentiated from a broad organizational practice that could be further divided into two or more instances of reinvention. For example, an educational program developed and disseminated as a "mainstream" program might be implemented at a particular site as a Special Education program, thus involving a number of activities, procedures, and structures which differ from the original model (e.g., the various steps involved in the Individualized Educational Plan described above). Decisions concerning the "boundaries of instances" were therefore somewhat arbitrary. However, it is important to maintain consistency throughout a specific content analysis, and throughout content analyses which might be compared to that analysis. Consequently, decisions concerning the boundaries of instances were made using a group consensus procedure to insure consistency.

A given use of materials, procedures, activities, or structures was not considered to be reinvention if it could be reasonably fit within the boundaries of the developer-defined components and associated variations. Such practices were adequately discussed in terms of variations in developer-defined levels of fidelity, and using the additional concept of reinvention would only have confused the issue. Careful consideration was therefore given to the entire set of components and variations when deciding whether or not a specific practice should be called reinvention. It was not uncommon to find that an instance which first appeared to be reinvention was actually a "specific implementation of a vague component or variation."

These "specific implementations" occurred as three major types. First, the developer may have purposefully defined the component in vague terms. That is, even though RD&D-model programs must be relatively well-specified and explicit to fit the RD&D approach, a range of specification detail necessarily exists. For example, a very explicit program objective may be well-specified, but the means of achieving the objective might be left up to the implementors.
A specific instance from the present data set which illustrates this general example was the procedure implemented by a Montgomery County Pre-Release Center site for informing prospective employers of the site's residents that the applicants they were interviewing for jobs were clients of the site's pre-release program. This site sent letters written by Center staff to the employers explaining the job applicants' status. In this case, the original program developer had specified that employers should be informed; however, the means (e.g., mailed letter, letter hand-delivered by applicant, phone call from staff, visit by staff, etc.) was not specified, and was not identified as a set of variations during the construction of the fidelity instrument.

Second, an organizational practice could be described as a specific implementation of a vague component if the component was not well-specified or explicit in the fidelity instrument, but had been originally defined and disseminated by the developer in clear and precise terms. In this case, an error was made by the researchers in the original content analysis of developer responses that was used to construct fidelity components. Again, this would not be considered reinvention.

The third type of "specific implementation" resulted from decisions by the researchers during the development of the fidelity instrument to delete potential implementation components due to their apparent insignificance. In retrospect, these potential components could be seen to be reasonable specifications of program aspects. In short, the first type of "specific implementation" resulted from the actual program definition and dissemination, while the second and third types resulted from measurement error.

Besides "specific implementations," other examples of instances from the transcripts which were not classified as reinvention were for the most part a result of inaccuracies in the memories of the site visitors concerning the specific details of component variations.

With regard to inter-coder reliability, each instance was classified as reinvention or not-reinvention by each pair, independently. All codings were compared to check reliability. For judgments which classified an instance as reinvention or not-reinvention, the percentage agreement between pairs was .80 across all instances.

Typology of Reinvention

The category scheme used to code instances of reinvention was two-dimensional. One dimension was used to code the instance as either an addition to the original program or a modification of the program. The second dimension was used to code the instance as either proactive or reactive. The Reactive category was further divided into Internal or External Reinvention, depending on the source of the constraint(s) which influenced the reinvention. Finally, all instances of reinvention were rated on a three-point importance scale. This typology is summarized in Figure 2. The following sections describe the typology in greater detail.

Addition-Modification. Additions were defined as materials, procedures, activities, or
organizational structures which were supplemental to the original program.

Figure 2. Typology and Extent of Reinvention

Table 1. Summary of Methods (method of data collection, instrumentation, reliability, and validity for each variable)

RESULTS

This chapter presents the results of data analyses designed to answer the eight research questions listed at the end of the Introduction (p. 19). Although all eight questions are of interest, attention should be focused on questions (1), (2), and (4):

1. Given a relatively precise and accurate operationalization of program fidelity, to what extent are social program innovations which are developed and disseminated according to the modified RD&D approach actually implemented with fidelity at program sites?

2. To what extent is the fidelity of program implementation related to program effectiveness?

4. What are the relationships among fidelity, reinvention, and effectiveness?

These questions are of primary importance since they are the most critical to testing the viability of the modified RD&D approach, which is the primary purpose of this research.

This chapter is structured as follows: Analyses pertaining to the three major variables (fidelity, reinvention, and effectiveness) in and of themselves are presented in sequence. For each variable, descriptive analyses are presented, followed by comparative analyses. The chapter concludes with a presentation of analyses examining the relationships among fidelity, reinvention, and effectiveness.

Fidelity Per Se

Overview

The paucity of previous empirical research in the implementation area argues in favor of treating the data analyses of fidelity per se as exploratory. In this spirit, considerable attention is given to the descriptive analyses. Also in this vein, comparative analyses are presented using two different scoring systems: the original three point scaling of variations (ideal variation = 2, acceptable = 1, unacceptable = 0) and also a two point scale (ideal or acceptable = 1, unacceptable = 0). The rationale for using the two point scale was the following: When using the two point scale, it is necessary to assume only that distinctions between "ideal/acceptable" and "unacceptable" variations are measurable and consistent across components and across programs. Analyses using the three point scale require the additional assumption that "ideal" variations are measurably and consistently different from "acceptable" variations.
Since distinctions between "ideal" and "acceptable" variations are more difficult to make than distinctions between "ideal/acceptable" and "unacceptable" variations, it was felt that analyses using the two point scale would provide more conservative tests of differences between programs on fidelity. The use of a two point scale also reduced scale variance and made between-program differences more difficult to detect. However, even with this conservative approach, the results of the analyses of variance presented below should be viewed with caution, since the sets of components used to measure the fidelity of different programs were to some extent program-specific. Component and variation sets differed across programs with respect to (a) the number of components per set, (b) the degree of explicitness with which individual variations were written, and (c) the number of three-variation vs. two-variation components per set. Thus analyses of variance should not be considered to be accurate tests of quantitative between-program differences. Instead, they should be viewed as exploratory; they provide limited evidence for generating hypotheses to be tested in future research.

Are Modified RD&D Social Programs Implemented with Fidelity?

Telephone Interview Results. The initial research question addressed the extent to which modified RD&D social program innovations are implemented with fidelity at user sites. For this analysis and most of the following analyses, "site fidelity scores" were computed as average-item fidelity scores for each site (the sum of fidelity item scores divided by the number of completed items per site). Thus the unit of analysis, unless otherwise indicated, was the site (the implementing organization). In many of the analyses, these site scores were aggregated within program (i.e., within each of the seven innovations).

Table 2 shows that the mean fidelity scores for each program were all greater than 1.0 and therefore fall within the acceptable range. Both the mean and median across programs equaled 1.33 when measured with the three point scale (2 = ideal, 1 = acceptable, 0 = unacceptable). The standard deviation equaled .28. The distribution of two point scale scores (1 = ideal or acceptable, 0 = unacceptable) also indicated a moderately high fidelity pattern. The mean of the distribution was .73, and the median was also .73. The standard deviation was .14. Both three and two point distributions were clearly skewed in the direction of high fidelity.
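For concreteness, a minimal sketch of the site-level scoring just described (Python; the item codes for the single hypothetical site are invented), computing the average-item fidelity score under both the three point and the collapsed two point scaling:

```python
# Average-item fidelity for one site. Item codes: 2 = ideal,
# 1 = acceptable, 0 = unacceptable; None = item not completed.
# The codes below are invented for illustration.
items = [2, 1, 2, 0, 1, None, 2, 1, 1, 0]

completed = [x for x in items if x is not None]

# Three point scoring: mean of the 0/1/2 codes across completed items.
three_point = sum(completed) / len(completed)

# Two point scoring: ideal and acceptable collapse to 1.
two_point = sum(1 for x in completed if x >= 1) / len(completed)

print(f"three point = {three_point:.2f}, two point = {two_point:.2f}")
# three point = 1.11, two point = 0.78
```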
Descriptive statistics for these results are summarized in Table 2 (I), and are visually portrayed in Figure 4. Site-Visit Results. The site-visit results also provided support for the modified RD&D approach. As shown in Figure 4, four of the seven programs clearly scored in the acceptable range, with a mean fidelity average-item score across sites greater than one (X = 1.13). Of the three programs which did not clearly exceed the acceptable level, means of .94, .86, and .86 indicated scores close to the acceptable level.; Of the four means which exceeded the acceptable value, two were from the Educational policy area and two were from the Criminal Justice area. The median of scores across programs was 1.15. The standard deviation equalled .30, almost identical to that of the telephone interview distribution. The skew towards high fidelity was less pronounced when compared to the telephone interview scores, with 22 of the 70 sites (31%) scoring below 1.0 (vs. 11% scoring below 1.0 for the telephone sample). The distribution of two point scale scores also revealed a pattern of moderately high fidelity. The mean of this distribution was .64, and the median was .65. Again there was a skew towards high fidelity, with only 21% of the sites scoring below the "acceptable" point. Also following the previous pattern, the standard deviation was nearly identical to that of the telephone interview distribution (telephone interview SD = .14, site visit SD = .15). Table 2 (II) summariZes the descriptive statistics for the site visit data. 68 mucoum xuwpmuwu EmpHsmmem>< new: .e mesmwm Pmcuucmcco “Pave Lacy Loucau mace—az-oce xucaou xcmeomucozuu secooL; :a_u:o>mea os_cu zuwcaszao upuuoomno a—oozUm-elc_zu_r-_oocumv uuomocm mauouum AEeLaoLa co_mcm>mu m.,:c>=m. sacooca co_uccu_ng< xu_c===aonm co_ucu=cw cmocou cuwam ou:o_eoqxuum Emumxm “cosmomcaz mesa .eveh «:0 mac ozone Azacoocn mcvuowcv ammuuam cu acmczum mco upwxufl mz ouwv com. my fiom. mx -.~mk o~.~JM coo. Wu mwm. duh. om.~uk #m.nom mm.nom mm.nom HN.uom m~.uom w~.nom m~.nom oo.~nk. oo.~uM. ~m.~n&. em.~JM. -.~JM mm.~JM me.~mm wu_:ww¢ occzaw—m» Kongo: cauum ma mgpm mouou_uc_ mc_F umzmcu m poon_ ”mu—:mmt xmw>cmac_ «conga—mu mmuou.tc. mcvp uwpom "upoz 69 Comparing Telephone and Site-Visit Results: Visual Comparison. Figure 4 shows the similarity between the two sets of fidelity results scored on a three point scale. Note that the standard deviations of the two data sets are nearly identical as well as the mean fidelity scores. Comparisons Between Telephone Interview and Site-Visit Results: Correlational and Percentage Agreement Analyses As described above in the literature review, few quantitative comparisons between different methods for measuring degree of implementation have been reported. Given the greater costs of site- visits, a careful comparison of telephone interview and site-visit methods is of considerable practical importance. Two general types of telephone versus site-visit comparisons are presented. First, throughout the results section, both sets of results (telephone and site-visit) are described, and the patterns of the data sets are compared for each analysis. Secondly, in the present section, correlational and percentage agreement comparisons are discussed. Correlations Between Data Sets. Two correlational analyses were performed: the first used the iEEE.35 the unit of analysis; the second used the sjtg_as the analysis unit; 1. 
1. For each fidelity item of each program, scores obtained by telephone interviews were correlated with those obtained by site-visits. For all items which could be scored on both two and three point scales, correlations were obtained for both types of scoring. Once the correlations were obtained, they were combined across items within program (using Fisher's Z technique). This resulted in two sets of correlations: one set using the three point scale, and a second set using the two point scale. The mean correlation across programs was .38 for the three point scale, and .44 for the two point scale. The sets of within-program correlations ranged from .26 (ODOT program) to .44 (CAP program) for the three point scale, and from .33 (SCCPP program) to .59 (EBCE program) for the two point scale.

2. The correlations between methods using the site as the unit of analysis ranged from .46 (FOCUS program) to .84 (EBCE program) for the three point scale, with an average correlation across programs of .68.

In order to establish reference points for these correlations, t-tests were used to test the hypotheses that the two types of correlations (item-level and site-level) differed from zero. The results showed that 12 of the 14 item-level correlations differed significantly from zero at the .01 level (with a 13th differing from zero at the .05 level). Of the seven site-level correlations, three differed significantly from zero at the .01 level and a fourth differed from zero at the .05 level. (The three site-level correlations which failed to differ significantly from zero were reasonably high, ranging from .46 to .54.) In sum, the correlational analyses indicated a moderate level of agreement between the two data collection methods.

Percentage Agreement Comparisons. A final analysis used to compare the results of the telephone and site-visit methods employed the percent-agreement technique, using Cohen's kappa statistic to correct for chance agreements. The formula contributed by Cohen is the following:

K = [f(obs) - f(chance)] / [N - f(chance)]

where f(obs) = the frequency of observed agreements; f(chance) = the frequency of agreements to be expected by chance (computed using the marginals of a category x category matrix); and N = the number of opportunities to agree/disagree.

Percent-agreement figures indicated a moderately high level of agreement between methods. The raw figures ranged from .59 (FOCUS) to .73 (HOSTS), and the corrected figures ranged from .290 (SCCP) to .385 (HOSTS). (The two sets of agreement figures are not perfectly correlated, due to the different proportions of two versus three point items between programs.) When averaged across programs, mean percentages of .658 (raw) and .338 (corrected for chance) were obtained.

In sum, three types of between-method comparisons were employed. Two correlational analyses were conducted, first at the item level, second at the site level. Third, percentage agreement figures were obtained, both in raw form and corrected for chance. All analyses indicated a moderate extent of agreement between measures.
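As an illustration of the chance-corrected agreement computation, the following sketch (Python) applies Cohen's formula as given above to an invented set of paired telephone and site-visit codings:

```python
from collections import Counter

# Chance-corrected agreement (Cohen's kappa) in the frequency form given
# above. The paired codings (telephone, site-visit) are invented;
# categories follow the fidelity scaling 0/1/2.
phone = [2, 1, 0, 2, 1, 1, 2, 0, 1, 2]
visit = [2, 1, 1, 2, 1, 0, 2, 0, 2, 2]

n = len(phone)
f_obs = sum(p == v for p, v in zip(phone, visit))

# Expected chance agreements from the marginals of the
# category x category matrix.
row = Counter(phone)
col = Counter(visit)
f_chance = sum(row[c] * col[c] for c in set(row) | set(col)) / n

kappa = (f_obs - f_chance) / (n - f_chance)
print(f"raw agreement = {f_obs / n:.2f}, kappa = {kappa:.3f}")
# raw agreement = 0.70, kappa = 0.531
```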
Are There Differences in Fidelity of Implementation Between Programs?

Telephone Interview Results. A one-way between-program analysis of variance using the three point scale revealed significant differences, F (6, 122) = 9.97, p < .00001, ω² = .295. The between-program analysis of variance for the telephone interview results scored on a two rather than three point scale also revealed highly significant between-program differences, F (6, 122) = 7.71, p < .00001, ω² = .238.

Site-Visit Results. The between-program analysis of variance (with components scored on a three point scale) revealed significant differences in the site-visit data set as well, F (6, 63) = 11.45, p < .001, ω² = .472. The analysis of variance which treated components as dichotomous rather than trichotomous variables again revealed significant differences between programs, F (6, 63) = 8.58, p < .00001, ω² = .395.

Comparing Telephone and Site-Visit Results. The dominant impression produced by comparing analyses of variance between the two data sets is the similarity between the two sets of results. Significant between-program differences were found using both three point and two point scaling.

Are Programs Implemented with Fidelity Across Social Policy Areas?

Telephone Interview Results. A one-way, two-group analysis of variance was performed with the policy areas of Education and Criminal Justice serving as the two groups. This analysis resulted in a significant difference between the two social policy areas in fidelity, F (1, 127) = 20.89, p < .00001, ω² = .133, reflecting a higher mean for the Education group. An examination of program means suggested that the significant difference between the Education and Criminal Justice policy areas directly related to the difference between the high-fidelity HOSTS and EBCE educational programs on the one hand, and the lower-fidelity SCCPP and MCPRC programs on the other hand. Using dichotomous scoring, the results of the analysis of variance duplicated the findings produced by the trichotomous scoring, with slightly diminished significance levels, F (1, 127) = 16.91, p < .0001, ω² = .110.

Site-Visit Results. This one-way, two-group analysis of variance to test the differences between Education and Criminal Justice programs also revealed significant differences between policy areas (although differences between means were relatively smaller than in previous analyses), F (1, 68) = 6.27, p < .01, ω² = .069. However, when components were scored dichotomously rather than trichotomously, the mean of the Education group was no longer significantly greater than the mean of the Criminal Justice group at the .05 level, F (1, 68) = 3.54, p < .064, ω² = .035.

Comparing Telephone and Site-Visit Results. Although differences between policy areas were observed, these differences were less robust than the between-program differences reported above. Using trichotomous scoring, the differences in the site-visit data were smaller than in the telephone data, and when a two point scoring system was used, the Education group mean was no longer significantly greater than the Criminal Justice mean at the .05 level, though still significant at the .10 level. Despite this variation in significance, results followed a consistent pattern with regard to their direction.
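A note on the effect-size index: the text reports omega squared (ω²) with each F test but does not give its formula. The reported values are consistent with the conventional one-way estimator (Hays' omega squared),

$$\hat{\omega}^2 = \frac{SS_{between} - (k - 1)\,MS_{within}}{SS_{total} + MS_{within}}$$

where k is the number of groups; for example, F (6, 122) = 9.97 with N = 129 yields approximately .295, matching the reported value. The index estimates the proportion of population variance in fidelity accounted for by group membership.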
The following 21 reinvention variables can theoretically be computed for each site: (a) the sum of reinvention "instances;" (b) the sum of reinvention instances weighted by importance ratings (weighted instances); (c) weighted instance scores divided by the total number of instances (average weighted instances); (d) the sum of modifications; (e) the sum of additions; (f) the sum of instances rated as "proactive" (proactive instances); (9) the sum of instances rated "reactive" (reactive instances); (h) the sum of instances rated "internal reactive;" (i) the sum of instances rated "external reactive;" and (j) to (u) the weighted and average weighted scores which correspond to variables (d) and (i). However, in the interests of clarity and brevity, only the interesting and significant results will be discussed. The same general data analysis strategy which was followed for examining fidelity will be used to discuss reinvention. (However, given the multiplicity of variables, detailed descriptions of the distributions are described in Appendix B, with the main features highlighted in this chapter.) Differences between programs and policy areas are reported following descriptive information. Since the reinvention data collected during the telephone interview phase were used primarily to develop the conceptualization of reinvention to be used in the site-visits, only analyses of site- visit data will be reported. Thus the number of sites for most analyses will be N = 70. It should also be noted that the answer to research question #3 (concerning the definition and typology of reinvention) is discussed in the Method section, above. 75 A final issue concerns the different ways of measuring the extent of reinvention at each site. The first three reinvention variables listed above (the sum of instances, the sum of weighted instances, and average weighted instances) represent different overall summations of instances across reinvention categories. Each of these variables has different implications for interpretation. Recall that the decision rules for establishing "boundaries" between instances were fairly arbitrary and were set using a consensus process. Also recall that reinvention is a fidelity-based concept, and that the number of instances per program is partially a function of the fidelity instrument. Finally, recall that each instance was rated on a three point importance scale which was used to weight the instances. Given these three factors, the three summative variables can be compared as follows. The first variable (unweighted sum of instances) is useful for examining "reinvention: as distinct from "importance." This is a fairly crude index of extent of reinvention, since each instance's importance can be said to affect the extent of reinvention per site. The second variable (weighted sum of instances) uses importance weights, while the third variable (average weighted instances) controls for the number of instances per site. Although controlling for the number of instances has intuitive appeal, this third variable is actually a highly ambiguous index of the extent of reinvention. For example, consider the fbllowing equation: 5 instances x 3(rating per instances) + 5 instances x 1 = 2 10 total instances 76 In this case, the average weighted score equals 2. However, also consider this equation: 2 instances x 2 = 2 2Ttotal instances Note that the second equation represents a site which intuitively has a different "extent of reinvention" compared to the site represented by the first equation. 
Thus, average weighted scores obscure important differences between sites and are therefore omitted from the following discussion. In brief, the sum of importance-weighted instances of reinvention appears to be the most meaningful index of the extent of reinvention. Descriptive Analyses The descriptive statistics discussed below are summarized in Table 3. The following results are especially noteworthy: l. The median number of unweighted instances of reinvention per site was 6.0, and the distribution was fairly homogeneous across sites. This indicates a moderate amount of reinvention occurring throughout the sample. 2. Many of the instances were rated as relatively unimportant, as indicated by the positive skew of the importance-weighted reinvention scores, and the low median for these scores, which was close to the unweighted median (unweighted median = 7.0). 3. There was only a slight positive relationship between the number of instances of reinvention per site and importance ratings (r = .188, NS). This indicated that many sites with relatively many occurrences of reinvention also had mostly unimportant reinventions. 77 Table 3 Descriptive Statistics: Reinvention Median Mode Range Mean SD Sum of Instances 6 3 0-21 6.53 4.79 (absolute frequencies) Weighted Instances 7 9/11* 0-36 8.63 7.01 (by importance ratings) Average Weighted Instances 1.20 1.0 0-3 1.26 .487 (weighted instances 9 # of instances) Sum of Modifications 2 2 0-11 3.04 2.07 Weighted Sum of 3 O/2* O-l7 3.80 3.37 Modifications Sum of Additions 3 1/4* 0-14 3.49 3.24 Weighted Sum of Additions 4 O/l/5* 0-25 4.84 4.98 Sum of Proactive Instances 4 0/4* 0-19 5.03 4.39 Weighted Sum of Proactive 5 0 0-31 6.37 6.07 Instances Sum of Reactive Instances l 0 0-6 1.50 1.47 Weighted Sum of Reactive 1 0 0-10 2.27 2.44 Instances Sum of Internal Reactive O 0 0-4 .086 .705 Instances Sum of External Reactive 0 0 0-6 1.13 1.41 Instances *Multimodal distribution 78 4. Distributions of sums and weighted sums for both addition and modification were positively skewed, reflecting the greater frequencyyof low-scoring than high-scoring sites. A comparison of the central tendencies of importance-weighted modifications (median = 3.0, mean = 3.8) vs. importance-weighted additions (median = 4.0, mean = 4.8) combined with the fact that the unweighted sums were not very different (N of modifications = 213, N of additions = 244) suggests that additions were rated as more important than modifications. 5. An examination of the distributions of additions and modifications by program (Table 4A) indicates two interesting patterns. The distribution of additions shows a relatively uniform progression from low to high frequency. The distribution of modifications also shows a uniform progression, with the clear exception of the ODOT program. This innovation had 29 more instances of modification than the next highest program. Also of interest was the comparison between the ODOT mean of unweighted modification sums (6.6) and the mean of weighted modification sums (7.2) (Table 4B). The difference between these means (7.2 - 6.6 = .60) was the second lowest difference of its type among the seven programs. This indicated that ODOT modifications were in general rated as trivial compared to those of other programs. Thus ODOT can be viewed as an "outlier," with relatively many, highly trivial modifications. 6. The most noticeable feature of the proactive-reactive ’ dimension was the extremely low frequency of reactive instances. 
The mode of the reactive distribution was 0.0; 42 out of 70 sites had no reactive reinvention whatsoever. However, it should be realized that the low frequency of reactive reinvention is partly artifactual. The proactive-reactive dimension was more sensitive to the post hoc content analysis procedure than the addition-modification dimension (since coding decisions regarding proactive vs. reactive coding required case history information), while distinctions between additions and modifications could be made largely on the basis of the relationship of the reinvention instance to the program's components and variations.

7. The low frequency of coded reactive reinvention resulted in extremely low frequencies of internal vs. external reactive reinvention. The modes and medians of both these distributions equaled zero. The internal-external dimension was therefore excluded from further analyses.

Table 4
Modification and Addition by Program: Descriptive Statistics

A. Ranked distribution of unweighted sums of instances

        Modification                      Addition
Rank  Program  Unweighted Sum    Rank  Program  Unweighted Sum
 1    ODOT         66             1    FOCUS        50
 2    SCCPP        37             2    CAP          50
 3    CAP          34             3    MCPRC        49
 4    MCPRC        24             4    EBCE         33
 5    FOCUS        18             5    SCCPP        26
 6    EBCE         17             6    ODOT         21
 7    HOSTS        17             7    HOSTS        15

B. Unweighted and weighted modification sums

Program  Mean of Unweighted Sums  Mean of Weighted Sums  Difference Score
HOSTS            1.7                      2.8            2.8 - 1.7 = 1.1
EBCE             1.7                      1.9            1.9 - 1.7 =  .2
FOCUS            1.8                      2.6            2.6 - 1.8 =  .8
ODOT             6.6                      7.2            7.2 - 6.6 =  .6
CAP              3.4                      4.3            4.3 - 3.4 =  .9
SCCPP            3.7                      4.8            4.8 - 3.7 = 1.1
MCPRC            2.4                      3.0            3.0 - 2.4 =  .6

Differences Between Programs

Sum of Instances. A one-way analysis of variance revealed no significant differences between programs on the sum of instances per program. The program means ranged from 3.2 (HOSTS) to 8.7 (ODOT), with an overall mean of 6.53. The standard deviations ranged from 2.10 (HOSTS) to 6.06 (MCPRC). The lack of differences reflects the previously described homogeneous distribution.

Weighted Instances. Using a one-way analysis of variance to examine reinvention instances weighted by importance ratings revealed no significant differences between programs. Means ranged from 5.0 (HOSTS) to 12.40 (CAP), with standard deviations ranging from 3.78 (ODOT) to 10.01 (SCCPP). The overall mean was 8.63.

Unweighted Category Sums. This section reviews the unweighted sums of instances analyzed within reinvention category (e.g., addition, modification, etc.) and across programs. Significant differences were obtained between programs on both modifications, F (6, 63) = 6.46, p < .00001, ω² = .319, and additions, F (6, 63) = 2.35, p < .041, ω² = .104. There were no significant differences between programs on unweighted sums of proactive or reactive reinvention instances.

The relatively extreme differences between programs on number of modifications per site suggested further analyses. A post hoc Scheffé test indicated that the ODOT program had significantly more instances of modification compared to all of the Education programs, and one of the Criminal Justice programs (SCCPP). There were no significant differences among the remaining programs. Thus, as noted above, the ODOT program appeared to be an "outlier." It had more modifications than other programs, and these were rated as more trivial than those of other programs. The importance of this finding will become evident when the addition-modification dimension is related to other variables, as discussed below.
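The between-program tests reported in this section are one-way analyses of variance with omega-squared (ω²) effect size estimates. The following is a minimal sketch of that computation, assuming site-level scores grouped by program; the scores generated below are placeholders, not the study's data.

```python
# Sketch of a between-program test: one-way ANOVA F with the usual
# omega-squared estimate (SS_between - (k - 1) * MS_within) /
# (SS_total + MS_within). Input: one array of site scores per program.
import numpy as np
from scipy import stats

def one_way_anova_omega2(groups):
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = np.mean(np.concatenate(groups))
    ss_between = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups)
    ms_within = ss_within / (n - k)
    f = (ss_between / (k - 1)) / ms_within
    p = stats.f.sf(f, k - 1, n - k)
    omega2 = (ss_between - (k - 1) * ms_within) / (ss_between + ss_within + ms_within)
    return f, p, omega2

# e.g., weighted modification sums for seven programs of ten sites each
# (made-up data standing in for the seven-program design)
rng = np.random.default_rng(0)
programs = [rng.poisson(lam, 10).astype(float) for lam in (3, 2, 3, 7, 4, 5, 3)]
print(one_way_anova_omega2(programs))
```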
Weighted Category Sums. Significant differences were again found between programs on modifications, F (6, 63) = 3.46, p < .005, ω² = .175, and additions, F (6, 63) = 2.78, p < .02, ω² = .132. Significant differences were also obtained on weighted sums of reactive reinvention instances, F (6, 63) = 2.75, p < .020, ω² = .131, but not proactive instances.

Differences Between Policy Areas

Overall Reinvention Scores. A one-way analysis of variance revealed differences between policy areas on unweighted sums of instances, F (1, 67) = 5.44, p < .022, ω² = .063. Adding importance weights produced results which approached significance at the .05 level (p < .068).

Reinvention Categories. For the unweighted sums, significant between-policy-area differences were obtained for number of modifications, F (1, 68) = 15.20, p < .0002, ω² = .169, and number of reactive reinventions, F (1, 68) = 7.55, p < .008, ω² = .086. Regarding the weighted sums, differences were obtained for the sum of modifications, F (1, 68) = 9.70, p < .003, ω² = .111, and for the sum of reactive reinventions, F (1, 68) = 4.60, p < .036, ω² = .049.

Effectiveness Per Se

Overview

This section describes the program-by-program effectiveness results in three ways: (a) the types of data which were collected are described; (b) descriptive statistics summarizing results across sites are presented; and (c) the summary statistics for the sample sites are compared to the results produced by evaluations of original demonstration sites. Appendix C details these descriptions, while this chapter presents the most important issues and findings. Before presenting these results, the following issues are of interest:

1. The importance of comparing research sample sites with original sites on effectiveness becomes evident if one considers the possibility that adopting sites produced effectiveness scores which differed by several orders of magnitude from demonstration sites. Examining the relationship between fidelity and effectiveness would be of less value as a test of the modified RD&D approach if this proved to be the case. Consequently, a comparison between research sample sites and demonstration sites is of considerable interest.

2. Summary statistics on adopting sites were computed using the site as the analysis unit, since individual-level data were frequently unavailable. This precluded statistical comparisons with demonstration site evaluations, which used individual clients as the unit of analysis. However, the purpose of obtaining a rough comparison of the adopting-site and original-site samples is served by examining site-level summary statistics, e.g., means and standard deviations, for both samples.

3. In all cases, the most recent effectiveness results were sought in order to maximize the validity of comparisons between effectiveness and fidelity. (Recall that fidelity data were collected during fall, 1981 [telephone] and winter/spring, 1982 [site-visits].)

4. For all innovations with the exception of the HOSTS reading program, multiple measures were obtained. The ranking procedure utilized these measures as follows: sites were ranked within program on each separate measure available. (Some sites had data available for all measures, while others only had partial data.) Ranks were then averaged across measures to produce an overall "objective" rank for each site. These averages were unweighted.
As described in Appendix C, there was considerable variation among sites regarding the number and type of measures available and the appropriateness of the time period covered. It was therefore decided, in cases where data were of lower quality, to check the objective ranks against the site visitors' subjective impressions. Subjective impressions were also used to make decisions on tied ranks.

5. For HOSTS, FOCUS, and SCCPP, change scores were used as data for the ranking procedure. The use of change scores is generally inadvisable due to the regression-towards-the-mean effect (Campbell & Stanley, 1966). However, their use was considered permissible for the purpose of this research, since they were to be employed as the basis for the ranking procedure, for which relatively gross estimates of effectiveness were sufficient.

Types of Data

The measures of effectiveness which were used appear above in Figure 3. Data for these measures were obtained with varying success across programs. Four programs had all ten sites each providing adequate data. ("Adequate data" refers to a sufficient amount of data to make an objective judgment on ranking.) Two programs (HOSTS and FOCUS) had nine sites providing adequate data, and one program (SCCPP) had seven sites which provided adequate data. Thus the total number of sites providing adequate data was 65. Details concerning the types of data available for each program are elaborated in Appendix C.

Comparisons of Sample Sites with Demonstration Sites

Given the cost of obtaining the necessary data to perform statistical tests between research and demonstration site samples (e.g., obtaining standard deviations for research-sample client scores), and given the lack of consistency in the availability of measures, statistical tests of the similarity of the two samples (research sample and original demonstration sample) were not attempted. However, one can obtain a rough picture of the comparative effectiveness from the data presented in Appendix C. In general, the two samples were similar. Briefly reviewing each program, the HOSTS research sample was slightly inferior to the demonstration site. (They differed by four NCE's; the NCE scale is an equal-interval percentile scale.) The FOCUS research sample was slightly superior to the demonstration site on achievement test scores, while extremely high variance and missing data made comparisons on GPA's much less meaningful. The ODOT research sample was somewhat inferior to the demonstration site on two indices of juror management efficiency, while the CAP research sites were slightly inferior on one measure (recidivism) and somewhat inferior on a second measure (percent of youths returned to State's Attorney). The MCPRC research sites were somewhat superior on recidivism rates and somewhat inferior on percent of residents employed, when compared to the demonstration site. Meaningful comparisons could not be made for the SCCPP sites (since only three sites reported neighborhood-specific data) or for the EBCE program (data on the measures used for the research sites were not available for the demonstration sites). In sum, given the available data, the sample of implementing sites can be considered to be roughly comparable to the sample of demonstration sites on program effectiveness.
Relationships Among Fidelity, Reinvention, and Effectiveness

Use of Site Visit Data

As explained previously, effectiveness and reinvention scores were obtained during the site visits, but not during the telephone interviews. Due to the temporal contiguity of the site visit fidelity data with the reinvention and effectiveness data, these fidelity scores were used in the analyses reported below, rather than the fidelity data obtained through the phone interviews. This avoids the confounding effect of history (Campbell & Stanley, 1966) on the interpretation of correlations between fidelity and the other two variables.

Tests for Non-Linearity

Before computing correlations among these three variables, scatterplots of the three relationships (fidelity-reinvention, fidelity-effectiveness, and reinvention-effectiveness) were examined for non-linearity. All relationships were observed to form acceptably linear patterns.

Data Transformations

Transformations were performed in order to change raw site-level scores to scores which were most amenable to comparisons across programs. This was accomplished by standardizing fidelity and reinvention scores and normalizing effectiveness scores. (Raw scores for fidelity were average-item scores; for reinvention, raw scores were importance-weighted sums of reinvention instances; raw scores for effectiveness were the within-program ranks.) Each transformation followed a somewhat different rationale, and these are reviewed in the following sections.

Standardized Fidelity Scores. As has been previously noted, fidelity component-and-variation lists for the seven different programs differed on (a) the actual number of components (specificity); (b) the degree of concreteness with which components were operationalized into variations (explicitness); and (c) the number of three-variation vs. two-variation components. Therefore, the assumption that differences between program means reflect "real" differences is open to question. This assumption can be avoided for the purpose of across-program correlational analyses by standardizing scores within programs. That is, the program mean was subtracted from each site's average-item score, and the result was divided by the program standard deviation, to produce a standardized score for the site.

Standardized Reinvention Scores. Although the reinvention coding system was developed to be uniform across programs, some dependency on the fidelity instrument existed. This is evident with regard to component explicitness. In other words, the degree of explicitness with which a component was operationalized into a set of variations influenced the extent to which reinvention (i.e., change which could not be coded using the component-variation framework) could be detected and recorded. Therefore, reinvention scores were also standardized. As discussed in the section on Reinvention Per Se, simple reinvention sums were an inadequate representation of the extent of reinvention, and average reinvention scores were somewhat ambiguous. It was therefore decided to use only the importance-weighted sums of reinvention instances for correlational analyses. Again, standardization was accomplished using within-program means and standard deviations.

Normalized Effectiveness Scores. Recall that effectiveness measures differed across programs, and that a ranking procedure was used to measure the relative effectiveness of sites within programs. It was necessary to transform these ranks, due to differences across programs in the number of sites which did not provide adequate effectiveness data. Four programs had all 10 sites providing adequate data; two programs had nine sites; and one program had seven sites. (The normalization procedure transforms rankings so that the difference between ranks becomes equivalent across programs; e.g., first-ranked sites in program distributions with 10 rankings [10 sites] and seven rankings [seven sites], respectively, are placed on the same scale, so that the seventh-ranked site in the seven-site set is at an equivalent scale point to the tenth-ranked site in the ten-site set.)
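A compact sketch of both transformations follows. The within-program z-standardization is as described above; the rank normalization formula is not spelled out here, so a linear rescaling that pins the first and last ranks to common endpoints is assumed as one plausible reading.

```python
# Sketch of the two transformations, under stated assumptions: fidelity and
# reinvention scores are z-standardized within their own program, and
# within-program ranks are rescaled so that first and last ranks coincide
# across programs regardless of how many sites supplied data.
import numpy as np

def standardize_within_program(scores):
    """z-score site-level scores against their own program's mean and SD."""
    x = np.asarray(scores, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def normalize_ranks(ranks, n_sites):
    """Place ranks 1..n_sites on a common scale: best = 0.0, worst = 1.0."""
    r = np.asarray(ranks, dtype=float)
    return (r - 1.0) / (n_sites - 1.0)

print(normalize_ranks([1, 4, 7], 7))      # [0.0, 0.5, 1.0]
print(normalize_ranks([1, 5.5, 10], 10))  # [0.0, 0.5, 1.0] -- comparable
```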
Simple Correlations: Relationships Among the Major Variables

The simple and partial correlational analyses presented in these sections were performed in order to address the following research questions:

2. To what extent is the fidelity of program implementation related to program effectiveness?

4. What are the relationships among fidelity, reinvention, and effectiveness?

The zero-order Pearson correlations among fidelity, reinvention, and effectiveness were the following:

fidelity-reinvention: r = .52, N = 70, p < .001
fidelity-effectiveness: r = .38, N = 65, p < .001
reinvention-effectiveness: r = .33, N = 65, p < .004

Note that all correlations are strongly positive. The correlation between fidelity and reinvention supports a conceptualization of reinvention different from merely "low fidelity"; i.e., these data show that sites with higher levels of fidelity had higher, not lower, scores on extent of reinvention. These results also provide strong support for the viability of the RD&D approach, as shown by the high correlation between fidelity and effectiveness.

Given the importance of the fidelity-effectiveness correlation to the test of the RD&D model's viability, the "true" correlation between these variables was estimated with the effects of the measures' unreliability controlled. This correction for attenuation procedure is usually performed using internal consistency reliabilities (Nunnally, 1978). In the present case, the true correlation was estimated using the inter-coder reliabilities. Recall that these had been obtained using the percent agreement method for the fidelity data, and an average Spearman rank-order correlation for the effectiveness data. The estimation procedure employing these two reliability estimates produced a correlation between fidelity and effectiveness of r = .44.

Partial Correlations

Given the set of correlations among the three variables, the question of spurious relationships was tested. The possibility that the reinvention-effectiveness correlation was actually spurious, resulting from the shared variance of fidelity and reinvention, was tested using partial correlation analysis. The parallel hypothesis (that the fidelity-effectiveness correlation was spurious) was also tested. First, the relationship between fidelity and effectiveness was examined, with the variance contributed by reinvention controlled. This analysis produced a first-order partial correlation of r(fe.r) = .26, N = 65, p = .019. Second, the relationship between reinvention and effectiveness was examined with the effects of fidelity controlled. The first-order partial correlation resulting from this analysis was r(re.f) = .17, N = 65, NS. This provided equivocal support for the statement that the fidelity-effectiveness relationship was independent from the effects of reinvention.
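Both the attenuation correction and the first-order partials above follow standard formulas. The sketch below reproduces the reported coefficients from the zero-order correlations; the fidelity reliability of .83 is an inferred value chosen to recover the r = .44 estimate, since only the effectiveness reliability (a corrected Spearman rho of .90, reported in the Discussion) is stated.

```python
# First-order partial correlations computed from the zero-order
# correlations reported above, plus the standard correction for
# attenuation. The outputs reproduce (to rounding) the text's values.
from math import sqrt

def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y with z controlled."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))

r_fr, r_fe, r_re = 0.52, 0.38, 0.33   # fidelity-reinvention, fidelity-
                                      # effectiveness, reinvention-effectiveness

print(round(partial_r(r_fe, r_fr, r_re), 2))  # 0.26: fidelity-effectiveness,
                                              # reinvention controlled
print(round(partial_r(r_re, r_fr, r_fe), 2))  # 0.17: reinvention-effectiveness,
                                              # fidelity controlled

def disattenuate(r_xy, rel_x, rel_y):
    """Estimated 'true' correlation given the two measures' reliabilities."""
    return r_xy / sqrt(rel_x * rel_y)

# .90 is the reported effectiveness reliability; .83 is an assumed fidelity
# reliability, not a value stated in the text.
print(round(disattenuate(0.38, 0.83, 0.90), 2))  # 0.44
```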
A second set of correlations was examined to further analyze potentially spurious relationships. These analyses involved the addition-modification dimension of the reinvention typology. It was hypothesized that the positive relationship between reinvention and effectiveness was largely due to contributions to programs made by additions, rather than modifications, due to a greater likelihood of additions being "in the same direction" as high-fidelity variations. Some support for this hypothesis was provided by the zero-order correlations between addition and effectiveness (r = .40, N = 65, p < .001) and between modification and effectiveness (r = .17, N = 65, NS). However, since addition and modification were positively correlated (r = .58, N = 70, p = .001), the effects of these two variables were controlled using partial correlations. The first-order partial correlation between addition and effectiveness with modification controlled was r(ae.m) = .38, N = 65, p < .001. The partial correlation between modification and effectiveness, while controlling the variance contributed by addition, was r(me.a) = -.09, N = 65, NS. Note the significant, positive correlation produced by the first partial correlation, and the marginally negative, non-significant correlation produced by the second. This clearly supports the hypothesis that additions, rather than modifications, contribute to program effectiveness.

It had been observed in the previous section of this chapter that the ODOT program had a disproportionate number of modifications. Therefore, the relations of addition and modification to effectiveness were further examined to determine whether or not the ODOT program made a disproportionate contribution to the weak overall modification-effectiveness relationship. Table 5 contains the ordering of partial correlations by program, and shows that the ODOT program did in fact have the largest negative correlation between modification and effectiveness, with the variance due to addition controlled. However, the table also shows that four of the six other programs also had negative correlations between modification and effectiveness. The unusual characteristics of the HOSTS program are also a clear influence on the direction of the overall partial correlation analyses. Table 5 shows that this program had the highest correlation between modification and effectiveness, and the lowest correlation between addition and effectiveness, by wide margins. However, it should be noted that the program-by-program analyses employed small sample sizes, and are thus likely to be influenced by sampling error effects. Also, the ODOT program had the largest number of modifications, while the HOSTS program had the fewest number of additions and among the fewest number of modifications. Therefore, these within-program correlational analyses should be viewed as suggesting relationships, rather than as clear indications of effects.

Table 5
Rank-Ordering of Partial Correlations with Effectiveness: First-Order Correlations Between Addition and Effectiveness (Modification Controlled) and Between Modification and Effectiveness (Addition Controlled), by Program

[The values in this table could not be recovered from the source scan.]

DISCUSSION

This chapter reviews the major findings of the present research in the context of the research questions introduced in Chapter I, and discusses the implications of these findings.
Question 1: Are Modified RD&D Programs Implemented with Fidelity at Adopting Sites?

The results of this study showed that seven innovative social programs developed and disseminated using the modified RD&D approach have been implemented with acceptable fidelity at many adopting sites. Distributions of fidelity scores were skewed in a positive direction, and measures of central tendency fell within the acceptable range. These results were generally obtained by both the telephone interview and site visit methods of data collection. This claim can be made with some confidence, since the precision and accuracy of the fidelity measure were foci of measurement development and data collection. A precise measure was developed by content analyzing interviews with developers and extracting lists of components and variations which operationalized programs in relatively specific and explicit terms. Accuracy of measurement was tested by carefully monitoring inter-coder reliability, and by checking for agreement between data sources. Substantial agreement between the telephone and site visit results also attested to the accuracy of measurement.

Question 1a: Are There Differences Between Sample Programs on Fidelity?

Question 1b: Are There Differences Between the Two Policy Areas (Education vs. Criminal Justice) on Fidelity?

Despite the overall pattern of acceptable implementation, there were significant differences found between programs, both within and between the two policy areas. For the most part, differences appeared regardless of whether components were scored on three- or two-point scales, although support for differences between policy areas was weaker than that shown for differences between programs.

Questions 1, 1a, 1b: Implications

Given the fact that only seven programs were examined, and given the somewhat unorthodox use of analysis of variance to examine differences between programs and policy areas, the statistical significance of results attesting to between-program and between-policy-area differences should be viewed with caution. However, these results clearly suggest that although it is quite possible to achieve high fidelity social program implementation using the modified RD&D approach, such implementation is far from automatic. The results also suggest that differences between policy areas as well as between programs within policy areas exist, although the evidence supporting the policy area differences is admittedly weaker.

Differences between policy areas may be due to the different dissemination tactics used. For example, the two programs which were implemented with the highest fidelity (HOSTS and EBCE) were disseminated using highly sophisticated methods. These included elaborate networking (national, regional, and local conferences), considerable on-going technical support, explicit preadoption agreements with administrators, and requirements for monitoring and evaluating program outcomes. The dissemination methods used for the Criminal Justice programs were in general less elaborate. Interestingly, the Criminal Justice program which exhibited the highest fidelity (One-Day One-Trial) also appears to have benefited from the most extensive dissemination efforts. However, since the extent of differences between programs on such variables was not measured in this study, these ideas should be viewed as grist for a hypothesis worthy of further study, rather than as a conclusion supported by this research.
On the other hand, the findings of the present research clearly support the position that abandoning the modified RD&D approach at this point would be premature. In other words, those who criticize the RD&D approach as unrealistic or foolhardy (e.g., Farrar, deSanctis, & Cohen, 1982) might re-examine the vociferousness of their position in light of these results. Yet in contrasting the results of the present study with those of studies used to attack the RD&D model, several points should be clarified. This may best be accomplished by contrasting the present research with the most widely-cited of the "anti-RD&D" studies, the RAND research reported by Berman and McLaughlin (1978).

First, the samples of innovations differed considerably between the two studies. While the RAND study examined the implementation of loosely defined policies and the translation of these policies into various projects, the present research investigated the implementation of explicitly defined and highly specified programs which fit the conceptual parameters of the modified RD&D approach.

Second, Berman and McLaughlin (1977) used as their implementation outcome measure "the extent to which projects met their own goals," thus building adaptation into both methodology and findings. The present study has attempted to separate the concepts of fidelity and adaptation in two ways. Rather than defining implementation outcomes in terms of project goals, the present research involved interviews with program developers in order to identify specific components and variations. In addition, changes from the original program models which did not fit this component/variation framework were characterized herein as "reinvention," a concept independent from "fidelity."

Third, in fairness to Berman and McLaughlin, it should be recognized that these researchers did not attempt to confuse their "project-based" implementation measure with the concept of fidelity. However, as noted by Datta (1981), others have seized upon the RAND findings concerning the prevalence of "mutual adaptation, cooptation, and non-implementation" to support the dismantling of RD&D efforts. It is hoped that the present discussion will help rectify this confusion.

Given the differences between the RAND study and the present research, limits for generalizing from the results of these two studies should be made clear. The results of the present research should not be generalized beyond relatively explicit programs fitting the modified RD&D model to more loosely defined projects. Conversely, the results of the RAND study should not be generalized beyond loosely defined projects to more highly specified programs. This distinction between program types, coupled with the acceptably high fidelity results obtained by the present research, would argue that generalized support for dismantling the RD&D model is unwarranted.

Finally, it should be noted that although generally high levels of agreement were obtained between the telephone interview and site-visit methods, differences were observed as well. The general implication of these findings for the measurement of fidelity is that types of items which can obtain high levels of agreement between methods should be measured using the less expensive telephone interview method, while types of items which did not produce agreement should be measured during site-visits. This approach would save resources, since both telephone interviews and site-visits could be considerably shortened. Therefore, it would be advisable to perform a content analysis of high-agreement vs. low-agreement items to develop a typology of items for further use, as sketched below.
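A sketch of how such a screening might work, assuming matched telephone and site-visit codings per item; the 80% cutoff and the data below are illustrative only, not values from the study.

```python
# Per-item percent agreement between telephone and site-visit codings,
# used to split items into a "collect by telephone" set and a "reserve
# for site-visits" set. Inputs are (sites x items) arrays of variation
# codes; the threshold is an illustrative choice.
import numpy as np

def item_agreement(phone, visit):
    """Percent agreement per item across sites."""
    phone, visit = np.asarray(phone), np.asarray(visit)
    return (phone == visit).mean(axis=0)

def split_items(phone, visit, threshold=0.80):
    agree = item_agreement(phone, visit)
    keep_on_phone = np.where(agree >= threshold)[0]  # cheap telephone items
    needs_visit = np.where(agree < threshold)[0]     # site-visit items
    return keep_on_phone, needs_visit

# hypothetical codings for 6 sites x 4 items (I/A/U coded as 3/2/1)
phone = [[3, 2, 1, 3], [3, 3, 1, 2], [2, 2, 1, 3],
         [3, 2, 2, 1], [3, 2, 1, 3], [3, 3, 1, 2]]
visit = [[3, 2, 1, 1], [3, 3, 1, 3], [2, 2, 1, 2],
         [3, 2, 2, 3], [3, 1, 1, 2], [3, 3, 1, 1]]
print(split_items(phone, visit))  # item 3 never agrees -> needs site-visits
```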
Question 2: To What Extent is the Fidelity of Program Implementation Related to Program Effectiveness?

The relationship between fidelity and effectiveness was demonstrated in the present study to be moderately strong. A Pearson correlation of .38 was obtained between program fidelity (measured during site-visits) and effectiveness rankings (based on archival records). In general, this finding supports an important assumption of the modified RD&D approach: replication leads to effectiveness.

Question 2: Implications

Given the problems in the effectiveness data set which necessitated the ranking procedure to achieve comparability across sites, this major finding should be viewed with some caution. The two major flaws of the effectiveness data set were: (a) differences between sites and programs in the time periods covered by available archival data; and (b) differences between sites in the types of measures used. Examples of these flaws are the following: For most sites, fidelity was measured during the winter and spring of 1982. However, for three HOSTS sites, 1980-1981 school year data were provided; for ODOT, 1981 calendar year data were used. Thus program changes between 1981 and 1982 have an unknown effect on the fidelity-effectiveness correlation. As another example, archival data for the MCPRC program covered five different lengths of reporting periods, ranging from eight months to three years. The comparability of these data is therefore open to question. With regard to differences between types of measures, the following examples are illustrative: only four FOCUS sites reported achievement test scores; criteria for defining recidivism differed among CAP and MCPRC sites; and only three SCCPP sites had neighborhood-level data, with four sites reporting citywide data and three sites providing no data.

Despite these flaws, two points should be considered. First, recall that a high level of inter-coder reliability was achieved (corrected Spearman rho = .90). Second, it is noteworthy that no other study was identified in the literature review which utilized archival effectiveness data in the study of social program innovation. Those studies which have measured effectiveness in the context of social innovation research in public sector organizations have generally used global rating procedures based on the impressions of research staff (e.g., Pelz, 1983). In summary, although flaws in the effectiveness data set suggest treating the fidelity-effectiveness correlation with some caution, this finding should be given serious consideration, especially in light of the current policy trend to "throw the RD&D baby out with the bath water" (Datta, 1981).

Question 3: What is a Useful Definition, and What is a Useful Typology, for the Concept of Reinvention?

The research efforts devoted to the topic of reinvention were intended to be exploratory. Previous discussions in the literature (e.g., Rice & Rogers, 1979) did not define the concept of reinvention in relation to other constructs, let alone attempt to operationalize it. Therefore, the present study of reinvention was primarily concerned with developing a data-based conceptualization of reinvention. Reinvention was defined and categorized by content analyzing case transcripts. This analysis produced a definition and set of categories for reinvention which enabled highly reliable coding. Reinvention was defined as "the use of materials, activities, procedures, or organizational structures by organizations implementing modified RD&D model programs, that cannot be adequately explained using the framework provided by the developer-defined program components and variations." The typology of reinvention consisted of two major dimensions: Addition-Modification and Proactive-Reactive. Reactive instances of reinvention were further categorized as either internal or external to the site, depending on the source of the constraint(s) which led to the categorization of the instance as Reactive. Finally, each instance of reinvention, after being categorized, was assigned a weight using a three-point importance scale.
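The definition and typology lend themselves to a simple coding structure. The following sketch is one possible representation; the field and category names paraphrase the typology and are not taken from the study's actual coding forms.

```python
# A possible data structure for one coded instance of reinvention,
# mirroring the two major dimensions, the internal/external sub-coding
# for reactive instances, and the three-point importance weight.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Kind(Enum):
    ADDITION = "addition"
    MODIFICATION = "modification"

class Stance(Enum):
    PROACTIVE = "proactive"
    REACTIVE = "reactive"

class Source(Enum):            # coded only for reactive instances
    INTERNAL = "internal"
    EXTERNAL = "external"

@dataclass
class ReinventionInstance:
    description: str
    kind: Kind
    stance: Stance
    importance: int                     # 1 (trivial) to 3 (important)
    source: Optional[Source] = None     # None unless stance is REACTIVE

# hypothetical example, not a case from the study
example = ReinventionInstance(
    description="tutoring sessions extended to the summer term",
    kind=Kind.ADDITION,
    stance=Stance.PROACTIVE,
    importance=2,
)
```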
Descriptive data analyses of reinvention revealed that a moderate amount of reinvention occurred fairly equally across sites (median = six instances). However, many instances were rated as relatively unimportant. Additions were rated as more important than modifications, and the number of instances per site was only marginally correlated with the importance-weighted scores, indicating a weak positive relationship between the absolute "amount" of reinvention and the "extent" of reinvention as reflected by importance ratings. Descriptive analysis showed very few instances of reactive reinvention. Finally, the ODOT program was found to be an "outlier," with more modifications, and more trivial modifications, than other programs. No clear explanation for this result is apparent. One possibility is that the extensive technical assistance provided by the Center for Jury Studies contributed to encouraging modification, due to pro-adaptation messages communicated by consultants. The importance of this finding is clarified in the context of the relationship between addition-modification and effectiveness, discussed below (Research Question 4).

Question 3a: Are There Differences in Reinvention Among the Sample Programs?

Question 3b: Are There Differences Between the Educational and Criminal Justice Policy Areas on Reinvention?

Significant differences between programs occurred only when the number of instances per site was controlled (i.e., for average weighted scores). Similarly, only one of the three reinvention indices revealed differences across policy areas; in this case, differences were obtained on the absolute number of reinventions. However, differences (between programs or between policy areas) were not obtained on the index which is the most meaningful and unambiguous representation of extent of reinvention, i.e., the importance-weighted scores.

Questions 3, 3a, and 3b: Implications

As with the effectiveness data, the reinvention data set is flawed due to possible inconsistencies across sites. This point deserves some elaboration. The final conceptualization of reinvention was based on the content analysis of researchers' impressions which were tape-recorded immediately following each site visit. Unfortunately, there was a lack of consistency in the questions used to probe for details on reinvention. Secondly, it was not possible to check agreement among data sources, as had been done with the fidelity data. The inconsistency in probing and the inability to check data-source agreement created questions concerning the validity of two types of judgments: (a) judgments distinguishing proactive from reactive reinvention; and (b) judgments concerning the "boundaries" of reinvention "instances."
In short, limitations related to the exploratory nature of this part of the study flawed the data set. These concerns created less of a problem for the addition-modification dimension. Addition vs. modification coding decisions were based largely on the theoretical relationship of reinvention instances to the fidelity instrument, rather than on relationships among reinvention instances or case history information.

Despite its limitations, this study of reinvention has produced several contributions. In general, the definition and typology contribute a framework which organizes the previously ambiguous reinvention concepts. Secondly, the analyses which examined the relationship of reinvention to fidelity and effectiveness provide interesting hypotheses for future research.

Question 4: What are the Relationships Among Fidelity, Reinvention, and Effectiveness?

The results of this research showed the relationships among these three variables to be clearly positive. A strong correlation was obtained between fidelity and reinvention. Moderate correlations were obtained between fidelity and effectiveness, and between reinvention and effectiveness. The use of first-order partial correlations to control for third-variable variance showed the correlation between fidelity and effectiveness to diminish slightly with reinvention controlled, and the correlation between reinvention and effectiveness to diminish to a greater extent when the variance due to fidelity was controlled. Also, partial correlations showed that the moderate correlation between reinvention and effectiveness was likely due to the contribution of additions to program effectiveness, rather than the contribution of modifications.

Question 4: Implications

The clearly positive relationships among the variables represent an interesting set of findings. As discussed above, the relationship between fidelity and effectiveness is critical to the RD&D approach. In addition, the moderate relationship between fidelity and reinvention indicates the potential viability of a conceptualization which defines reinvention in terms that differ from simple "reductions in fidelity."

Given the pattern of zero-order correlations, the partial correlations involving fidelity and reinvention tested the hypotheses that (1) the fidelity-effectiveness relationship was actually independent from the effects of reinvention; (2) the reinvention-effectiveness relationship was independent from the effects of fidelity; (3) the addition-effectiveness relationship was independent from the effects of modification; and (4) the modification-effectiveness relationship was independent from the effects of addition.

The partial correlation analysis provided some support for hypotheses (1) and (3). Support can be claimed on two counts. First, the reinvention-effectiveness correlation (with fidelity controlled) was diminished by partialling to a greater extent than the fidelity-effectiveness correlation with reinvention controlled. An examination of the statistical significance of the correlations before and after partialling showed that the fidelity-effectiveness correlation remained significant after controlling the variance due to reinvention (change from p = .001 to p = .019), while the reinvention-effectiveness correlation became nonsignificant at the .05 level after controlling the variance due to fidelity (change from p = .004 to p = .092). This finding supports the hypothesis that fidelity can contribute to effectiveness independently from reinvention.
At the same time, the strong fidelity-reinvention zero-order correlation suggests that fidelity may also lead to reinvention. The nature of the two variables makes this causal direction more likely than the reverse direction, although bidirectional causality is entirely possible. (That is, programs are likely to be implemented with some correspondence to the original model before needs and opportunities for reinvention become apparent. Reinvention could then potentially influence the level of fidelity.) Second, addition was positively related to effectiveness with modification controlled, while modification was not related to effectiveness when the variance contributed by addition was controlled. This supports the hypothesis that positive reinvention leads to effectiveness.

Despite these points, the evidence on these issues remains somewhat ambiguous for the following reasons:

1. The absolute decrease in the reinvention-effectiveness correlation (with fidelity controlled) equals .16 (.33 - .17), while the absolute decrease in the fidelity-effectiveness correlation (with reinvention controlled) equals .12 (.38 - .26). The difference between the two absolute decreases equals only .04. Also, note that after controlling variance due to fidelity, the first-order reinvention-effectiveness correlation of .17 was still significant at the .10 level. In sum, these relationships are not sufficiently strong to claim unambiguous confirmation of a model which has fidelity making a clear contribution to effectiveness independent from the contribution of reinvention. The best that can be said is that this evidence does not disconfirm such a model.

2. The partial correlation analyses of the addition-modification dimension and effectiveness did produce evidence that the positive relationship between reinvention and effectiveness was due to the effects of addition rather than modification. Further, it is reasonable to assume that there is a positive relationship between addition and positive reinvention, and the data of Table 5 for the most part bear this out. However, the data also indicate that the relationship between addition-modification and effectiveness is not simple, and bears further study. First, the strongly negative relationship between addition and effectiveness for the HOSTS program suggests that some types of additions are negative reinventions, and that these types are program-specific. Second, the data in Table 5 suggest a contrast between the positive modification introduced by some programs (e.g., the HOSTS and CAP programs) and the negative modification introduced by others (e.g., ODOT and EBCE). Third, the outlier nature of ODOT weighted these results to some extent, contributing heavily to the negative modification-effectiveness relationship. Finally, the program-by-program partial correlations are based on small sample sizes. These data are quite subject to sampling error effects, and can only suggest possible relationships. However, the addition-modification partial correlation with effectiveness remains interesting, and the relationships between modification-addition and positive-negative reinvention deserve clarification. This could be accomplished by coding instances of reinvention as positive or negative in future research, and introducing this dimension into analyses; and by obtaining larger samples for each program.
Future Research

The provisional support which the present study provides for the viability of the modified RD&D approach argues in favor of continuing this line of research. Given the methodological flaws identified in this chapter, a first step in this direction would be replicating the present study with improvements made to correct these flaws. Such improvements would include (a) checking multiple data sources for agreement on reinvention data (to provide a test of validity) and collecting information on reinvention case histories more systematically, to better determine the existence and source of constraints; and (b) spending a greater proportion of time during the sampling phase in determining whether potential sites could provide effectiveness data which were equivalent within programs.

Other steps which could be taken to improve future research, based on lessons learned in this study, include content analyzing fidelity items which resulted in agreement between telephone interviews and site-visits, in order to develop a categorization of items for which information may be accurately collected over the telephone, thus saving resources; and spending a greater proportion of time during the measurement development stage on component/variation development for fidelity measurement, so that a category scheme spanning across programs could be employed. For example, a scheme might include such categories as Client Entry, Staff Selection and Training, Client Processing Procedures, Critical Staff Behavior towards Clients, Critical Administrative Behaviors, Materials and Facilities, etc. Rather than developing these categories a priori (cf. Leithwood & Montgomery, 1980), they should be based on the specific set of programs in the sample. The use of these categories would facilitate rational-empirical scaling of items (components) so that cross-program comparisons would be more meaningful.

Finally, two sets of research questions could be added to those addressed in the present study in replications of this research. These are: (a) Can sets of "core" fidelity components be identified which are more essential to achieving program effectiveness than other components? If so, are these "core" components similarly categorized across programs? Can developers identify these core components a priori? (b) Can instances of reinvention be reliably coded as "positive" (i.e., in the same direction as ideal variations) or "negative" (in the direction of unacceptable variations)? If so, do positive reinventions contribute to program effectiveness, and do negative reinventions detract from effectiveness?

These two sets of questions have important implications for program dissemination and implementation. Answering the first set of questions might enable disseminators and implementors to focus attention on core components, leading to more effective site adoptions. Similarly, attaining a greater understanding of positive vs. negative reinvention could potentially contribute to greater effectiveness at sites.

APPENDIX A

Examples of Components and Variations

Note. I = ideal, A = acceptable, U = unacceptable

Example #1: HOSTS Reading Program
N of components¹ = 54, N of 3-point components² = 22, N of 2-point components³ = 32

Tutors attend faithfully.
I. Tutors are faithful in attendance, achieving on the average at least 95% attendance rates (e.g., students are left without a tutor no more than 5% of the time).
A. Tutors on the average achieve an 80 to 95% attendance rate.
U. Tutors are not faithful in attendance, achieving an attendance rate of less than 80%.

Example #2: EBCE (Experience-Based Career Education)
N of components = 60, N of 3-point components = 30, N of 2-point components = 30

Career Site: Resource Person Commitment
I. Resource people are asked to make a specific commitment regarding the specific learning experiences offered at the career site.
A. Resource people are asked to make a more general commitment regarding the general kinds of learning experiences offered at the career site.
U. Resource people are not asked to make any commitment regarding the learning experiences offered at the career site.

Example #3: FOCUS (Alternative School-Within-a-School Program)
N of components = 103, N of 3-point components = 84, N of 2-point components = 19

Hourly attendance is taken.
I. Hourly attendance is taken for all students.
A. Hourly attendance is taken only for those students who the teacher feels are an attendance problem.
U. Hourly attendance is not taken for any students.

Example #4: ODOT (One-Day One-Trial)
N of components = 36, N of 3-point components = 23, N of 2-point components = 13

Panel size is kept small.
I. Panel size is between 14 and 18 jurors.
A. Panel size is between 18 and 30 jurors.
U. Panel size is over 30 jurors.

Example #5: CAP (Community Arbitration Project)
N of components = 59, N of 3-point components = 36, N of 2-point components = 23

Youth given choice of continuing arbitration.
I. Youth is given a choice of whether or not to continue arbitration. It is explained that this is a real choice.
A. Youth is given a choice of whether or not to continue arbitration. However, it is implied that if the youth does not continue arbitration the case will go to court.
U. Youth is not given a choice of whether or not to continue arbitration.

Example #6: SCCPP (Seattle Community Crime Prevention Program)
N of components = 59, N of 3-point components = 36, N of 2-point components = 21

Focus is on residential crime.
I. Focus of crime prevention activity is entirely on residential burglary.
A. Focus is primarily on residential burglary and neighborhood crime/community issues.
U. Residential burglary is but one of many components in a comprehensive crime prevention package.

Example #7: MCPRC (Montgomery County Pre-Release Center)
N of components = 85, N of 3-point components = 56, N of 2-point components = 29

Pre-referral briefing of potential residents.
I. All potential residents are briefed about the PRC while still in prison, jail, or prior to entry (e.g., as a condition of probation).
A. All potential residents are provided with material about the PRC while still in prison.
U. Potential residents are not briefed or don't receive materials while still in prison.

¹Total number of components
²Number of components scaled with 3 points: Ideal, Acceptable, Unacceptable
³Number of components scaled with 2 points: Ideal/Acceptable, Unacceptable
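Scoring a site against component/variation lists like these reduces to mapping each coded variation onto a numeric scale and averaging across items. The sketch below assumes I = 3, A = 2, U = 1 for three-point components and a collapsed I/A = 2, U = 1 for two-point components; the dissertation reports average-item scores without restating the numeric coding in this appendix, so these values are an assumption.

```python
# Sketch of average-item fidelity scoring. The numeric values assigned to
# I/A/U codes are assumed, not taken from the study's scoring manual.

THREE_POINT = {"I": 3, "A": 2, "U": 1}
TWO_POINT = {"I": 2, "A": 2, "U": 1}   # ideal/acceptable collapsed

def average_item_score(codings, scale_by_component):
    """codings: component -> 'I'/'A'/'U'; scale_by_component: component -> 2 or 3."""
    scores = [
        (THREE_POINT if scale_by_component[c] == 3 else TWO_POINT)[v]
        for c, v in codings.items()
    ]
    return sum(scores) / len(scores)

# hypothetical two-component site, both components on three-point scales
site = {"tutors_attend_faithfully": "A", "panel_size_small": "I"}
scales = {"tutors_attend_faithfully": 3, "panel_size_small": 3}
print(average_item_score(site, scales))  # 2.5
```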
Recall that "importance of reinvention" for each instance was rated on a three-point scale. Since the greatest number of instances recorded was 21, the potential maximum weighted instances score would be 3 x 21, or 63. This serves as a reference point for the observed distribution. This distribution ranged from zero to 36, far short of the potential range. The median of the distribution was 7.00, and the mean was 8.63. The disparity of these two statistics indicated a positive skew (i.e., a larger proportion of cases fell below the mean than above the mean). Forty-five sites (64%, or almost two-thirds of the sites) scored nine (approximation to the mean of 8.63) or less. Average Weighted Instances. These scores had a potential range from 0.00 to 3.00. The observed range of the distribution was 0.0 (four occurrences) to 3.00 (only one occurrence). The median of the distirbution was 1.20, and the mean was 1.26. Thus, the distribution was slightly positively skewed, with 39 sites (56%) scoring less than the mean. The mode of the distribution was 1.0 (the lowest scale point with the exception of zero) with 20 occurrences. One might hypothesize 110 111 that absolute frequency of instances and importance are related; i.e., that sites having many reinventions tend to also have important reinventions. However, the distribution of the modal scores was not a function of the number of instances per site alone; sites with as many as 9, 10, and 12 instances had average weighted scores as low as 1.0. More precisely, the correlation between the sum of instances and the average weighted instances indicated a small, positive relationship (r = .188, NS). In sum, there was a slight relationship between the occurrence of reinvention and its importance; a site with many reinventions was only slightly likely to have important reinventions. Sum of Modifications. The distribution of unweighted reinvention instances which were coded as modifications had a median and mode of 2.00, a mean of 3.04, and a range of 0.0 to 11.0. The distribution was positively skewed with 46 of 70 sites (66%) having three or fewer instances coded as modifications. The distribution of importance-weighted sums of modification had a median of 3.00, a mean of 3.80, and a range of 0.0 to 17.0. The distribution was similar in its positive skew to the unweighted scores, with 46 of 70 sites (66%) having a weighted score of 4.0 (approximation to the mean) or less. This similarity indicated that most of the modifications were rated as relatively unimportant. Sum of Additions. The distribution of unweighted instances coded as "additions" had a median of 3.00 and a mean of 3.49. The range of the distribution was 0.0 to 14. Fifty-four of the 70 sites (77%) had four or fewer additions. The distribution of additions thus somewhat resembled the distribution of modifications. 112 For the importance-weighted additions, the median was four and the mean was 4.84. The distribution ranged from 0.0 to 25, and 49 of the 70 sites (70%) scored lower than the mean. Note that additions were rated as somewhat more important than modifications. Sum of Proactive Instances. The median of this distribution was 4.0, the mean was 5.03, and the range was 0.0 to 19. The scores were positively skewed, with 43 of the sites (61%) having five or fewer instances which were coded as proactive. The distribution of importance-weighted proactive scores had 37 of the 70 sites (53%) scoring at the mean of 6.37 or lower. Sum of Reactive Instances. 
This distribution exhibited an extreme positive skew, with a mode of 0.0 and a median of 1.0. The mean was 1.5. There were 23 sites (33%) which had no instances coded as reactive, and 19 sites (27%) which had only a single reactive instance. The range of the distribution was 0.0 to 6.0. The high frequency of sites with no reactive instances meant that some similarity between the distributions of weighted and unweighted scores must occur, and a discrepancy between distributions is in fact not evident when examining scores below the mean. Yet when importance weights were introduced, 20 sites (30%) scored 4.0 or higher, while only five sites (7%) had at least 4.0 (unweighted) reactive instances. Comparing these results to the previously described proactive scores, note that importance ratings had somewhat of a greater effect on reactive when compared to proactive reinvention.

It should be noted that the distributions of proactive and reactive instances partly reflect the quality of the data and the decision rules for coding. This dimension was more sensitive to the post hoc coding method than the addition-modification dimension, since coding decisions regarding proactive vs. reactive coding required case history information, while addition vs. modification decisions could be made largely on the basis of the relationship of the instance to the program's components and variations. This issue is discussed in greater detail in the Discussion section.

Sum of Internal Reactive Instances. Given the high frequency of sites which had no occurrences of reactive reinvention, it was not surprising to find 50 of the 70 sites (71%) with no instances coded as internally reactive. The median and mode of the distribution were thus zero. Only four sites (6%) had more than one instance of internal reactive reinvention.

Sum of External Reactive Instances. This distribution was also characterized by a large number of sites which did not have any coded instances. There were 35 sites with no instances coded as externally reactive (50%), and thus both the mode and median were again zero. Twenty-six sites (37%) had more external than internal instances.

APPENDIX C

Descriptive Analyses of Effectiveness Data: Types of Data, Summary Statistics, and Comparisons to Demonstration Sites

Help One Student to Succeed (HOSTS)

Types of Data. Obtaining effectiveness data for this program was relatively straightforward. All sites received Title I funding, which required submission of annual evaluation reports. These reports all contained Normal Curve Equivalent (NCE) gain scores, although a variety of reading achievement tests were used. These tests included the California Achievement Test, the Gates-MacGinitie Reading Tests, the Metropolitan Achievement Test, the Iowa Test of Basic Skills, the Stanford Achievement Test, the Woodcock Reading Mastery Tests, and the California Test of Basic Skills. Recent research has demonstrated "fairly similar estimates of gains and pretest scores" across various achievement tests used to implement the Title I Evaluation and Reporting System (Thompson & Novak, 1981, p. 126), thus justifying the treatment of different tests as comparable data.

The use of change scores is generally inadvisable due to the regression-towards-the-mean effect (Campbell & Stanley, 1966).
However, their use was considered permissible for the purpose of this research, since they were to be employed as the basis for the ranking procedure, for which relatively gross estimates of effectiveness were sufficient. In addition, NCE gain scores were readily available, while the analysis of regressed post-scores would have entailed considerable additional cost.

All HOSTS evaluations were annual reports. Seven of the ten evaluations were contiguous to the site-visit period (pre-measured in spring, 1981; post-measured in spring, 1982). The remaining three evaluations covered the previous year (pre: 1980; post: 1981).

Summary Statistics. Gain scores are reported in NCE (Normal Curve Equivalent) units. The NCE scale is like a percentile scale in that it has values of "1" and "99" at the extremes, and "50" is the midpoint of the distribution. However, the NCE scale is different in that it is an equal-interval scale. The overall mean NCE gain score across sites was 10.24, with a standard deviation of 5.60 and a range of 2.31 to 19.74. This sample mean gain was similar to the demonstration evaluation gain of 14 NCE's. Three of the sample sites exceeded the 14 NCE gain score.
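Because the NCE scale is a normal-equivalent scale (mean 50, SD 21.06) pinned to the percentile scale at 1, 50, and 99, percentile ranks can be converted to NCEs before computing gains. A brief sketch follows; the conversion is standard, and the example percentiles are illustrative.

```python
# Percentile <-> NCE conversion. The NCE scale is a normal-equivalent
# scale with mean 50 and SD 21.06, which pins 1, 50, and 99 to the
# matching percentile ranks while remaining equal-interval.
from statistics import NormalDist

ND = NormalDist()

def percentile_to_nce(pct):
    return 50.0 + 21.06 * ND.inv_cdf(pct / 100.0)

def nce_gain(pre_pct, post_pct):
    return percentile_to_nce(post_pct) - percentile_to_nce(pre_pct)

print(round(percentile_to_nce(50), 1))  # 50.0
print(round(percentile_to_nce(99), 1))  # 99.0 (21.06 = 49 / z at the 99th)
print(round(nce_gain(23, 37), 1))       # equal-interval gain, unlike a raw
                                        # percentile difference
```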
The NM Test contains 20 items. Sample items appear below.

1. Rachel, a 9th grader, is considering entering the high school program in printing because she heard that it is a good trade, but she would like to know more about it. Which is the least appropriate way to find out about it?
A. Arrange to watch a printer all day
B. Read about occupations in printing
C. Watch a film about printing
D. Learn about printing after she is in the program

2. Ramon wants to find out how good he is at carpentry relative to other fellows his age. What would be his most accurate source of information?
A. A job sample on carpentry
B. His shop teacher
C. His scores on a national carpentry test
D. The opinion of his friends

The CES was reported by NWRL to have been used to evaluate 85 sites by the time of this research, with an alpha coefficient of .93 (personal communication, Dr. Thomas R. Owens, 1981). The NM Career Planning Test was described by evaluators from all the Regional Labs as highly reliable and an excellent evaluation instrument. However, no measurement documentation was available from the evaluators, and a review of the testing literature (McCaslin et al., 1979) and an attempt to contact the test developers also failed to produce documentation. The high degree of consensus among the lab evaluators nevertheless prompted use of the NM instrument in this research.

The two instruments were completed by students enrolled in EBCE programs at the time of the site visits. They were administered during the latter third of the school semester.

Summary Statistics. The Career Exploration Scale was scored on a five-point scale ranging from "Never" to "Frequently" (see the sample items above). For the present sample, the mean was 3.46, with a standard deviation of .340. The New Mexico Career Planning Test was scored according to a scoring key provided by Educational Testing Service, Princeton, New Jersey. Scores per site represented the mean number correct. The mean across sites was 3.68, with a standard deviation of 1.01.

FOCUS Alternative Education Program

Types of Data. The evaluation of the original FOCUS model had examined the following outcome variables: achievement test scores, pupil attitudes toward school, school suspensions, disciplinary referrals, grade point average, self-concept, court referrals, and police and sheriff contacts. Inquiries conducted during the telephone interview phase revealed that implementing sites could provide the following indices: attendance; GPA; grades in reading, math, and language arts (especially relevant, since the program stresses basic skills); and achievement test scores. Although attendance data per se were not reported in the original program evaluation, it was felt that attendance would serve as an adequate proxy for measures that were not readily available at most sites (i.e., pupil attitudes toward school and court/police contacts) or that were relatively rare events (i.e., suspensions and referrals).

Change scores were again utilized. Since length of stay in the program differed across pupils within sites, pre-periods (semesters) were determined for each pupil according to his or her date of entry into the program, as illustrated in the sketch below. In all cases, an attempt was made to obtain, as post-data, end-of-semester records corresponding to the semester during which the site visit occurred.
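The per-pupil pre-period rule can be made concrete with a small sketch. It is entirely hypothetical (the study's pre-periods were determined by hand from program records), and it assumes the semester boundaries implied by the text, with Fall running September through January and Spring running February through June.

    # Hypothetical sketch of the pre-period rule: a pupil's pre-semester
    # is the last complete semester before his or her entry date.
    # Semesters are labeled (year, term); Fall = Sept-Jan, Spring = Feb-June.
    def pre_semester(entry_month, entry_year):
        """Return (year, term) of the last full semester before entry."""
        if entry_month >= 9:       # entered during Fall -> pre is that year's Spring
            return (entry_year, "Spring")
        elif entry_month >= 2:     # entered during Spring -> pre is the prior Fall
            return (entry_year - 1, "Fall")
        else:                      # January is still Fall -> pre is the prior Spring
            return (entry_year - 1, "Spring")

    # A pupil entering in October 1981 has pre-period Spring 1981; the
    # post-data are the end-of-semester records for the site-visit semester.
    print(pre_semester(10, 1981))  # -> (1981, 'Spring')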
In addition to semester data, an attempt was also made to obtain pre- and post-annual data, since the program developer has stressed the importance of allowing sufficient time for the program's effects to influence pupil behavior. The pre-post periods covered were the following: six sites provided 9/80-1/81 semester pre-data and 9/81-1/82 semester post-data; of these, two sites also provided annual data for appropriate pre-periods and 6/82 year-end scores. Three additional sites provided annual data only, also for appropriate pre-periods and 6/82 year-end scores. One site failed to provide any data. In assigning ranks, semester data were used for the six sites which provided pre-post scores, and annual data were used for the remaining three sites.

Summary Statistics. Attendance information was provided by seven sites. Since data were computed for different time periods across sites, they were converted to percentage change scores (the formula is given at the end of this subsection). The mean change in absences was -95%. However, there was a great deal of variance in these scores, which ranged from -85% to +64%. Three sites reported positive changes, four sites reported negative changes, and three sites did not make attendance data available.

Information on pupil grades was provided by five sites. Since different indices were used, these scores were also converted to percentage change scores. The mean change was +86%. The variance was again quite high, with scores ranging from -10% to +334%. When compared with the demonstration site evaluation, the mean change of the sample compared quite favorably; however, this was largely due to two extreme scores (+334% and +102%) in the sample. The demonstration evaluation was performed at the two original FOCUS sites, where the mean gains were +11% and +17%, respectively. Only the two extreme sample scores previously cited exceeded +17%.

Although grades in reading, mathematics, and language arts were not reported in the original evaluation, they were utilized in the present study where possible to compensate for missing data (i.e., missing overall GPAs) when assigning ranks. Four sites provided both reading and math grades, while three provided language arts grades. Means and ranges for the percentage change scores were the following: Reading: mean = +4.3%, range = -12% to +10%; Math: mean = +14.2%, range = -1.0% to +22.9%; Language Arts: mean = +32.1%, range = +6.5% to +80%.

Achievement test scores were provided by four sites. Three different tests were used (the Iowa Test of Basic Skills, the California Test of Basic Skills, and the Metropolitan Achievement Test). Percentage change scores were again employed to assign ranks. The mean gain score across the four sites was +11.4%, with a range of 4.5% to 16%. The mean gain for the present research sample was similar to the demonstration evaluation, which reported a +9% gain for one site and a 1% gain for the second site. (The Iowa Test was used at both sites.) Three of the four sample sites reported gains exceeding 9%.

In sum, there was a great deal of variance across sites on all measures. Comparisons to the two demonstration sites were made on two measures: GPA and achievement test percentage gain scores. The first measure showed greater change for the present research sample, although this was due to extreme scores at two of the five sites. The second measure revealed rough similarity between the two samples.
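For explicitness, the percentage change scores used throughout this subsection (and for the citywide burglary change scores below) follow the conventional formula; the site records give only the converted scores, so this is stated for the reader rather than quoted from any report:

\[ \%\Delta = 100 \times \frac{x_{\mathrm{post}} - x_{\mathrm{pre}}}{x_{\mathrm{pre}}} \]

For example, a site whose absences fell from 200 in the pre-semester to 150 in the post-semester would score 100 x (150 - 200)/200 = -25%.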
When considering the quality of these comparisons, it should be recalled that only half the research sample (five sites) provided GPAs, and only four sites provided achievement test scores.

One-Day One-Trial (ODOT)

Types of Data. This program has three major goals: (a) to increase the efficiency of juror processing; (b) to make juries more broadly representative across socio-economic and demographic variables; and (c) to improve juror attitudes toward the courts.

Efficiency of juror processing is a multivariate concept, and an understanding of the basic processing system is necessary to appreciate its various sub-concepts. The basic sub-processes around which the system's efficiency revolves are the postponement/excusal process and the voir dire hearing flow. The first refers to the rules and mechanisms for postponing or excusing an individual's juror service, while the second refers to the rules and mechanisms governing the flow of individuals to and from the voir dire hearing. (During this hearing, jurors are selected for trial service. They may be rejected by prosecuting or defense attorneys because of their possible predispositions toward a guilty or innocent verdict, or excused for other reasons.)

A wide variety of input-output ratios have been used by courts to quantify juror flows. Preliminary information obtained during the telephone interviews indicated that three of these indices were most commonly used among the present sample (they are written out formally at the end of this subsection): (a) Juror Days Per Trial (JDPT; the number of juror days served divided by the number of trials); (b) the Juror Usage Index (JUI; juror days served divided by the number of trial days); and (c) People Brought In (PBI; the number of jurors reporting divided by the number of trials begun).

These ratios measure different aspects of input-output efficiency. The ideal system accurately estimates the number of jurors needed on any given day, based on the number of trials to be begun that day. This involves not only efficient juror handling (i.e., notification, entry, orientation, and voir dire assignment), but also efficient communication with judges regarding trial needs, and an appreciation on the part of judges of the jury system's needs.

Summary Statistics. For all sites, data for the 1981 calendar year were utilized. Data used to compute the three efficiency ratios (JDPT, JUI, and PBI) were provided by all ten sites. The means across the ten sites were: mean (JDPT) = 62.13; mean (JUI) = 23.4; mean (PBI) = 36.34. Standard deviations for the three indices were: SD (JDPT) = 22.8; SD (JUI) = 7.3; SD (PBI) = 13.74. Ranges were as follows: JDPT, 12.17 to 92.05; JUI, 9.13 to 36.10; PBI, 14.35 to 62.58.

JDPT and JUI figures covering six-month pre- and post-periods were reported in the evaluation of the original demonstration site: pre (JDPT) = 49.3; post (JDPT) = 36.9; pre (JUI) = 16.3; post (JUI) = 11.1. Given these figures, it appears that the research sample in general was not as effective as the demonstration site; on both measures, only one research sample site scored better than the demonstration site.
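As noted above, the three efficiency indices can be written out formally. Lower values indicate more efficient use of jurors, which is why the demonstration site's pre-to-post declines were read as improvement:

\[ \mathrm{JDPT} = \frac{\text{juror days served}}{\text{trials}} \qquad \mathrm{JUI} = \frac{\text{juror days served}}{\text{trial days}} \qquad \mathrm{PBI} = \frac{\text{jurors reporting}}{\text{trials begun}} \]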
Community Arbitration Project (CAP)

Types of Data. The telephone interviews revealed that three measures were most likely to be available at the sample sites: (a) the percent of youths recidivating; (b) the percent of community assignments completed; and (c) the percent of youths referred to the State's Attorney's Office as a result of the arbitration hearing.

The relevance of the first two measures to the effectiveness of the program should be fairly self-evident. As a proposed alternative to the existing juvenile justice system, the arbitration program should result in low percentages of youths committing further offenses, and high percentages of community service assignments completed. The importance of the third measure relates to the program's goal of successfully targeting those youths whose offenses are sufficiently minor to warrant diversion to the arbitration program. Thus, low rates of State's Attorney referrals are sought.

There was considerable variation among sites in the types and quality of data available. Six sites had collected recidivism data; however, the time periods for these data varied from six months to three years, and the criteria for defining recidivism also varied. Time periods for the other two measures likewise varied, in this case from three months (the most recent quarter) to one year. There was also considerable variation in the availability of these two measures, with eight sites providing information on community assignments and seven sites providing State's Attorney referral data. Given these variations, it was decided to use "the best data available for each site" as the basis for ranking. Thus, annual data were used when available.

Summary Statistics. Means and standard deviations for effectiveness data across sites were the following: percent recidivating: mean = 11.6%, SD = 4.2% (data provided by six sites); percent completing community assignments: mean = 84.6%, SD = 8.0% (data provided by eight sites); percent returned to the State's Attorney's Office as a result of the arbitration hearing: mean = 12.5%, SD = 14.0% (eight sites provided data).

The mean recidivism rate for the research sample (11.6%) was quite similar to the rate reported in the demonstration site evaluation (9.8%). However, the research sample was somewhat inferior to the demonstration site with regard to the percent of youths returned to the State's Attorney's Office (12.5% vs. 7.2%). Data on successful completion of community assignments were not reported in the demonstration site evaluation.

Seattle Community Crime Prevention Program (SCCPP)

Types of Data. Organizations implementing crime prevention programs face an evaluation problem above and beyond the normally difficult obstacles associated with criminal justice evaluation: the difficulty of collecting ongoing data on program "clients." Implementers of the other innovative programs included in the present study generally collected effectiveness data as part of normal program operations. Police departments, however, normally collect data on a citywide basis, or at levels of analysis not necessarily contiguous with the geographic units in which crime prevention programs are implemented. These units are usually referred to as "neighborhoods," and their geographic definition is for the most part ambiguous. Also, participation in crime prevention programs is constantly shifting, which further exacerbates the evaluation problem. The original site dealt with this problem by (a) targeting census tracts; (b) seeking to "saturate" census tracts with services; (c) administering victimization surveys to block club participants; and (d) keeping accurate and up-to-date records on block club membership. Unfortunately, few adopters used such a data-based and focused approach. Three of the 10 site-visited programs failed to provide any data whatsoever.
Of the seven programs providing data, only three had information specific to the served areas as well as citywide data; the remaining four were able to provide citywide statistics only. (One of these sites provided unaggregated data on three large neighborhoods claimed to have been "covered" by the program. However, investigation revealed that many residents of those neighborhoods had not participated in the program, and the data were therefore treated as an estimate of "citywide" statistics.) The three sites which had collected data specific to the served areas had also collected data on control neighborhoods, although in contrast to the original site evaluation, no random assignment was utilized at these three sites.

Pre-post data were available for all sites, including those which had citywide data only. All data reflected burglary rates, but there was variation across sites in the type of burglary statistics available: four sites provided "residential burglary rates," one site provided "breaking and entry" (B&E) only, one site provided "burglaries and entries," and one site provided "burglaries, B&E."

Summary Statistics. The three burglary rates for the sites reporting data on the neighborhoods covered by the program were .4%, .2%, and 5.9%. Two of these sites were superior to the demonstration site, which reported a rate of 2.43% for covered neighborhoods in one survey and a rate of 5% in a second survey. (These results are difficult to compare, since the two demonstration site surveys covered different time periods [yearly vs. six months] for different years [1975 vs. 1976] and used different criteria for inclusion.)

For the citywide burglary rate change-score data, the mean across the seven sites providing data was -69.1%. The range was -92% to +26.5%, with five sites reporting decreases and two sites reporting increases. The site reporting -92% was the site which had provided unaggregated data for the three neighborhoods claimed to be covered by the program; these data were not taken at face value during the ranking procedure, but were weighted by the site's loss of credibility.

Montgomery County Pre-Release Center (MCPRC)

Types of Data. Information gathered during the telephone interviews indicated that the types of effectiveness-related data most likely to be provided by sites were the following: percent of residents successfully completing the program; recidivism rates for those successfully completing the program; percent of residents presently employed; amount of restitution paid per resident; savings per resident; amount of family support provided per resident; and reimbursement to the program per resident. As was the case with the Crime Prevention Program, there was variation across sites in the time periods covered: five different reporting periods were provided, ranging from eight months to three years.

Summary Statistics. The mean percentage of successful program completion (i.e., not revoked from the program) across the seven sites reporting was 66.6%, with a range from 41% to 94%. This compared to the demonstration site percentage of 74.3%; three of the sample sites had percentages of 74% or better.

The mean recidivism rate for the sample (six sites reporting) was 16%, with percentages ranging from 3% to 22%. This compared favorably with the demonstration site evaluation, which reported a rate of 22.2%; thus all sample sites reported lower recidivism rates than the demonstration site.
With regard to employment, a mean percentage of 76% employed was achieved by the sample sites (six sites reporting), with percentages ranging from 41% to 100%. This compared to the demonstration site's figure of 93%; two sample sites had percentages of 93% or better.

The average amount of restitution across the five sites reporting (with site used as the unit of analysis) was $40.61. The average amount of savings on release, as reported by five sites, was $369.95. It is difficult to compare these figures with the demonstration evaluation, since it reported only ranges such as "$50 or less" and "$250 or more." Average family support across the four sites reporting was $118.79, and the average amount reimbursed to the program by residents across four reporting sites was $160.76.

REFERENCES

Berman, P. (1980). Thinking about programmed and adaptive implementation: Matching strategies to situations. In H.M. Ingram & D.E. Mann (Eds.), Why policies succeed or fail. Beverly Hills, CA: Sage Publications.

Berman, P. (1981). Educational change: An implementation paradigm. In R. Lehming & M. Kane (Eds.), Improving schools: Using what we know. Beverly Hills, CA: Sage Publications.

Berman, P., & McLaughlin, M.W. (1977). Federal programs supporting educational change: Factors affecting implementation and continuation (R-1589/7-HEW). Santa Monica, CA: Rand Corporation.

Berman, P., & McLaughlin, M.W. (1978). Federal programs supporting educational change: Implementing and sustaining innovations (Final Report, R-1589/8-HEW). Santa Monica, CA: Rand Corporation.

Boruch, R.F., & Gomez, H. (1977). Sensitivity, bias, and theory in impact evaluation. Professional Psychology, 8(4), 411-433.

Calsyn, R., Tornatzky, L.G., & Dittmar, S. (1977). Incomplete adoption of an innovation: The case of goal attainment scaling. Evaluation, 128-130.

Campbell, D.T., & Stanley, J.C. (1966). Experimental and quasi-experimental designs for research. Chicago, IL: Rand McNally.

Crandall, D. (1979). A study of dissemination efforts supporting school improvement: Final study design. Andover, MA: The Network.

Cyert, R.M., & March, J.G. (1963). A behavioral theory of the firm. Englewood Cliffs, NJ: Prentice-Hall.

Datta, L.E. (1981). Damn the experts and full speed ahead: An examination of the study of federal programs supporting educational change, as evidence against direct development and for local problem-solving. Evaluation Review, 5(1), 5-32.

Emrick, J.A., Peterson, S.M., & Agarwala-Rogers, R. (1977, May). Evaluation of the National Diffusion Network, Vol. 1: Findings and recommendations (SRI Project 4385). Washington, D.C.: U.S. Office of Education, Department of Health, Education, and Welfare.

Eveland, J.D., Rogers, E., & Klepper, C. (1977, March). The innovation process in public organizations: Some elements of a preliminary model. Springfield, VA: NTIS.

Farrar, E., deSanctis, J.E., & Cohen, D.K. (1979). Views from below: Implementation research in education. Cambridge, MA: Huron Institute.

Fullan, M., & Pomfret, A. (1977). Research on curriculum and instruction implementation. Review of Educational Research, 47(2), 335-397.

Gephart, W.J. (1976, April). Problems in measuring the degree of implementation of an innovation. Paper presented at the annual convention of the American Educational Research Association.

Hall, G.E., & Loucks, S.F. (1978, March). Innovation configurations: Analyzing the adaptation of innovations. Paper presented at the annual meeting of the American Educational Research Association, Toronto.

Hall, G.E., & Loucks, S.F. (1981, April). The concept of innovation configurations: An approach to addressing program adaptation. Paper presented at the annual meeting of the American Educational Research Association, Los Angeles.
Havelock, R.G. (1976). Planning for innovation through dissemination and utilization of knowledge. Ann Arbor: University of Michigan.

Heck, S., Stiegelbauer, S., Hall, G.E., & Loucks, S.F. (1981). Measuring innovation configurations: Procedures and applications. Austin, TX: Research and Development Center for Teacher Education, The University of Texas.

House, E.R. (1975). The politics of educational innovation. Berkeley, CA: McCutchan.

House, E.R. (1981). Three perspectives on innovation: Technological, political, and cultural. In R. Lehming & M. Kane (Eds.), Improving schools: Using what we know. Beverly Hills, CA: Sage Publications.

House, E.R., Kerins, T., & Steele, J.M. (1972). A test of the research and development model of change. Educational Administration Quarterly, 8(1), 1-14.

Larsen, J.K., & Agarwala-Rogers, R. (1977). Reinvention of innovative ideas: Modified? Adapted? Or none of the above? Evaluation, 136-140.

Leithwood, K.A., & Montgomery, D.J. (1980). Evaluating program implementation. Evaluation Review, 4(2), 193-214.

Lippitt, R., Watson, J., & Westley, B. (1958). The dynamics of planned change. New York: Harcourt Brace Jovanovich.

March, J.G., & Simon, H.A. (1958). Organizations. New York: John Wiley and Sons.

McCaslin, N.L., Gross, C.J., & Walker, J.F. (1979). Career education measures: A compendium of evaluation instruments. Columbus, OH: The National Center for Research in Vocational Education.

Mohr, L.B. (1978). Process theory and variance theory in innovation research. In M. Radnor, I. Feller, & E. Rogers (Eds.), The diffusion of innovation: An assessment. Evanston, IL: Northwestern University Press.

Nunnally, J.C. (1978). Psychometric theory. New York: McGraw-Hill.

Owens, T. (1981). Northwest Regional Educational Laboratory. Personal communication.

Owens, T.R., & Haenn, J.F. (1977, April). Assessing the level of implementation of new programs. Paper presented at the annual meeting of the American Educational Research Association.

Pelz, D.C. (1983). Quantitative case histories of urban innovations: Are there innovating stages? IEEE Transactions on Engineering Management, EM-30(2), 60-67.

Raizen, S.A. (1979). Dissemination programs at the National Institute of Education: 1974 to 1979. Knowledge: Creation, Diffusion, Utilization, 1(2), 259-292.

Rice, R.E., & Rogers, E.M. (1980). Reinvention in the innovation process. Knowledge: Creation, Diffusion, Utilization, 1(4), 499-514.

Rogers, E.M., & Shoemaker, F.F. (1971). Communication of innovations: A cross-cultural approach. New York: Free Press.

Rohrbaugh, J., & Quinn, R. (1980). Innovation and organizational performance: A study of the implementation and routinization of a new information technology. Grant proposal, National Science Foundation, Policy Research and Analysis Division. Available from the authors, State University of New York at Albany, Graduate School of Public Affairs.