A MODEL FOR CRITERION-REFERENCED MEASUREMENT AND A COMPARISON OF ITEM ANALYSIS PROCEDURES, presented by Susan K. Thrash, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Education.

A MODEL FOR CRITERION-REFERENCED MEASUREMENT AND A COMPARISON OF ITEM ANALYSIS PROCEDURES

By

Susan Kaye Thrash

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling and Personnel Services and Educational Psychology

1977

ABSTRACT

A MODEL FOR CRITERION-REFERENCED MEASUREMENT AND A COMPARISON OF ITEM ANALYSIS PROCEDURES

By Susan Kaye Thrash

The first purpose of this study was to propose a theoretical conception of criterion-referenced testing and to explain two basic item analysis techniques (Cox and Vargas, C-V, and Roudabush, R) theoretically with respect to this general model. The second purpose was to determine the adequacy of the C-V and R procedures using the theoretical model. The final purpose was to compare three item analysis techniques, the C-V, the R, and the Brennan and Stolurow (B-S), using real data.

A theoretical model for criterion-referenced testing was proposed. The model includes 12 parameters that completely describe the pretest-posttest situation. The R and C-V indices can be explained in terms of this general model by making certain assumptions.

There were two parts to this study. The first part attempted to determine if the C-V and R indices adequately estimated the true values, if one technique estimated the true values better than the other, and if the C-V and R indices were better estimators of the true values for some parameter sets. These questions were considered by simulating data for 21 different sets of parameter values using
the model as the theoretical framework.

It was found that for R, when the assumptions were met, the technique provided a more stable and accurate estimate than when the assumptions were not met. It was also found that when the sample size was increased from 50 to 200, the stability and accuracy increased greatly. The C-V technique seemed to provide a reasonably accurate and stable estimate regardless of whether the assumptions were met. The estimates were more stable with larger sample sizes. Also, the C-V technique estimated the C-V true value better than the R technique estimated the R true value.

The second part of the study was designed to determine the comparability of the three item analysis procedures, R, C-V and B-S. C-V and R values were computed for 128 items, and B-S values were computed for 64 items. These items were testing 16 objectives from two subject areas, Mathematics and Reading, two grade levels, Middle and Upper, and two treatments, assigned objectives (treatment A) and selected objectives (treatment B). The major question to be answered was: do the C-V, R and B-S item analysis procedures provide comparable results? Three additional questions were also considered: 1. Are the three procedures more comparable for items in Mathematics than for items in Reading? 2. Does the comparability of the three procedures depend on the grade level? 3. Are the three procedures more comparable for items given in treatment A than for items given in treatment B?

The Pearson product moment correlation coefficient between the R and C-V indices was significantly different from zero (r = .80, p < .01). The point-biserial correlation coefficients between the B-S procedure and the C-V index and between the B-S procedure and the R index were also significantly different from zero (r = .70, p < .01 and r = .36, p < .01, respectively).
The separate analyses of the indices for each subject area, grade level and treatment indicated that the indices were more comparable for Mathematics than for Reading. The indices were also more comparable for treatment B than for treatment A. The correlations between the indices for the grade levels, Middle and Upper, were almost identical.

An analysis of the agreement among the three item analysis procedures showed that when a cut-off of .50 for the R and C-V indices was used for selection of items, there was complete agreement for 39 of the 64 items (61 percent) given on the pretest and retention test.

From the results of the several analyses, it appears that the best item analysis procedure to use for criterion-referenced testing, or pretest-posttest situations, is the C-V technique. This technique provides a reasonably accurate and stable estimate of its true value and gives very similar results when compared to the R index and the B-S procedure.

TO THE N's IN MY LIFE

ACKNOWLEDGMENTS

There are many individuals who have contributed to this work as well as to my professional and personal growth. Dr. William Mehrens, my chairman, advisor and friend, helped shape my ideas into a finished product, provided encouragement throughout my studies and gave me advice whenever I needed it. Dr. William Schmidt, who deserves a special thanks, spent a number of hours with me building the framework of this dissertation. Dr. Walter Hapkiewicz has provided me with constant attention throughout my graduate studies. His advice and concern for my educational progress have always been appreciated. Dr. Robert Spira, also a member of my committee, has had a significant impact on my educational achievement. Dr. Spira has had the faith and confidence in me to achieve what at times I was not sure I would be able to do. I will always be indebted to Dr. Spira for his meaningful comments and suggestions for my dissertation, professional goals and personal well-being.
I also wish to thank Joseph Nisenbaker, who assisted me with the computer simulation; four supervisors, Harley Jensen, Dr. Charles A. Pounian, Robert Joyce and Dan Wallock, who provided support and understanding during the trauma of the writing and rewriting of the dissertation; and my mother-in-law, Mrs. Marguerite Thrash, who also has provided me with support and encouragement during the completion of my graduate studies and dissertation.

I have saved for last the one individual who has inspired me the most, helped to build the self-confidence I lacked, listened to my ideas and helped to develop these ideas into a dissertation. I have reserved a very special thanks for this very special person--thank you--William Thrash.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF DIAGRAMS

Chapter
I. INTRODUCTION
   Need
   Purpose
   Research Questions
   Overview
II. REVIEW OF LITERATURE
   Proposed Item Analysis Techniques
      New Techniques
      Traditional Techniques
      Summary
   Comparing Techniques
   Summary
III. THEORETICAL DISCUSSION
   Summary
IV. DESIGN
   Part A: Design of the Simulation
   Part B: Design of the Comparison Study with Actual Data
   Summary
V. RESULTS OF THE SIMULATION
   The C-V Index: Adequacy and Stability
      Assumptions Met
      Comparison--Assumptions Met Versus Assumptions Not Met
   The R Index: Adequacy and Stability
      Assumptions Met
      Comparison--Assumptions Met Versus Assumptions Not Met
   The C-V Technique Versus the R Technique
   Consideration of C-V and R Techniques by Parameter Set
   Summary
VI. RESULTS OF THE COMPARISON OF THE THREE INDICES WITH ACTUAL DATA
   Comparability
      C-V and R
      B-S and C-V
      B-S and R
      B-S and C-V and R
   Summary
VII. SUMMARY AND CONCLUSIONS
   Summary
   Conclusions
   Discussion
   Implications for Future Research

APPENDICES
I. Roudabush's Technique
II. Brennan's and Stolurow's Procedure
III. Further Analyses of C-V and R
IV. B-S Statistics; Application of the B-S Decision Rules
V. Reliability Estimates of Tests
VI. Sample Tests and Objectives
VII. Computer Program for the Simulation

BIBLIOGRAPHY

LIST OF TABLES

Categories for a Given Item
Categories for Individuals Answering Item 1 Correctly at the Posttest (ISI)
Categories for a Given Item
Categories of Performance (Reliability-Crehan)
Categories of Performance (Validity-Crehan)
Categories for a Given Item
Categories for a Given Item
True Proportions for a Given Item
Observed Proportions for a Given Item
Categories for a Given Item--Observed Proportions
Categories for a Given Item--True Proportions
Pretest--Actual
Posttest--Actual
Selected Parameter Values for the Simulation
B-Index
Descriptive Statistics for Each Parameter Set
Parameter Sets Where Assumptions for C-V Are Met
Average Ranges for the C-V Estimates
Parameter Sets Where Assumptions for R Are Met
Average Ranges for the R Estimates
Summary Statistics Comparing R to C-V
Correlations
Summary Statistics for R and C-V With Consideration of Sample Size and Assumptions
Comparison of R and C-V by Parameter Set
Correlations of C-V and R
Correlations Between B-S and C-V
Correlations Between B-S and R
Correlations for All Items
Correlations--Mathematics
Correlations--Reading
Correlations--Middle
Correlations--Upper
Correlations--Treatment A
Correlations--Treatment B
B-S, R and C-V Values for Items Given on the Pretest and Retention Test
Agreement of the Three Item Indices: 100% Agreement
Agreement of the Three Item Indices: 67% Agreement
Agreement of the Three Item Indices: 67% Agreement
Pretest--Actual
Posttest--Actual
Categories for a Given Item--True Proportions
Categories for a Given Item--Observed Proportions
I.1 Categories for a Given Item
II.1 Rules for Decision-Making
IV.1 B-S Statistics
IV.2 Application of the B-S Decision Rules
V.1 Reliability Estimates of Tests

LIST OF DIAGRAMS

4.1 Design of Administration of Items
6.1 Design of Administration of Items

CHAPTER I

INTRODUCTION

Need

Criterion-referenced testing has been an area much discussed and researched in recent years. Much of the research and discussion has focused on the appropriateness of applying classical measurement theory to criterion-referenced tests and on suggestions of new procedures and statistics for the evaluation of criterion-referenced tests. Livingston (1971), for example, developed a new statistic for the estimation of reliability for criterion-referenced tests. Alternative approaches to classical item statistics were proposed by several other individuals (Brennan and Stolurow, 1971; Cox and Vargas, 1966; Roudabush, 1973, to mention a few). In addition, a few studies compared these new item statistics to old statistics (e.g., Cox and Vargas, 1966; Hambleton and Gorth, 1971; and Hsu, 1971).

Many of these new item statistics, however, were not based on a theoretical model. If such a model could be found, it would be easier to explain the item statistics and perhaps possible to develop more powerful statistical techniques. Moreover, little is known about the comparability of the new item statistics to each other. Most of the research has been concerned with the comparison of new with old; few studies have compared the new item statistics to each other. It would seem desirable to compare the new statistics both empirically and theoretically, with the aid of a general model, to determine what the differences among them actually are and to develop general recommendations for their use.
Purpose

The first purpose of this study is to propose a theoretical conception of criterion-referenced testing and to explain two basic item analysis techniques (Cox and Vargas, and Roudabush) theoretically with respect to this general model.

The second purpose is to determine the adequacy of the Cox and Vargas and Roudabush techniques. If the two techniques can be explained by the general model, then the estimate of each index will be compared to the corresponding true value. In this manner, it may be possible to determine if one technique estimates the item parameters better than the other.

A third approach (Brennan and Stolurow) cannot be explained in terms of the general model due to the nature of the approach. The Brennan and Stolurow technique combines a number of statistics with a set of decision rules. The ultimate outcome is a verdict of revision or no revision for the item and/or the instruction. While the statistics used in the Brennan and Stolurow method do have the traditional theoretical framework, the decision rules have only intuitive appeal. It is not possible to fit the suggested decision rules of the Brennan and Stolurow technique into a theoretical framework. However, the adequacy of the Brennan and Stolurow technique may be determined by comparison of the three approaches on real data. This, then, is the final purpose of the study--to determine the comparability of the three item analysis procedures (Cox and Vargas, Roudabush, and Brennan and Stolurow).1 If all procedures provide identical or nearly identical results, then it seems reasonable to use the simplest method (in terms of computation and data collection) in the future.

Research Questions

In particular, this investigation will consider the following questions:

1. Can a theoretical conception or a general model of criterion-referenced testing be defined?
   a. Does the C-V technique fit the general model? What assumptions are needed?
   b. Does the R technique fit the general model?
What assumptions are needed?

2. Do the C-V and R techniques adequately estimate the true values of the item parameters?
   a. Does one technique estimate the true values better than the other?
   b. Do the C-V and R techniques estimate some true values of the item parameters better than the others?

3. Do the C-V, R and B-S item analysis procedures provide comparable results?
   a. Are the three procedures more comparable for items in Mathematics than for items in Reading?
   b. Does the comparability of the three procedures depend on the grade level?
   c. Are the three procedures more comparable for items given in treatment A than for items given in treatment B?

1From this point on, the Cox and Vargas, Roudabush, and Brennan and Stolurow techniques will be abbreviated C-V, R and B-S, respectively.

Overview

The previous section provided a brief introduction to the ideas and questions pursued in this study. Chapter II will provide a review of the literature relevant to item analysis methods for criterion-referenced tests. Two types of studies are considered--studies which proposed item analysis techniques (new and modifications of traditional approaches) and those which compared new techniques to old. The third chapter presents a theoretical conception of criterion-referenced testing. The C-V index and the R sensitivity index are described in the context of this theoretical model. A method for evaluating the C-V index and the R sensitivity index with respect to the theoretical model is presented in Chapter IV. Procedures for determining the comparability of the C-V index, the R sensitivity index and the B-S method are also discussed in this chapter. Chapter V presents the results of the evaluation of the C-V and R techniques with respect to the model. The results of the investigation of the comparability of the C-V, R and B-S indices in a practical application are presented in Chapter VI.
Finally, in Chapter VII some implications of the results of Chapters V and VI for test development are discussed, and some recommendations for further research on the proposed theoretical model are given.

CHAPTER II

REVIEW OF LITERATURE

The concept of criterion-referenced measurement in education has initiated many discussions and much research with respect to measurement issues. The main points of interest have been cut-off scores, reliability and item analysis. This review will summarize the literature on item analysis.

The literature can be divided into two categories. One group of studies can be collected under the heading of "proposed item analysis techniques." New techniques have been proposed by some (Brennan, 1972; Brennan and Stolurow, 1971; Cox and Vargas, 1966; Crehan, 1974; Hsu, 1971; Ivens, 1970, 1972; Kifer and Bramble, 1974; Kosecoff and Klein, 1974; Roudabush, 1973; Saupe, 1966), and the use of old (traditional) techniques has been advocated by others (Davis and Diamond, 1974; Ebel, 1973; Hambleton and Gorth, 1971; Harris, 1974; Nitko, 1971; Popham and Husek, 1969). The second category includes research which makes comparisons among the proposed techniques (Cox and Vargas, 1966; Crehan, 1974; Haladyna, 1974; Hambleton and Gorth, 1971; Helmstadter, 1974; Hsu, 1971; Ivens, 1970, 1972; Kosecoff and Klein, 1974; Ozenne, 1971).

Proposed Item Analysis Techniques

New Techniques

One of the earliest item analysis techniques proposed for criterion-referenced tests was suggested by Cox and Vargas in 1966 (Cox and Vargas, 1966). This procedure requires two administrations of the item--before and after instruction. The item statistic is then defined as the difference between the proportion of individuals answering the item correctly at posttest and the proportion of individuals answering the item correctly at pretest; C-V.
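As an illustration of the computation (a minimal sketch with hypothetical 0/1 response vectors; the function and data names are mine, not Cox and Vargas's):

```python
def cv_index(pre_scores, post_scores):
    """Cox-Vargas (C-V) index for one item: proportion answering
    correctly at posttest minus proportion correct at pretest."""
    p_pre = sum(pre_scores) / len(pre_scores)
    p_post = sum(post_scores) / len(post_scores)
    return p_post - p_pre

# Hypothetical data: the same 10 examinees before and after instruction.
pre = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]    # 2/10 correct at pretest
post = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]   # 8/10 correct at posttest
print(round(cv_index(pre, post), 2))    # 0.6
```

An item that most examinees fail before instruction and pass after it yields a value near one; an item unaffected by instruction yields a value near zero.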
This is the simplest technique to use; however, it has been criticized by Oakland (1972) and Davis and Diamond (1974). Oakland claims that the C-V technique is limited because it is "more appropriately used to determine the extent to which students may profit from instruction rather than to determine the reliability estimates which apply to a particular CRM" (Oakland, 1972, p. 5). This is a strange criticism, for indeed the intent of the C-V procedure is to select items and not to provide reliability estimates. Oakland also criticizes the use of a statistical technique for item selection without regard to item content. This is a criticism which could be applied to the use of any statistical technique in the selection of items without regard for content.

Davis and Diamond suggest that the use of difference scores makes the C-V index unreliable. It should be remembered here that the statistic is not based on individual difference scores, but on the difference of proportions. They also felt that the use of this statistic without regard to the content of the items would impair the content validity of the final form of the test. According to Davis and Diamond, test developers should use the same four basic principles that have been in use for 25-30 years. They do caution, however, against using the second principle without regard to the content of the item. These principles are:

1. The items in an achievement test should constitute as nearly as possible a representative sample of the population of items that define the domain to be measured . . . .

2. The items in a predictor test, . . . , should constitute the set (drawn from the population of items that define the domain to be tested) which best predicts scores on the designated criterion variable in samples of examinees like those to whom the test will be administered. . . .

3.
The items in an achievement test should, within the constraint imposed by principle 1, make up as efficient a measuring instrument as it is possible to produce.

4. Choice-by-choice item-analysis data should be used as a basis for editing and revising items for achievement, aptitude, and selection tests. (Davis and Diamond, 1974, pp. 128-131.)

Of course all these principles are ones that should be considered regardless of the referencing nature of the test. However, it does not necessarily follow that the items will be doing the proper job if these principles are followed.

Ebel (1973) supports the use of the C-V technique when the purpose of the evaluation is to determine the effectiveness of an instructional program. However, he indicates traditional item discrimination indices are appropriate when the purpose is to determine how well an individual has succeeded in a particular course of study.

Ozenne (1971) also recommends the C-V index. In his investigation of a method of measuring test sensitivity, Ozenne suggested that a test composed of items selected on the basis of the C-V index would have the greatest sensitivity to instruction. Haladyna also recommends the use of the C-V index (Haladyna, 1976, and Haladyna and Roid, 1976). In fact, he feels that the C-V ". . . index comes conceptually closest to measuring CR item discrimination" (Haladyna, 1976, p. 12).

Other individuals have considered the C-V technique as a starting point for further modifications. Brennan (1972) proposed the B index, a variation of the C-V technique and the traditional D. The D statistic is defined as the difference in the proportion of individuals in the upper group answering the item correctly and the proportion of individuals in the lower group answering the item correctly. The upper and lower groups are generally defined as the top and bottom 27 percent of the individuals ranked on the total test.
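The traditional D statistic just described might be sketched as follows (hypothetical data; the function and variable names are illustrative, not from any of the cited sources):

```python
def d_statistic(total_scores, item_scores, frac=0.27):
    """Traditional discrimination index D: proportion correct on the
    item in the upper group minus that in the lower group, where the
    groups are the top and bottom `frac` (27 percent by convention)
    of examinees ranked on total test score."""
    n = len(total_scores)
    k = max(1, round(frac * n))
    # Rank examinees by total test score, ascending.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_scores[i] for i in upper) / k
    p_lower = sum(item_scores[i] for i in lower) / k
    return p_upper - p_lower

# Hypothetical: 10 examinees' total scores and their 0/1 item scores.
totals = [3, 9, 5, 8, 2, 7, 4, 10, 6, 1]
item = [0, 1, 1, 1, 0, 1, 0, 1, 0, 0]
print(d_statistic(totals, item))  # 1.0: the item separates high and low scorers
```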
The B index is defined as the proportion of individuals in the mastery group (upper) who answer the item correctly minus the proportion of individuals in the nonmastery (lower) group who answer the item correctly (B = U/n1 - L/n2). This index differs from D in that different sample sizes in the upper and lower groups are allowed. The evaluator is then able to use one administration, define the upper and lower groups according to mastery or nonmastery or by some similar criterion, and select items on the basis of this index. Brennan also determined the exact distribution of the B index under the null hypothesis, B = 0. This allows the evaluator to compute confidence intervals for the item statistic.

Hsu had already suggested an identical procedure in 1971 (Hsu, 1971). He suggested that a predetermined cut-off score be
The B index as originally proposed by Brennan and Hsu or modified as suggested by Crehan is very similar to the C-V technique and traditional techniques. One advantage for using B is the ability to use a different number of individuals in the upper and lower groups. A second advantage is the ability to test the null hypothe- sis, B = 0. It must be remembered, however, that teachers are the most likely users of criterion-referenced tests. It seems unrealistic to expect teachers to use sophisticated statistical techniques to 11 select items. A further problem is the availability of probability levels for B. The table of probability levels is available through a computer program which Brennan developed. The other criticisms that were mentioned previously must also be considered in the final analysis of the 8 index. A second index that Crehan proposed is defined as the propor- tion of consistent performances on logically parallel items. In other words, this index equals the number of individuals who fail both items plus the number of individuals who pass both items divided by the total number of individuals. This of course requires the development of logically parallel items which is not necessarily an easy task. In addition, it requires the administration of both sets of items at the same time. For a short test, the time factor would not be a particular problem. Crehan also employed a third unique technique in his study. The items were ranked by having teachers respond to the question, “Which item would you choose if you were to give a one item test?" (Crehan, 1974, p. 257). This was done until the item pool was exhausted. Compared to all the other item analysis procedures pro- posed, this approach is the most subjective one.1 Another refinement of the C-V method was suggested by Edmonston, Randall and Oakland (1972). For their method consider the two by two table below for a given item: 1Crehan also used a random ranking of items as an item selection device. 
See the section on comparison of techniques for the results of Crehan's study. 12 Table 2.1 Categories for a Given Item Posttest Pass Fail Pass pH p12 Pretest Fail p21 p22 The important pieces of information, they claim, are p12 and p2]. A high value for p21 would indicate a good item. Items that were less diScriminating would have high p12 values. The refinement seems unnecessary since the C-V index would be p2] - p12 and provides information of one value relative to the other. Schooley, et al. (1976) also recommend consideration of the proportion of individuals answering the item correctly (p) on pre- test and posttest. They suggest that the proportion should increase from pretest to posttest. In addition, items that supposedly measure the same objective should have similar p values. Those that have inconsistent p values should be looked at and revised if necessary. Their approach is very similar to the C-V method since a comparison of the p values from pretest to posttest would give the same value as the C-V method. Ivens also considered the C-V technique in addition to two indices of his own (Ivens, 1970, 1972; Ozenne, 1971). Iven's indices require three administrations of the same item to the same subjects. One of the indices is based on the expectation that there would be a 13 large change in performance from pretest to posttest and a small change from posttest to retest. Ivens calls this Index 2 and it is defined as (p post - p pre) (1 - lp retest - p postl ) where p is the proportion of subjects passing the given item on the particular administration. The other index (Index 1) is defined as (l - pre- post agreement) (post-retest agreement) where the agreement is the proportion of subjects whose item scores (pass or fail) were in agree- ment across the appropriate administrations. His final recommendation, however, is that the C-V technique be used for item selection and the information obtained from Index 2 be used for item revision (Ivens, 1970). 
The two indices defined by Ivens need three administrations of the item. In most situations this would be a definite disadvantage. In addition, if there is a minimum amount of change from posttest to retest, |p retest - p post| would be small and 1 - |p retest - p post| would be close to one. In this case, Ivens' Index 2 would be approximately equal to the C-V index.

Ivens' Index 1 is also intuitively appealing. However, Index 1 can have a high value--indicating a good item--and yet be a bad item. For example, if many students pass the pretest, fail the posttest and fail the retest, Index 1 would have a high value. Yet, revision of the item (and probably instruction) should be considered.

Kosecoff and Klein (1974) suggest two indices--an Internal Sensitivity Index (ISI) and an External Sensitivity Index (ESI). For the first index (ISI) consider the following table which categorizes only those individuals who answered Item 1 correctly at the posttest:

Table 2.2
Categories for Individuals Answering Item 1 Correctly at the Posttest (ISI)

                        Posttest
                     Fail     Pass
Pretest    Fail       n1       n2
           Pass       n3       n4

where n1 = observed frequency of students who answered Item 1 correctly on the posttest but failed the pre and posttest; n2 = observed frequency of students who answered Item 1 correctly on the posttest but failed the pretest and passed the posttest; n3 = observed frequency of students who answered Item 1 correctly on the posttest but passed the pretest and failed the posttest; and n4 = observed frequency of students who answered Item 1 correctly on the posttest and passed the pretest and the posttest.

The index ISI is defined as (n2 - n1)/(n1 + n2 + n3 + n4), which according to Kosecoff and Klein, provides a measure of an item's ability to discriminate between those who have and have not profited from instruction. Their interpretation of the index does not, however, follow from the definition.
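The ISI computation itself is trivial once the four frequencies are tabulated. The sketch below assumes the reconstructed definition (n2 - n1)/(n1 + n2 + n3 + n4) and uses hypothetical counts.

```python
# Sketch of Kosecoff and Klein's Internal Sensitivity Index (ISI),
# assuming the reconstructed definition (n2 - n1) / (n1 + n2 + n3 + n4),
# where the n's follow the Table 2.2 notation and count only those
# individuals who answered the item correctly at the posttest.

def isi(n1, n2, n3, n4):
    """ISI: high when most correct posttest responders moved fail -> pass."""
    return (n2 - n1) / (n1 + n2 + n3 + n4)

# Hypothetical example: of 50 students who got the item right at posttest,
# 35 moved from fail to pass (n2) and 5 failed the test both times (n1).
print(isi(n1=5, n2=35, n3=2, n4=8))  # (35 - 5) / 50 = 0.6
```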
It is conceivable that the index could have a high value but all who passed the item at posttest also passed the item at pretest. How does the item then have the ability to discriminate those who have profited from instruction from those who haven't? If all the individuals who passed the item at posttest also passed the item at pretest, the item could not be said to be sensitive to instruction.

Their second index (ESI) is the Cox and Vargas index. The two indices are identical. Kosecoff and Klein do, however, suggest a "correction for guessing" for the index. They use the Marks and Noll procedure, which is also used by Roudabush in the development of his index, to derive the correction for guessing (Marks and Noll, 1967; Roudabush, 1973). They claim to compute the expected cell frequencies and use these values in the computation of the ESI. However, their expected cell frequencies are true frequencies which are heuristically computed from sample frequencies. This aspect will be discussed in more detail when Roudabush's sensitivity index is presented. (See Chapter III and Appendix I.)

A method based on the four possible outcome patterns for an item administered on two occasions was proposed by Popham in 1970 (Kosecoff and Klein, 1974; Ozenne, 1971). The familiar two by two table (see Table 2.3) was used in conjunction with computation of Chi-square values.

Table 2.3
Categories for a Given Item

                        Posttest
                     Fail      Pass
Pretest    Fail     f1 (n1)   f2 (n2)
           Pass     f3 (n3)   f4 (n4)

First it is necessary to count the number in each category (f1, f2, f3, f4--following the notation presented in Table 2.3). Secondly, a "prototypic item" is defined by taking the median frequency of each outcome category over all items. Finally, a comparison is made between this prototypic item and the actual frequencies in the four categories for each item. Large Chi-square values would suggest that the item is considerably different than the typical item.
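The prototypic-item comparison can be sketched as follows, assuming each item is summarized by its four Table 2.3 outcome frequencies; the chi-square form sum((f - e)^2 / e) and the data are illustrative assumptions, not Popham's published computation.

```python
# Sketch of the prototypic-item comparison described above, assuming
# each item is summarized by its four outcome frequencies (f1..f4)
# from Table 2.3. The "prototypic item" is the per-category median
# over all items; each item is then compared to it with a
# chi-square-style statistic sum((f - e)^2 / e).

from statistics import median

def prototypic_item(items):
    """Median frequency of each outcome category over all items."""
    return [median(item[k] for item in items) for k in range(4)]

def chi_square(observed, expected):
    """Chi-square statistic of an item's frequencies vs the prototype."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

items = [
    (10, 25, 2, 13),   # (f1, f2, f3, f4) for each item
    (12, 22, 3, 13),
    (11, 24, 2, 13),
    (30,  2, 1, 17),   # an atypical item: little fail-to-pass movement
]
proto = prototypic_item(items)
for item in items:
    print(chi_square(item, proto))  # the atypical item stands out
```

The large statistic flags the fourth item as differing considerably from the typical item, but, as noted below, the method itself gives no cut-off for how large is "too large."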
One problem with the technique is that the items in the test must be fairly homogeneous to give meaningful results. A second problem is not knowing how large the Chi-square values need to be for one to infer that the item is atypical or bad.

Three other studies have proposed methods totally different from the basic two-way table--Cox and Vargas--approach. Kifer and Bramble calibrated a criterion-referenced test using the Rasch model, which is a latent trait model (Kifer and Bramble, 1974). They felt that the Rasch model could determine which items fit the model and which items need revision. However, as in the Popham method, all items need to be sampling one trait; if not, some items may not fit but yet be good items. Item analysis was a subobjective of their study. Their main emphasis was the desire to generalize about the scores and obtain more precision concerning the extent to which a score represents passing a criterion.

Bayesian techniques were applied to item analysis by Helmstadter (1974). Three separate indices of item effectiveness are defined in terms of probabilities. The first is the probability that a subject knows the content given that the correct response was selected. The probability that a subject does not know the content given that the incorrect response was selected defines the second index, and the probability that a correct decision will be made about the examinee's knowledge of the content given the results of performance on that item is the third. For these indices, P indicates a correct response, P' an incorrect response, K knowledge and K' no knowledge. The first index is denoted by P(K|P), the second by P(K'|P') and the third by P(correct decision), equal to P(KP or K'P'). Bayes' theorem then implies that

P(K|P) = P(P|K)P(K) / [P(P|K)P(K) + P(P|K')P(K')]

and

P(K'|P') = P(P'|K')P(K') / [P(P'|K')P(K') + P(P'|K)P(K)].

Each of the subcomponents, such as P(P|K), were established on the basis of the administration of an item.
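Given estimates of the component probabilities, the three Bayesian indices follow by direct arithmetic. The values below are illustrative assumptions, not Helmstadter's data.

```python
# Sketch of the three Bayesian item indices described above, assuming
# the component probabilities below (prior P(K) and the conditionals
# P(P|K), P(P|K')) have been estimated from an item administration.
# All numeric values are hypothetical.

p_K = 0.6              # prior probability a subject knows the content
p_P_given_K = 0.95     # knowers usually answer correctly
p_P_given_notK = 0.25  # non-knowers sometimes guess correctly

p_notK = 1 - p_K
p_notP_given_K = 1 - p_P_given_K
p_notP_given_notK = 1 - p_P_given_notK

# Index 1: P(K|P), via Bayes' theorem.
p_K_given_P = (p_P_given_K * p_K) / (
    p_P_given_K * p_K + p_P_given_notK * p_notK)

# Index 2: P(K'|P').
p_notK_given_notP = (p_notP_given_notK * p_notK) / (
    p_notP_given_notK * p_notK + p_notP_given_K * p_K)

# Index 3: P(correct decision) = P(K and P) + P(K' and P').
p_correct_decision = p_P_given_K * p_K + p_notP_given_notK * p_notK

print(p_K_given_P, p_notK_given_notP, p_correct_decision)
```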
The probabilities P(K|P), P(K'|P') and P(correct decision) were then computed using these pieces of information. There is still the same problem with these indices of determining a cut-off value for the establishment of a knowledge group and a no knowledge group. These indices can use pretest-posttest data or a single administration.

Saupe was concerned with maximizing the reliability of difference scores (Saupe, 1966). He suggested that items possessing certain characteristics would make the maximum contribution to the reliable measurement of change. According to his analysis, items with the following characteristics should be considered as good items:

1. Items with high item-total score discrimination indices for both initial and final administrations of the test.
2. Items with low item-total score discrimination indices when the total score criterion is from the final administration for items in the initial administration and from the initial administration for items in the final administration.
3. Items with high correlations between initial administration item score and final administration item score (Saupe, 1966, p. 224).

Saupe derived an index that could be used in the selection of items to measure change. Items with high values of this index would be selected and items with low values rejected. The index is based on the correlation of the change in the item score with the change in the total test score:

r_dD = (r_xX + r_yY - r_xY - r_yX) / (2 sqrt(1 - r_xy) sqrt(1 - r_XY))

where x and y represent item scores, X and Y represent total test scores, and d and D represent the item and total-score change scores, respectively. Although Saupe was not directly concerned with criterion-referenced tests, his work has some applicability to it. Obviously items in a pretest-posttest situation are meant to measure change and the index might have some usefulness in predicting those items which are sensitive to change. The third criterion, however, seems inconsistent with results of criterion-referenced testing.
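The change-score correlation above can be checked numerically: for standardized scores the formula agrees with the correlation computed directly from the difference scores. The sketch below uses simulated, hypothetical data.

```python
# Sketch of Saupe's index: the correlation between item change
# (d = y - x) and total-score change (D = Y - X), computed two ways --
# directly, and from the formula
# (r_xX + r_yY - r_xY - r_yX) / (2 * sqrt(1 - r_xy) * sqrt(1 - r_XY)).
# The formula form assumes standardized scores (unit variances);
# the simulated data here are purely illustrative.

import math
import random

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = math.sqrt(sum((u - ma) ** 2 for u in a))
    sb = math.sqrt(sum((v - mb) ** 2 for v in b))
    return cov / (sa * sb)

def standardize(a):
    n = len(a)
    m = sum(a) / n
    s = math.sqrt(sum((u - m) ** 2 for u in a) / n)
    return [(u - m) / s for u in a]

random.seed(1)
x = standardize([random.gauss(0, 1) for _ in range(500)])   # initial item score
y = standardize([xi + random.gauss(0, 1) for xi in x])      # final item score
X = standardize([xi + random.gauss(0, 1) for xi in x])      # initial total score
Y = standardize([yi + random.gauss(0, 1) for yi in y])      # final total score

direct = corr([yi - xi for xi, yi in zip(x, y)],
              [Yi - Xi for Xi, Yi in zip(X, Y)])
formula = (corr(x, X) + corr(y, Y) - corr(x, Y) - corr(y, X)) / (
    2 * math.sqrt(1 - corr(x, y)) * math.sqrt(1 - corr(X, Y)))
print(direct, formula)  # the two computations agree
```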
This criterion specifies that an item with a high correlation between initial item score and final item score is a good item. This high correlation would be achieved only if there is some variance on the pretest (not all individuals fail) and some variance on the posttest (not all individuals pass). In addition, a high positive correlation is not obtained if an item is failed by most on the pretest and passed by most on the posttest. This is the situation desired in criterion-referenced testing. A high correlation would not designate items sensitive to instruction.

Criterion two suggests that discrimination indices should be low between item and total score using the opposite administration for the criterion. Again these low discrimination values could be obtained and yet the item might be a bad item. For example, a low discrimination value could be obtained with almost all passing the item at pretest and getting low scores on the posttest. A similar situation would result with almost all failing the item on the posttest and obtaining somewhat high scores on the pretest. These results are not desirable in criterion-referenced testing. Items exhibiting these characteristics might not be good items.

As with almost all of these techniques, care must be taken to include items that cover all the objectives. Relying on only statistics to select items may result in the exclusion of some important aspects that need to be tested. Nitko, when considering this problem, suggested "that tests constructed from carefully defined domains of items possess reasonably good psychometric properties without prior statistical selection" (Nitko, 1971, p. 8). On the other hand, Skager felt that "relying solely upon judgments as an index of item quality ought to leave us just as uneasy in the case of criterion-referenced tests as it should be for norm-referenced instruments" (Skager, 1974, p. 53).
One of his suggestions was the use of item generation rules, although he indicated that item selection for criterion-referenced tests is still open to debate. Hambleton, et al. (1975) also do not advocate the use of empirical techniques exclusively. They feel that items selected should be representative of the domain of items and that the empirical methods should be used to detect bad items.

Consideration should also be given to the impact of selecting items that are sensitive to instruction according to some statistic. If items are selected which are sensitive to instruction, one might argue that the items, over a number of administrations and revisions, could become very easy or perhaps require only recall of simple facts. Care must be taken to include items that measure all aspects of the domain and to ensure that these items are not only sensitive to instruction but sensitive to the domain.

Another approach similar to the C-V index was presented by Roudabush at the 1973 American Educational Research Association Annual Meeting. It is based in part on a procedure suggested by Marks and Noll (1967). As was pointed out earlier, Kosecoff and Klein used a similar technique to develop the "correction for guessing" for their External Sensitivity Index.

Roudabush's technique is based on the familiar two by two table presented earlier as Table 2.3. Roudabush also makes two assumptions. First, he assumes that there is some fixed non-zero probability, p, that a student who does not know the answer to the item will guess the correct answer. This p value is determined by the item only and does not vary from student to student nor from occasion to occasion for the same student. This fixed p value suggests that there is no partial knowledge on the part of the student, and that the student's responses are independent at pretest and posttest when he does not know the correct answer and fails to learn it.
Further, Roudabush assumes that the only possible result of exposure to instruction between pretest and posttest is that the student learns the correct response to an item. This then implies that the non-zero frequency of f3 is solely due to guessing, further implying that there is no forgetting. This suggests that the "true" value of f3 is zero.

With these assumptions Roudabush derives a number which serves as an index of the degree to which examinees select the correct response to the item as a function of the instruction received between pretest and posttest. This number is called a sensitivity index by Roudabush. It can be expressed in terms of the observed frequencies f1, f2, f3 and f4 of Table 2.3. (The original notation was S.) Further clarifications and derivations are presented in Chapter III and in Appendix I.

Traditional Techniques

Traditional item analysis procedures also have been recommended for use with criterion-referenced tests. Most individuals have, however, suggested some modifications in the interpretation of these traditional indices.

One of the more detailed procedures is outlined by Brennan and Stolurow (1971). Their procedure combines traditional item analysis techniques with a set of decision rules. Brennan and Stolurow compute four error rates and two discrimination indices from pretest, posttest and retention test data. The decision rules are then applied to determine the adequacy of the item and of the instruction. The decision rules are similar in context to the first criterion of a good item suggested by Saupe. Further clarifications of this technique are presented in Chapter IV and Appendix II. Their procedure is very complicated and laborious and for this reason, perhaps, has not been investigated further.

Other individuals have also recommended the use of traditional indices. Hsu recommends the use of the phi-coefficient with Right versus Wrong for a given item being one dimension and Mastery versus Nonmastery the other (Hsu, 1971).
For this procedure, a cut-off score for each behavior must be established in order to declare a mastery and a nonmastery group. There are other limitations besides the problem of establishing a cut-off score. The phi-coefficient cannot be used when the item is answered correctly or incorrectly by all or when all subjects are declared masters or nonmasters. Hsu then recommends the use of his upper-lower difference statistic, defined as the difference in proportions of those responding correctly in the mastery and nonmastery groups, or the point-biserial correlation coefficient. Hsu's upper-lower difference statistic was discussed in the previous section.

Hambleton and Gorth (1971) also suggest using traditional item analysis procedures. Items associated with the same objective should have approximately the same value for item difficulty. Items that are different should be modified and tested again. In addition, item discrimination indices can be used. Negative indices would indicate a need for revision in the item, instructional materials, and/or teaching. Positive discrimination indices, according to Hambleton and Gorth, more than likely indicate a shortcoming in the instructional program. Items with zero discrimination may be acceptable. Popham and Husek recommended the same interpretations of discrimination indices in 1969 (Popham and Husek, 1969).

If the traditional methods and the interpretations suggested by Hambleton and Gorth and Popham and Husek are used, then the information that is obtained seems to be ambiguous and no definite decision can be made about the item. However, Brennan and Stolurow took these bits of information with other information and a set of rules and developed a useful guide for item selection for criterion-referenced tests.

Item characteristic curves, another traditional item analysis technique, can also be used for criterion-referenced tests (Hambleton and Gorth, 1971).
The parameters (difficulty and discrimination) of the curves supposedly do not change from group to group. This implies that the parameters could be predicted from the pretest administration. An obvious disadvantage in using item characteristic curves would be in their construction and interpretation. This procedure would not be one of the easiest to use or understand.

Harris also suggests traditional item analysis techniques for criterion-referenced tests. However, the test should be used with a sample from a population of instructed students and a sample from a population of uninstructed students. Item difficulties for items for a given objective should be equal within each of the two groups; however, item difficulties should differ between the two groups (Harris, 1974).

Woodson's position is very similar to Harris' position. Woodson argues that the item needs to be tested in the proper population. He feels that "items and tests must be evaluated for the range of the characteristic for which they will be used" and if the items and tests give no variability in this population of observation, then the items and/or tests give no information and are not useful (Woodson, 1974, p. 64).

Both of these suggestions are considered when pretest and posttest data are used. The pretest group is generally considered the uninstructed group and the posttest group the instructed group. The B-S decision process includes a comparison of the pretest and posttest item difficulties, and the C-V index and R index are comparisons of the pretest and posttest difficulties. Since most of the other proposed item analysis techniques also consider pretest and posttest data, the Harris and Woodson suggestion of testing the item in a proper population is taken into account.

Summary

The various techniques that have been proposed fall into essentially two categories.
One category of techniques contains the C-V technique and its variations (Brennan, 1972; Crehan, 1974; Edmonston, Randall and Oakland, 1972; Hsu, 1971; Ivens, 1970, 1972; Kosecoff and Klein, 1974). The other category contains item analysis procedures generally used for norm-referenced tests, with possible alternative interpretations. As is discussed above, these new meanings for old statistics sometimes result in a technique or procedure which is similar to the C-V procedure. Every new technique seems to have as its main purpose selecting items that are sensitive to instruction. However, there is a need to be alert to the negative implications of selecting items sensitive to instruction. Most individuals recommend using item statistics in conjunction with a review of the domain or objectives and close scrutiny of the instruction. This aspect will be discussed more thoroughly in the final chapter.

Review of the proposed techniques has shown that the C-V index or modifications of the C-V index have been recommended more frequently than any other procedure as an appropriate item analysis technique for criterion-referenced tests. The R technique is a refinement of the C-V technique and, as will be shown in the following chapter, makes fewer assumptions than the C-V index. Therefore, the R index may provide a better estimate of an item's sensitivity to instruction than the C-V index. The B-S procedure combines the best of traditional methods in an attempt to select good items for criterion-referenced tests. All three of these procedures may be considered useful in selecting items that are sensitive to instruction. Most of the remaining procedures are latent trait models. While these are useful, they fail to meet the criterion of computational ease which is important in most of the situations where criterion-referenced tests are used.

Comparing Techniques

Several studies have been done to compare new item statistics to old item statistics.
Crehan (1974) compared six item analysis techniques using a pool of items constructed by teachers. The procedures he compared were the C-V, a modified Brennan, a teacher rating, a point-biserial correlation between item score and total test score in the posttest situation, a random ranking, and an index which was defined as the proportion of consistent responses on logically parallel items. Crehan used the concepts of reliability and validity to compare tests composed of items selected by each of the six techniques. Reliability was estimated by (a + c)/N where N = a + b + c + d and a, b, c, d are defined in Table 2.4 below. Validity was estimated by (a + c)/N where N = a + b + c + d and a, b, c, d are defined differently in Table 2.5 below.

Table 2.4
Categories of Performance (Reliability--Crehan)

                      Form B
                   Fail     Pass
Form A    Pass      b        a
          Fail      c        d

Table 2.5
Categories of Performance (Validity--Crehan)

          Uninstructed Group    Instructed Group
Pass              b                     a
Fail              c                     d

In addition, validity was estimated by the point-biserial correlation between test score and a dummy variable representing group membership (instructed group and uninstructed group). The instructed group was a posttest-only group and the uninstructed group was a pretest group. The results of his study suggested that the modified Brennan and C-V methods produced tests with higher test validity. However, the different item selection methods seemed to have no effect on test reliability.

In order to generalize from the results of this study, the definitions of reliability and validity employed by Crehan must be accepted as reasonable. Both definitions are rationally appealing if not theoretically appealing. Reliability could also have been estimated with a phi-coefficient. But with either method the determination of cut-offs is arbitrary and the estimates can increase or decrease with shifts of the cut-offs. Validity could also have been estimated with a phi-coefficient.
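Crehan's (a + c)/N estimates can be sketched as simple agreement proportions. This assumes the consistent cells are a (pass on both forms) and c (fail on both) for reliability, and a (instructed pass), c (uninstructed fail) for validity, as the (a + c)/N formulas require; the data are hypothetical.

```python
# Sketch of Crehan's agreement-based estimates, assuming 0/1
# pass/fail decisions. Reliability: proportion of examinees classified
# the same way by two forms (a = pass both, c = fail both). Validity:
# proportion of instructed examinees passing plus uninstructed
# examinees failing.

def crehan_reliability(form_a, form_b):
    """(a + c)/N: agreement of pass/fail decisions across two forms."""
    a = sum(1 for x, y in zip(form_a, form_b) if x == 1 and y == 1)
    c = sum(1 for x, y in zip(form_a, form_b) if x == 0 and y == 0)
    return (a + c) / len(form_a)

def crehan_validity(instructed, uninstructed):
    """(a + c)/N: instructed passes plus uninstructed failures."""
    a = sum(instructed)                     # instructed who pass
    c = sum(1 - x for x in uninstructed)    # uninstructed who fail
    return (a + c) / (len(instructed) + len(uninstructed))

form_a = [1, 1, 0, 1, 0, 1]
form_b = [1, 1, 0, 0, 0, 1]
print(crehan_reliability(form_a, form_b))   # 5 of 6 consistent

instructed   = [1, 1, 1, 0, 1]
uninstructed = [0, 0, 1, 0, 0]
print(crehan_validity(instructed, uninstructed))  # (4 + 4) / 10 = 0.8
```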
The same problem exists, however, with determination of cut-offs and assignment to pass or fail groups. The point-biserial, which was also used to estimate validity, does not have the problem of determination of cut-offs.

Two groups of individuals were included in the sample. One group was used to compute item statistics, develop tests and set passing points. The other group was used to determine reliability and validity. The process was reversed and reliability and validity estimates obtained from both groups were averaged. This is unfortunate since it seems reasonable to think of one group as the cross-validation sample. The obtained reliability and validity estimates from both groups could then have been compared and inconsistencies located. Item statistics were not compared across samples of individuals, even though those data were available. Questions such as how the item values fluctuated across samples and across subject areas were not considered in this study.

The only conclusion that we can draw from this study is that if the C-V or modified Brennan techniques for selection of items for criterion-referenced tests are used, the validity, as defined by Crehan, might be better than if some other technique for selection were used.

Several other individuals have also compared the C-V index to alternative methods (Cox and Vargas, 1966; Haladyna, 1974; Haladyna and Roid, 1976; Hambleton and Gorth, 1971; Hsu, 1971; Ivens, 1970, 1972; Kosecoff and Klein, 1974). It is interesting to note that of the 11 studies that are reported here which compare criterion-referenced item analysis techniques, eight include the C-V method. This index has to be appealing because of the ease of computation. In addition, it seems to fare extremely well in the comparisons with other techniques.
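The computational appeal of the C-V index is easy to see: it is simply the difference between the proportions answering the item correctly at posttest and at pretest. A minimal sketch, assuming 0/1 item scores for the same examinees on both occasions:

```python
# Minimal sketch of the Cox and Vargas (C-V) index: proportion correct
# at posttest minus proportion correct at pretest, assuming 0/1 item
# scores for the same examinees on both occasions. Data are hypothetical.

def cv_index(pretest, posttest):
    """C-V index: p(post) - p(pre); near 1 for instruction-sensitive items."""
    p_pre = sum(pretest) / len(pretest)
    p_post = sum(posttest) / len(posttest)
    return p_post - p_pre

pretest  = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]   # 2 of 10 correct before instruction
posttest = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]   # 8 of 10 correct after instruction
print(cv_index(pretest, posttest))  # difference of proportions correct
```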
Cox and Vargas (1966) and Hambleton and Gorth (1971) concluded that the C-V index produces results different enough from traditional methods to warrant the consideration of this alternative technique for criterion-referenced test construction. Cox and Vargas compared D to C-V and Hambleton and Gorth compared C-V to the biserial correlation and a modified C-V. The modified C-V was defined as the difference between the proportion of individuals who correctly answered an item on the delayed posttest and the proportion of individuals who correctly answered the same item on the pretest, C-V'. While Hambleton and Gorth found no relationship between C-V and C-V' with the biserial, Cox and Vargas did find significant Spearman rank order correlations between the rank on C-V and the rank on D.

Haladyna, on the other hand, concluded from his study that a point-biserial discrimination index computed on the combined test results of pre- and post-instruction examinees is better than C-V. His conclusion is based on the result of his analysis, which indicated that the two statistics give identical information and the point-biserial requires a one-step analysis while the C-V requires a two-step analysis. His argument that the point-biserial is a one-step process is based on the availability of computer programs to compute the correlations. For a classroom teacher, C-V has the advantage of being easy to compute as well as "conceptually satisfying" (Haladyna, 1974, p. 98).

Hsu investigated the relationship of a modified C-V (C-V") with the point-biserial correlation, r_pbi, and the phi-coefficient using various samples of individuals (Hsu, 1971). The index C-V" is defined as the difference in proportions of those responding correctly in a mastery and nonmastery group. The mastery and nonmastery groups are established by a predetermined cut-off score. The samples varied with respect to the ability dimension and test score distribution.
The results indicated that the relationship of C-V", r_pbi, and the phi-coefficient depends on the ability dimension and the test score distribution. When the sample consists of individuals with a wide variety of abilities and the test scores are distributed symmetrically, the indices are highly correlated. Hsu found that a highly discriminating item in one sample may not be a highly discriminating item in another; therefore, he recommended that test items not be tried out in a group with a wide variety of abilities. Items selected on the basis of performance of this group may not be measuring the same kind of performance in a second more homogeneous group.

Ivens also investigated the C-V index (Ivens, 1970, 1972). He found that by choosing items with larger values of C-V for one test and lower values of C-V for a second test, there were marked differences in the quality of the tests. To measure the quality of the tests, Ivens considered reliability and validity. He used traditional reliability estimates as well as unique reliability and validity estimates. All statistics computed supported the conclusion that tests composed of items with higher C-V values were better tests. It should be pointed out that the unique reliability and validity estimates were somewhat related to C-V. For this reason, higher reliability and validity estimates for tests constructed from items with high C-V values would be expected.

The C-V index was again compared to other indices by Kosecoff and Klein (1974). They redefined C-V as ESI and compared this to their ISI, the phi-coefficient and the point-biserial. (ESI and ISI are defined in an earlier section of this chapter.) The results of this study showed that ESI was generally lower than ISI. The values of ISI tended to parallel the values of the point-biserial and phi-coefficient. Of course, the corrected version of ESI resulted in lower values.
After consideration of the data, Kosecoff and Klein determined that there had been too many masters at the pretest. To compensate for this, ESI and ISI were redefined. ESI was defined as (n2 - n1)/(n1 + n2) (Table 2.3 and Table 2.2 notation, respectively). They concluded from the results of the analysis with the redefined statistics that ISI is sensitive to instruction. The high proportion of prior masters caused the index in the first analysis to be artificially deflated. ESI was found to be an unsatisfactory statistic because the values tended to vary greatly. The values for ESI did correlate significantly with the phi-coefficient and point-biserial values, but the correlation coefficients were rather small, implying, perhaps, that ESI would not give the same judgment as traditional statistics. Almost all the research that has considered the C-V index (or the ESI) has produced this same result.

Interest in the C-V index remains high, as indicated in a recent comparative study conducted by Haladyna and Roid (1976). They compared various Rasch statistics, traditional statistics, the Bayesian indices proposed by Helmstadter (1974), and the C-V index for a total of 17 indices. The results of the study demonstrated a high degree of relationship among four item discrimination measures. These were the z-difference--a Rasch statistic which is an index of the difference of difficulties of pretest and posttest samples, a combined-samples point-biserial, the C-V index and a Bayesian index--the probability of having knowledge given that the student gets the item correct. This study provides further evidence that the C-V index may be the most appropriate item index for pretest-posttest situations.

Three comparative studies that did not include the C-V technique are Roudabush (1973), Helmstadter (1974), and Bernkopf (1976). Roudabush and Helmstadter compared their own unique indices to traditional statistics.
Unfortunately, neither study mentioned exactly which traditional statistics were being used. Roudabush concluded that his sensitivity index provided different information than the traditional statistics. Helmstadter, on the other hand, found that the "classical discrimination index [he defined it no further than this] comes closest to providing the same item assessment as would the Bayesian probability of making a correct decision . . ." (Helmstadter, 1974, p. 3). Haladyna and Roid (1976) confirmed Helmstadter's result in their study. On the basis of the analysis, Helmstadter also concluded that "items which are effective indicators that the examinee does know the material are not necessarily the same items which are effective indicators that the examinee does not know the material" (Helmstadter, 1974, p. 3).

Bernkopf compared the point-biserial coefficient using total test score as a criterion (r_t), the phi-coefficient (phi_e), and a second point-biserial coefficient using the total score on an essay test as a criterion (r_e). The dimensions of the fourfold table for the phi-coefficient were correct/incorrect for the item and above/below mastery on an independent criterion (the essay test). All three indices were significantly related. As could be expected, the correlations between phi_e and r_e were higher than the correlations between phi_e and r_t and between r_e and r_t.

Summary

The literature reviewed in this chapter has been divided into two categories. The first group of studies reviewed recommends possible approaches for criterion-referenced item analysis (e.g. Brennan, 1972; Brennan and Stolurow, 1971; Cox and Vargas, 1966; Crehan, 1974; Hambleton and Gorth, 1971; Hsu, 1971; Ivens, 1970; Kifer and Bramble, 1974; Kosecoff and Klein, 1974; Roudabush, 1973). The second group of studies compares a number of proposed techniques (e.g. Cox and Vargas, 1966; Crehan, 1974; Haladyna, 1974; Hambleton and Gorth, 1971; Hsu, 1971; Ivens, 1970; Kosecoff and Klein, 1974).
Review of the proposed techniques reveals that the C-V index or modifications of this index have been recommended more frequently than any other procedure as an appropriate item analysis technique for criterion-referenced tests. In addition, the majority of the comparative studies included the C-V index along with more traditional indices. The general conclusion is that tests constructed on the basis of the C-V index result in tests sensitive to instruction (Ivens, 1970, 1972; Ozenne, 1971). Another conclusion is that the C-V index results in a different judgment for a given item than traditional statistics (Cox and Vargas, 1966; Kosecoff and Klein, 1974). Only two studies included more than one new index in their comparisons (Crehan, 1974; Haladyna and Roid, 1976). The C-V index is significantly related to other new approaches--a Rasch statistic and an index recommended by Helmstadter (Haladyna and Roid, 1976)--and when used produces tests with higher validity (Crehan, 1974).

Two new approaches to criterion-referenced item analysis have not been researched: one, the R index, and two, the B-S procedure. The R index is a refinement of the C-V technique. It makes fewer assumptions and may be a better estimate of an item's sensitivity to instruction. The B-S procedure combines traditional methods with a set of rules to provide a guide for selecting items which are sensitive to instruction. For these reasons, the C-V index, the Roudabush sensitivity index (R) and the Brennan and Stolurow procedure (B-S) were selected for further investigation.

In the following chapter a theoretical basis for criterion-referenced testing in pretest-posttest situations is provided. It will be shown that the C-V index and R index can be explained in terms of a general model; and, as indicated above, it will be shown that the R index is a refinement of the C-V index which requires fewer assumptions.
CHAPTER III

THEORETICAL DISCUSSION

In this chapter, a theoretical model for the pretest-posttest situation is presented. Two item analysis techniques, R and C-V, which were described earlier, are explained in terms of the general model. The results of a given item in any test can be represented by the following diagram:

Table 3.1
Categories for a Given Item

                          ACTUAL
                   Does Not Know    Knows
           Fail         q11          q12
OBSERVED
           Pass         q21          q22

where q11, q21, q12 and q22 are conditional probabilities with q11 + q21 = 1 and q12 + q22 = 1. The probability that an individual who does not know the answer to a given item will answer the item incorrectly is denoted by q11. The probability that an individual who does not know the answer to the given item will answer the item correctly is denoted by q21. Similarly, q12 and q22 represent the probabilities that an individual who knows the answer will fail or pass the item, respectively. Now consider a pretest-posttest situation. This can be represented with three diagrams. Table 3.1 can be used to define the pretest results, and a similar table with different probabilities (Table 3.2 below) can represent the posttest. These probabilities are defined in the same manner as above.

Table 3.2
Categories for a Given Item

                      POSTTEST-ACTUAL
                   Does Not Know    Knows
           Fail         q11'         q12'
OBSERVED
           Pass         q21'         q22'

An additional 2 x 2 table (Table 3.3 below) defines the true proportions of the pretest-posttest situation.

Table 3.3
True Proportions for a Given Item

                          POSTTEST
                   Does Not Know    Knows
    Does Not Know       π1            π2
PRETEST
    Knows               π3            π4

In Table 3.3, π1 is the proportion of individuals who do not know the answer to a given item at both pretest and posttest. Similarly, π2 is the proportion of individuals who do not know the answer to a given item at pretest but learn it by the posttest. π3 is the proportion of individuals who know the answer at pretest but not at posttest; and π4 is the proportion who know the answer at both times. These proportions, π1, π2, π3, π4, sum to one.
These are true proportions. They are not the observed results of the pretest and posttest. The general model is then represented in matrix notation as

P = (Q ⊗ Q')π

where ⊗ symbolizes the Kronecker product, and

Q = (q11  q12)     Q' = (q11'  q12')     π = (π1)
    (q21  q22)          (q21'  q22')         (π2)
                                             (π3)
                                             (π4)

The pk's, described in Table 3.4, are the observed proportions given the probabilities qij and qij' and the true proportions πk.

Table 3.4
Observed Proportions for a Given Item

                    POSTTEST
                 Fail     Pass
         Fail     p1       p2
PRETEST
         Pass     p3       p4

Expanding the model,

p1 = q11q11'π1 + q11q12'π2 + q12q11'π3 + q12q12'π4
p2 = q11q21'π1 + q11q22'π2 + q12q21'π3 + q12q22'π4
p3 = q21q11'π1 + q21q12'π2 + q22q11'π3 + q22q12'π4
p4 = q21q21'π1 + q21q22'π2 + q22q21'π3 + q22q22'π4

This model completely describes the results of a pretest-posttest situation. For example, consider p1, the observed proportion of individuals who fail both the pretest and the posttest. Each of the actual proportions, π1, π2, π3, and π4, can contribute to the observed proportion in the model p1 = q11q11'π1 + q11q12'π2 + q12q11'π3 + q12q12'π4. If we consider π1, the proportion of individuals who do not know the answer at pretest or posttest, we can observe that some of the individuals in this category could have guessed correctly at either the pretest (q21) or the posttest (q21') or at both the pretest and the posttest. These individuals would not contribute to the observed proportion p1, since they would have passed the item at one or both times. However, we can include q11 x q11' x π1, which is the proportion of individuals who really don't know and didn't learn and failed to guess at either administration. Individuals who did learn the correct response from pretest to posttest can also contribute to p1. Those contributing would have failed to guess the correct response at pretest (q11) and would have answered incorrectly at the posttest (q12') even though they knew the correct response.
Therefore, q11 x q12' x π2 adds to the observed proportion p1. In addition, individuals who do know the answer at the pretest but don't know the answer at posttest (π3) contribute to p1. Ordinarily, we would not expect π3 to be a very large proportion. Individuals who can be classified in this manner could have failed to respond correctly at the pretest (q12) even though they knew the answer, and could have failed to guess the correct answer at the posttest (q11'). Finally, individuals knowing the answer at both pretest and posttest could have answered incorrectly at both administrations (q12 x q12' x π4). Therefore, we can see, intuitively, that p1 is the sum of parts of each of the proportions π1, π2, π3, and π4. The observed proportions p2, p3, and p4 can be explained in a similar manner. It should be noted that π1, π2, π3, π4 are separated among each of the observed proportions. If, for example, we add all the parts of π1, which are distributed over p1, p2, p3 and p4, then

q11q11'π1 + q11q21'π1 + q21q11'π1 + q21q21'π1

should equal π1. This can easily be shown by factoring this expression:

q11(q11' + q21')π1 + q21(q11' + q21')π1 = (q11 + q21)(q11' + q21')π1 = π1

since q11 + q21 = 1 and q11' + q21' = 1. It can also be shown that all the parts of π2, π3, and π4, which are distributed over the observed proportions p1, p2, p3 and p4, do sum to π2, π3, and π4, respectively. There are 12 parameters in this model. If these parameters could be estimated, useful information would be available for both the item and the instruction. For example, if π2, the proportion of examinees who learn the answer, could be estimated, then an evaluation of the quality of the instruction could be made. The estimate of this proportion would also indicate the item's "sensitivity to instruction." Estimates of the other parameters would also provide useful information.
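As an illustration (not part of the original study), the general model can be checked numerically. The sketch below uses hypothetical parameter values to build P = (Q ⊗ Q')π and to verify that the four pieces of π1, distributed over p1 through p4, sum back to π1.

```python
import numpy as np

# Misclassification matrices: rows = (fail, pass), columns = (does not know, knows).
# Each column sums to 1: q11 + q21 = 1 and q12 + q22 = 1.
Q_pre = np.array([[0.75, 0.0],    # q11, q12
                  [0.25, 1.0]])   # q21, q22 (q21 = .25: guessing on a 4-choice item)
Q_post = Q_pre.copy()             # here we assume Q = Q'

# Hypothetical true proportions pi = (pi1, pi2, pi3, pi4),
# ordered (NK,NK), (NK,K), (K,NK), (K,K).
pi = np.array([0.2, 0.5, 0.1, 0.2])

# Observed proportions, ordered fail-fail, fail-pass, pass-fail, pass-pass.
p = np.kron(Q_pre, Q_post) @ pi

# The four pieces of pi1 scattered over p1..p4 sum back to pi1,
# because (q11 + q21)(q11' + q21') = 1.
q11, q21 = Q_pre[0, 0], Q_pre[1, 0]
q11p, q21p = Q_post[0, 0], Q_post[1, 0]
pieces = (q11 * q11p + q11 * q21p + q21 * q11p + q21 * q21p) * pi[0]
```

With these values, p sums to one and pieces equals pi[0] exactly, as the factoring argument predicts.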
For an objective item, estimates of q11, q21, q11' and q21' can be made after consideration of the number of response choices. For example, a four-choice objective item would ordinarily lead to an estimate of .25 for q21 or q21', because an individual who does not know the answer has one chance out of four of choosing the correct response. It is also generally assumed that q22 and q22' equal 1.0, because it is very unlikely that an individual who knows the answer will respond incorrectly. However, this may not be the case for a poorly-written item. For example, a distractor for an item may also be a correct response; or, the correct alternative could be worded so ambiguously that even the individual who knows the answer will not choose it. There is also the possibility that an individual will make a clerical error. Estimates of the qij's and qij''s do provide information about the quality of an item. A bad item would be one where q21 or q21' is high; that is, where the probability of guessing is high. A good item would be one where q22 and q22' approach 1.0. Suppose the parameters are considered in a slightly different manner. One could perhaps use the concepts of reliability and validity to describe these parameters. The πk's represent true values. Estimates of indices defined by the πk's are estimates of the validity of the item. For example, an estimate of π2 indicates how many, or what proportion, of the individuals not knowing at the pretest know at the posttest. The higher this value, or the closer this number is to 1.0, the better the item is measuring what it is supposed to measure. In other words, indices based on the πk's are indicators of validity. In addition, some of the qij's and qij''s can be considered to be estimates of reliability. For example, if q11, q22, q11', and q22' are close to 1.0, then the item is a perfect indicator of knowledge or no knowledge.
As these probabilities decrease, the item is a less reliable indicator of knowledge or no knowledge. Assumptions can be made to simplify this conceptualization. In the general model Q does not necessarily equal Q'; different probabilities are defined for the pretest and posttest. It is possible, however, that for any given item these probabilities would be identical; that is, that neither time nor instruction would change these item parameters. One could then assume that Q = Q'. Roudabush simplifies the situation even further. First, he assumes that π3 = 0. This implies that there is no forgetting; an individual who knows an item at pretest will know it at posttest. Second, Roudabush assumes q22 = q22' = 1.0, ignoring the possibility that someone who knows the answer to an item could fail it. Under these assumptions the model reduces to:

(p1)   [(q11  0)   (q11  0)] (π1)
(p2) = [(q21  1) ⊗ (q21  1)] (π2)
(p3)                         (0 )
(p4)                         (π4)

But q11 + q21 = 1 and π1 + π2 + π4 = 1, so

p1 = q11² π1
p2 = q11(1 - q11)π1 + q11π2
p3 = (1 - q11)q11 π1
p4 = (1 - q11)² π1 + (1 - q11)π2 + (1 - π1 - π2)

These four equations correspond to equations (1) through (4) presented in Appendix I. The sensitivity index is defined as R = π2/(π1 + π2). This is a reasonable sensitivity index: it is the proportion of individuals not knowing the answer at pretest who learn it by the posttest. Roudabush solves the four equations above using the assumption that the expected observed proportions p1, p2, p3, p4 equal the sample proportions f1/N, f2/N, f3/N, f4/N respectively, and obtains solutions for π1 and π2 in terms of f1, f2, f3, and f4. The f1, f2, f3 and f4 equal the observed numbers of individuals in each category and N is the total number of individuals. These solutions are then substituted in the definition of R to obtain an estimate of R, (f2 - f3)/(f1 + f2). Unfortunately, the general model cannot be solved heuristically, since there are seven parameters (unknown) and only three pieces of information.
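Under Roudabush's assumptions, the resulting estimate of R uses only the observed cell frequencies. A minimal sketch (the counts below are hypothetical):

```python
def roudabush_r(f1, f2, f3, f4):
    """Estimate of R = pi2 / (pi1 + pi2) from observed counts:
    f1 fail-fail, f2 fail-pass, f3 pass-fail, f4 pass-pass.
    Note that f4 does not enter the estimate."""
    return (f2 - f3) / (f1 + f2)

# e.g. 10 fail-fail, 25 fail-pass, 5 pass-fail, 10 pass-pass
r_hat = roudabush_r(10, 25, 5, 10)   # (25 - 5) / (10 + 25)
```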
Therefore, we cannot estimate R without Roudabush's assumptions. We can, however, compare the true R and the estimated R for simulated data. A second index, suggested by Cox and Vargas, can be considered in the same theoretical framework. Cox and Vargas call their index the Pretest-Posttest Difference Index (C-V). This is defined as the percentage of students who pass the item at posttest minus the percentage of students who pass the item at pretest. In terms of observed results, this is

(f2 + f4)/N - (f3 + f4)/N, or (f2 - f3)/N.

The C-V method can be represented as a special case of the general model by assuming that Q = Q', q22 = q22' = 1.0, and q21 = q21' = 0. Then,

(p1)   [(1  0)   (1  0)] (π1)
(p2) = [(0  1) ⊗ (0  1)] (π2)     or     p1 = π1, p2 = π2, p3 = π3, p4 = π4.
(p3)                     (π3)
(p4)                     (π4)

The C-V index can then be defined, using the notation of Table 3.3, as (π2 + π4) - (π3 + π4), or C-V (true) = π2 - π3. This is identical to the definition of C-V given by Cox and Vargas except that they use the observed proportions as estimates of the actual proportions. This index indicates the sensitivity of the item to instruction. The closer C-V (true) is to 1.0, the greater the sensitivity; and the closer it is to 0.0, the less the sensitivity. If the equations above are solved heuristically for the true proportions, they are found to be equal to the observed proportions. In other words, under these assumptions, the observed proportions are equal to the true proportions. These assumptions, however, are extremely restrictive; they do not even allow for guessing. In fact, the C-V approach assumes no misclassification, i.e., no error. C-V is an estimate of C-V (true). Under certain restrictive assumptions C-V would equal C-V (true). We can compare C-V (true) with C-V for simulated data in order to observe the impact of less restrictive assumptions on C-V.

Summary

In this chapter, a theoretical framework is proposed for criterion-referenced testing in pretest-posttest situations.
This framework suggests that 12 parameters completely describe the pretest-posttest situation. In addition, the Roudabush (R) model and the Cox and Vargas (C-V) technique are explained in terms of the general model. The design of the research is discussed in the following chapter. The research is considered in two parts. In the first part of the chapter, the design of the simulation study is presented. The simulation study uses the theoretical framework proposed in this chapter to consider the impact of various assumptions on the C-V and the R indices. The design of the comparison of the C-V, R and B-S techniques with actual data is presented in the second part of the next chapter.

CHAPTER IV

DESIGN

Part A: Design of the Simulation

The purpose of this part of the study is to answer three of the research questions posed in Chapter I. The questions that this part of the study will be directed to are as follows:

1. Do the C-V and R techniques adequately estimate the true values of the item parameters?
2. Does one technique estimate the true values better than the other?
3. Do the C-V and R techniques estimate some true values of the item parameters better than others?

One approach to answering these questions would be to generate hypothetical data with various item values. In other words, one approach would be to design and implement a simulation. Recall from the previous chapter that the theoretical model is represented by P = (Q ⊗ Q')π, where the pk are the observed proportions of individuals corresponding to the true proportions πk (see Tables 4.1 and 4.2 below), ⊗ symbolizes the Kronecker product, and

Q = (q11  q12)     and     Q' = (q11'  q12')
    (q21  q22)                  (q21'  q22')

The qij's and qij''s represent probabilities and are defined according to Tables 4.3 and 4.4.
Table 4.1
Categories for a Given Item--Observed Proportions

                    Posttest
                 Fail     Pass
         Fail     p1       p2
Pretest
         Pass     p3       p4

Table 4.2
Categories for a Given Item--True Proportions

                        Posttest
                 Does Not Know    Knows
   Does Not Know      π1            π2
Pretest
   Knows              π3            π4

Table 4.3
Pretest--Actual

                 Does Not Know    Knows
         Fail         q11          q12
OBSERVED
         Pass         q21          q22

Table 4.4
Posttest--Actual

                 Does Not Know    Knows
         Fail         q11'         q12'
OBSERVED
         Pass         q21'         q22'

When the model is expanded, P can be represented by the following:

p1 = q11q11'π1 + q11q12'π2 + q12q11'π3 + q12q12'π4
p2 = q11q21'π1 + q11q22'π2 + q12q21'π3 + q12q22'π4
p3 = q21q11'π1 + q21q12'π2 + q22q11'π3 + q22q12'π4
p4 = q21q21'π1 + q21q22'π2 + q22q21'π3 + q22q22'π4

The R procedure defines the sensitivity index to be π2/(π1 + π2), but for computation uses the sample proportions. Therefore, R is computed by calculating (p2 - p3)/(p1 + p2), where the pk are the sample proportions. In addition, the C-V index is defined as π2 - π3, but is again computed using sample proportions and is p2 - p3. If numerical values of πk, qij and qij' are chosen, then the expected observed pk can be computed. Random numbers can be generated and then, based on the values of pk, the number of cases in categories 1, 2, 3, and 4 can be determined. (Categories 1, 2, 3, and 4 follow the same pattern as the notation for the πk and pk.) For example, suppose p1 = .1125, p2 = .5075, p3 = .0375 and p4 = .3425. Suppose also that a random number is generated. This random number is from a uniform distribution and is between 0.0 and 1.0. If it is less than .1125, then the number of cases in category 1 would increase by 1. If the random number is less than .6200 (.1125 + .5075) but greater than or equal to .1125, then the number of cases in category 2 would increase by 1. If the number is less than .6575 (.6200 + .0375) but greater than or equal to .6200, then the number of cases in category 3 would increase by 1.
And finally, if the number is less than 1.00 but greater than or equal to .6575, then the number of cases in category 4 would increase by 1. Any random number generated would be counted in one and only one category. In this manner, simulated frequencies for the fail-fail group (category 1), fail-pass group (category 2), pass-fail group (category 3), and pass-pass group (category 4) are obtained. For this simulation, sample sizes of 50 and 200 will be considered. The sample size of 50 was selected because in most actual situations, 50 is the maximum number of individuals available. Some parameter values will be repeated in the simulation with a sample size of 200 in order to consider the stability of the indices. For each set of parameter values 1000 samples will be generated. For each sample, the R and C-V indices will be computed. Of course, the true values remain the same for all 1000 cases. A number of descriptive statistics will be computed based on the 1000 samples. These will include the means and the variances for the R and C-V indices and the largest and smallest values for each. In addition, skewness and kurtosis will be computed for each. The simulation is designed to consider a range of parameter values in order to see how close the estimates of the R and C-V indices are to the actual values. Consider Tables 4.2, 4.3 and 4.4. The probability that an individual knows the answer yet fails to answer the item correctly, q12 or q12', is probably quite small. Since q12 + q22 = 1 and q12' + q22' = 1, this assumption would imply that q22 or q22' is large. In addition, the probability that an individual can guess the right answer (q21 or q21') can be estimated by the number of options offered in the item. For example, a good estimate of q21 for a true-false item would be .50. For a multiple-choice item with four options a good estimate would be .25.
The probability (q21') that the correct answer could be guessed given some instruction may stay the same as q21, or it may decrease or increase. All possibilities were considered in the selection of the values of q21'. Table 4.5 lists the 21 different sets of parameter values that were selected for the simulation. Sixteen sets designate the probability of guessing (q21) to be .25 (four-option multiple-choice item). Eight of these retain this estimate for the posttest (q21' = .25). Seven of these sets increase the probability of guessing for the posttest to .50 (q21' = .50). This makes the logical assumption that instruction may improve the individual's chances of guessing the correct answer by eliminating two of the possible options. For one set, the value of q21' is set equal to zero, implying that after instruction the individual has no chance of guessing the correct response. One set designates the probability of guessing (q21) to be .50 (true-false item). This estimate is retained for the posttest (q21' = .50). The remaining four sets satisfy the assumptions of the C-V index. These assumptions include assuming the probability of guessing is 0.0 for pretest and posttest (q21 = q21' = 0.0) and assuming the probability of getting the item right when knowing the answer is 1.0 for pretest and posttest (q22 = q22' = 1.0).

[Table 4.5, Selected Parameter Values for the Simulation, lists q11, q12, q21, q22, q11', q12', q21', q22', π1, π2, π3, π4 and N for each of the 21 parameter sets; the tabled values are illegible in the source scan.]

Based on the assumption that the probability that an individual who knows the answer yet fails to answer the item correctly (q12 or q12') is quite small, q12 was designated to be 0.0 eleven times and .10 the remaining ten times. The value for q12' (the posttest probability) was set at 0.0 for all but four parameter sets. For these, the value of q12' remained equal to q12, which had been set at .10. The values of π1, π2, π3, π4 were selected to represent reasonable situations. Two basic sets of values were chosen with π1, π2, π3, π4 equal to .3, .5, 0.0, .2, and .2, .5, .1, .2 respectively. Four sets of values were selected to consider the impact of extreme values on the indices. These sets (6, 7, 8, 9 of Table 4.5) considered the possibility that the majority of individuals would fail the pretest and pass the posttest (8 and 9). As previously stated, for each set of parameter values 1000 samples will be generated. For these 1000 samples, the C-V and R indices will be computed. Means, variances, highest and lowest values, skewness and kurtosis will also be determined for the C-V and R values. In an attempt to answer the research questions, the data will be considered in a number of ways.
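One replication of the generation scheme described above (assigning each uniform random draw to a category by cumulative proportions, then computing C-V and R for each sample) can be sketched as follows. The pk values are the worked example's; the seed and the summary statistics shown are arbitrary choices, not the study's.

```python
import random
from statistics import mean, variance

def draw_counts(p, n, rng):
    """Assign n uniform draws to categories 1-4 by cumulative proportions."""
    cum = [sum(p[:k + 1]) for k in range(4)]
    cum[-1] = 1.0  # guard against floating-point round-off
    counts = [0, 0, 0, 0]
    for _ in range(n):
        u = rng.random()
        for k in range(4):
            if u < cum[k]:
                counts[k] += 1
                break
    return counts

def cv_and_r(f):
    """C-V = (f2 - f3)/N and R = (f2 - f3)/(f1 + f2) from the four counts."""
    f1, f2, f3, f4 = f
    n = f1 + f2 + f3 + f4
    return (f2 - f3) / n, (f2 - f3) / (f1 + f2)

rng = random.Random(12345)
p = [0.1125, 0.5075, 0.0375, 0.3425]   # expected observed proportions
samples = [cv_and_r(draw_counts(p, 50, rng)) for _ in range(1000)]
cv_values = [s[0] for s in samples]
r_values = [s[1] for s in samples]
cv_mean, cv_var = mean(cv_values), variance(cv_values)
```

Skewness, kurtosis, and the largest and smallest values would be tabulated from cv_values and r_values in the same way; over 1000 samples the mean C-V should sit close to p2 - p3.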
All descriptive statistics--means, variances, skewness and kurtosis--will be considered for each of the 21 parameter sets for both indices. Means of each of the C-V and R values will be compared to their true values, and variances of these indices will also be considered, in an attempt to answer the question of adequacy. For any given parameter set, a mean value close to the true value, in conjunction with low variance, would imply some degree of adequacy. The second question, "Does one technique estimate the true values better than the other?", will also be answered by consideration of the data. One approach will be to consider how close the values are to the true value for each set of parameters for each technique. The variance, skewness and kurtosis values will also be considered. A comparison of the correlation coefficients between the true values and the means for each technique might show whether or not one technique estimates the true values better than the other technique. However, some caution will be used in the interpretation of the correlation coefficients and the comparison. The final question will also be handled descriptively. The actual data will be considered and an attempt will be made to locate values that are not estimated as well as others. All questions will be considered descriptively. Each parameter set, with results, will be discussed with respect to the three basic research questions. Summary statistics will be presented in order to facilitate the understanding of the techniques and the conclusions reached about them.

Part B: Design of the Comparison Study With Actual Data

The purpose of this part of the study is to determine the comparability of three item analysis procedures (C-V, R and B-S). Data were obtained from the Michigan Middle Cities Project. One hundred twenty-eight items were chosen from two subject areas, Reading and Mathematics. Two levels were considered--Middle and Upper.
(These levels generally refer to grades three and four, and five and six respectively.) Each item was written for a particular objective. Each objective was tested by four items on a pretest, four different items on a posttest and all eight items on a retention test. The retention test was given approximately 40 days after the posttest. There were also two treatment groups where item data were collected. In one treatment, teachers were assigned objectives (treatment A). In the other treatment, teachers were allowed to choose objectives (treatment B). Sixteen objectives were chosen to complete the design, which is represented in Diagram 4.1.

Subject       Level    Treatment   Objective Number    N    Items
Reading       Middle       A             142           31    1-8
                                         116           59    9-16
                           B             112           21    17-24
                                         120           20    25-32
              Upper        A             145           66    33-40
                                         199           57    41-48
                           B             182           30    49-56
                                         166           18    57-64
Mathematics   Middle       A             108           52    65-72
                                         111           43    73-80
                           B             107           42    81-88
                                         109           37    89-96
              Upper        A             198           22    97-104
                                         176           46    105-112
                           B             187           16    113-120
                                         167           17    121-128

Diagram 4.1 Design of Administration of Items

The major question to be considered is "Do the C-V, R and B-S item analysis procedures provide comparable results?" The analysis of this question will primarily be descriptive. To determine the comparability of the indices C-V and R, a Pearson product moment correlation will be computed between the C-V and R values. The B-S procedure (see Appendix II) does not allow for a single index. The B-S procedure involves the computation of four error rates. TER (theoretical error rate) is defined as (J-1)/J, where J is the number of possible answers to an item--or it is simply the expected proportion of students answering a pretest item incorrectly. The Base Error Rate (BER) is the observed proportion of students answering a pretest item incorrectly. The Posttest Error Rate (PER) is the observed proportion of students answering a posttest item incorrectly. In this situation the data used as the posttest data will be from the retention test.
The Instructional Error Rate (IER) is the proportion of students answering incorrectly on a terminal test item which is administered to students who have been exposed to instruction. This last error rate is not included in any of the decision rules related to item revision. In addition, two discrimination indices are computed, the Base Discrimination Index (BDI) and the Posttest Discrimination Index (PDI). These are computed using the total score on the appropriate test as the criterion. For BDI, the criterion will be the pretest, and for PDI, the criterion will be the posttest. Again, in this situation the data used will be from the retention test. Two separate statistics will be used to compute the discrimination indices, the phi-coefficient and the B index. The B index equals B/(B + D) - A/(A + C), where A, B, C and D are defined in Table 4.6.

Table 4.6
B-Index

                      Total Test Score
                   Nonmastery    Mastery
          1            A            B
Item
Score     0            C            D

Mastery for the items on the pretest and retention test data was set at three out of four items. These five pieces of information, TER, BER, PER, BDI and PDI, are then used in conjunction with some rules to determine the adequacy of the item. Appendix II provides a description of these rules. Since the B-S procedure does include several statistics, the comparison of the three indices will be done in the following manner. First, the individual statistics, TER, BER, PER, BDI, and PDI, which are necessary for the B-S procedure, will be computed. The appropriate rules will be applied and a decision will be made about the quality of the item; that is, does the item need to be revised? Each item can then be assigned a "0" or a "1" depending on the outcome of the application of the rule. A "0" would indicate non-acceptance or revision required; a "1" would indicate acceptance or no revision required. There is a limitation in this procedure. The B-S process requires that the evaluator set various cut-off points.
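The individual B-S statistics defined above are straightforward to compute; a minimal sketch with hypothetical counts follows (the decision rules themselves are in Appendix II and are not reproduced here):

```python
def theoretical_error_rate(j):
    """TER = (J - 1) / J for an item with J possible answers."""
    return (j - 1) / j

def b_index(a, b, c, d):
    """B index from Table 4.6 counts: a = pass item/nonmastery,
    b = pass item/mastery, c = fail item/nonmastery, d = fail item/mastery.
    B index = B/(B + D) - A/(A + C)."""
    return b / (b + d) - a / (a + c)

ter = theoretical_error_rate(4)   # four-choice item: 0.75
bdi = b_index(2, 18, 8, 2)        # 18/20 - 2/10 = 0.7
```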
For example, the evaluator must decide an appropriate cut-off point between a high and low error rate. Comparison of B-S with the R and C-V indices will be influenced by the selected cut-off values. To minimize the effect of this limitation, various cut-off values will be set and several comparisons will be made. Point-biserial correlations will be computed between the B-S value of "0" or "1" and the C-V value and the R value. There is also a limitation with the data used in the computation of the various indices. Retention data are substituted for the posttest data generally used in the computation of C-V and R indices. There may be some additional forgetting not normally found in a more immediately given posttest. However, since both R and C-V are computed using the same data, the effect on the comparison of the two should be minimal. The observed frequency f3 (those who forget from pretest to retention test) might appear to influence C-V more than R, since this frequency is included in the denominator of C-V, (f2 - f3)/(f1 + f2 + f3 + f4), as well as in the numerator, while R, (f2 - f3)/(f1 + f2), only includes f3 in the numerator. However, the only difference in the two indices is the addition of f3 and f4 in the denominator of C-V, and if f3 gets larger due to the longer time frame, then f1, f2 and/or f4 would get smaller. Since f1 + f2 are the same in both, and the total f1 + f2 + f3 + f4 is N, a constant, the effect on both indices should make little difference in the comparison of the two. This same argument holds for the impact of a decrease or increase in f4 on the comparison of the two indices. A similar argument can be made for the individual statistics of the B-S procedure. Two additional questions that are of interest are as follows: Are the three procedures more comparable for items in Mathematics than for items in Reading? And does the comparability of the three procedures depend on the grade level?
It is anticipated that the procedures will be more comparable for Mathematics than for Reading. Items for Mathematics are constructed more easily than Reading items because the subject area is more structured. In addition, the items are generally of higher quality. The Reading items may be more ambiguous than the Mathematics items. It is also anticipated that the correlations among the indices will be almost identical for items given in the Upper grades and for items given in the Middle grades. There is no reason to expect the correlations to depend on grade level. Each of the two questions will be analyzed in two steps. A comparison of the C-V and R values will be considered separately; then a comparison of the B-S with the C-V and R will be made. The first question can be analyzed in the following manner. First, a Pearson product moment correlation will be computed between the C-V and R values for items in Mathematics and separately for items in Reading. Then a comparison can be made between these two correlation coefficients. The null hypothesis can be expressed as H0: ρR = ρM, with the alternative hypothesis being H1: ρR ≠ ρM, where ρM = the population correlation of C-V and R for Mathematics and ρR = the population correlation of C-V and R for Reading. A Fisher's z-transformation will be made for each of the sample correlations and a z-test will be applied. Secondly, the point-biserial correlation will be computed between the B-S and C-V values for Reading and Mathematics and between the B-S and R values for Reading and Mathematics. The second question will be considered in the same manner, only correlations will be computed between the various indices for the two grade levels separately. A final question which may be considered is "Are the three procedures more comparable for items given in treatment A than for items given in treatment B?" The analysis of this question is similar to the analyses proposed above.
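The comparison of two sample correlations via Fisher's z-transformation, as described above, can be sketched as follows (the correlation values and sample sizes are hypothetical):

```python
import math

def fisher_z_test(r1, n1, r2, n2):
    """Two-sided z statistic for H0: rho1 = rho2, using Fisher's
    transformation z = atanh(r) with standard error
    sqrt(1/(n1 - 3) + 1/(n2 - 3))."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / se

# e.g. r = .80 over 64 Mathematics items versus r = .60 over 64 Reading items
z = fisher_z_test(0.80, 64, 0.60, 64)
```

The resulting z would be referred to the standard normal distribution for the two-sided test of H0: ρR = ρM.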
The prediction for the correlations among the indices is that they will be higher for treatment B than for treatment A. This is due primarily to the fact that teachers were assigned objectives in treatment A. Instruction may not have been needed for these specific objectives, or may not have been given adequately, so the item response data for treatment A may be unstable. Items from treatment B should more closely fit the ideal criterion-referenced situation, i.e., no knowledge on pretest and knowledge on posttest.

Summary

There are two parts to this study. Each part is designed to answer different questions. The first part, the simulation, will attempt to determine if the C-V and R indices adequately estimate the true values, if one technique estimates the true values better than the other, and if for some parameter sets the C-V and R indices are better estimators of the true values. Data will be analyzed descriptively for the 21 sets of parameter values used in the simulation. Additional questions, such as the stability of estimates for different sample sizes, will also be considered in the analysis of the data. The second part of the study is designed to determine comparability of the three item analysis procedures, R, C-V, and B-S. C-V and R values will be computed for 128 items, and the B-S values will be computed on 64 of the 128 items. The relationships among the indices will be determined using correlation coefficients. Additional questions pertaining to the comparability of the indices with respect to subject matter, grade and treatment also will be considered.

CHAPTER V

RESULTS OF THE SIMULATION

The simulation was designed to answer three questions:

1. Do the C-V and R techniques adequately estimate the true values of the item parameters?
2. Does one technique estimate the true values better than the other?
3. Do the C-V and R techniques estimate some true values of the item parameters better than others?
The results of the simulation for the 21 sets of parameter values (see Table 4.5) are presented in Table 5.1.

[Table 5.1, Descriptive Statistics for Each Parameter Set, giving the true proportions, true C-V and true R values and the mean, absolute deviation, variance, kurtosis, skewness and range of the C-V and R estimates for each of the 21 parameter sets, is not legible in the original.]

The C-V Index: Adequacy and Stability

Assumptions Met

Consider the statistics of parameter sets 3, 10, 16 and 18 (see Table 5.2). For these parameter sets, the assumptions for the C-V index are met. Recall that the assumptions for C-V include no guessing (q21 = q21' = 0.0) and that an individual who knows the answer will not fail to answer correctly (q22 = q22' = 1.0). (The parameter sets 16 and 18 are identical to 3 and 10 respectively except N = 200.)

[Table 5.2, Parameter Sets Where Assumptions for C-V Are Met, is not legible in the original.]

Comparison--Assumptions Met Versus Assumptions Not Met

The average absolute deviation from the true C-V value for the mean C-V for the sets where the assumptions are met is .0018.
In comparison, for the remaining parameter sets, where the assumptions for the C-V index are not met, the average absolute deviation from the true C-V for the mean C-V is .1096, a considerable difference. The variances for the parameter sets where the assumptions are met range from .0013 to .0088, but the variances for the sets where the assumptions are not met range from .0016 to .0094. Ten of these 17 have variances equal to or greater than .0070. The variances are lowest for those parameter sets (15 through 21) which have sample sizes of 200 (.0013 to .0023).

The kurtosis values for the distributions of the C-V index range from -.2957 to .1005. Only four values are positive; two of these are for parameter sets where the C-V assumptions are met. Since the kurtosis values are not very large or very far from zero, it seems reasonable to describe the distributions as mesokurtic. The skewness values range from -.2693 to .0938. Fourteen values are negative. The skewness values are also not very large or very far from zero, so the skewness for any parameter set is slight. If the skewness and kurtosis values are considered together, then the distributions for all the parameter sets can probably be described as normal.

A comparison of the averages of the absolute deviations from the kurtosis value of zero for parameter sets with N = 50 and for parameter sets with N = 200 reveals that the latter average is slightly larger (.11 for N = 50; .12 for N = 200). A similar comparison of the averages of the absolute deviations from the skewness value of zero for parameter sets with N = 50 and for parameter sets with N = 200 reveals that the latter are less skewed (.11 for N = 50; .04 for N = 200). The values, however, for N = 50 and N = 200 do not differ enough for one to infer that the greater sample size provides a more normal distribution.
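The shape statistics discussed here can be computed directly from the 1000 estimates in a parameter set. A minimal moment-based sketch (function names ours; kurtosis is expressed as excess kurtosis, so 0 corresponds to a mesokurtic, normal-like shape, positive values to leptokurtic, negative to platykurtic):

```python
def skewness(xs):
    """Sample skewness: the third standardized moment."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(xs):
    """Excess kurtosis: fourth standardized moment minus 3, so a
    normal distribution scores 0."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3.0
```

Applied to the 1000 C-V (or R) estimates for a parameter set, values of both statistics near zero support describing the distribution as approximately normal.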
Comparison of the average ranges for parameter sets with N = 50 and N = 200 (.54 for N = 50 and .29 for N = 200) does demonstrate that the C-V estimates are more stable with larger sample sizes. For those parameter sets that do not meet the C-V assumptions, the average range is .47 (N = 17). For those parameter sets that do meet the assumptions, the average range is .41 (N = 4). There appear to be slight differences in the averages when sample sizes are also considered. See Table 5.3 below.

Table 5.3 Average Ranges for the C-V Estimates

                            All          Sample Size = 50   Sample Size = 200
Meet Assumptions           .41 (N=4)     .53 (N=2)          .28 (N=2)
Does Not Meet Assumptions  .47 (N=17)    .54 (N=12)         .29 (N=5)
All                        .46 (N=21)    .54 (N=14)         .29 (N=7)

Final evidence of the adequacy is the correlation between the true C-V value and the mean C-V value. This correlation is .800 (N = 21, p < .001). From the evaluation of the other statistics (ranges, kurtosis and skewness values, and variances), in addition to the correlation cited above, one can infer that the C-V technique provides reasonable estimates of the true C-V value and that these estimates are distributed normally. However, the technique does provide a more stable estimate for larger sample sizes (N = 200).

The R Index: Adequacy and Stability

Assumptions Met

Consider the statistics of parameter sets 2, 3, 4, 7, 9, 15 and 16 (see Table 5.4). For these parameter sets, the assumptions for the R index are met. Recall that the assumptions for the R index are that guessing is the same for the pretest and the posttest (q21 = q21'), an individual who knows the answer will not fail to answer correctly (q22 = q22' = 1.0), and an individual who knows the answer on the pretest does not forget it on the posttest (π3 = 0). (The parameter sets 15 and 16 are identical to 2 and 3 respectively except N = 200.)

Comparison--Assumptions Met Versus Assumptions Not Met

The average absolute deviation from the true R for the mean
R for the sets where the assumptions are met is .0043. In comparison, for the remaining parameter sets where the assumptions of the R index are not met, the average absolute deviation from the true R for the mean R is .1426.

[Table 5.4, Parameter Sets Where Assumptions for R Are Met, is not legible in the original.]

The variances for the R values where the assumptions are met range from .0015 to .0221, but the variances for the remaining R values only range from .0026 to .0194. The reverse might have been expected. It would seem more likely for the R values to be more stable when the assumptions of the index are met. This does not seem to be the case, although the differences in the ranges of the variances are slight.

Other evidence of the stability or lack of stability of the estimates of the R index can be obtained by consideration of other distributional statistics, such as skewness, kurtosis and range. There are 15 parameter sets in all where the kurtosis is positive. A positive value implies that the curve is leptokurtic (peaked). Two of the positive kurtosis values are near zero (parameter set #12, K = .0600, and parameter set #18, K = .0584). For these two parameter sets the distributions can probably be described as mesokurtic. The remaining six parameter sets have negative kurtosis values; three of these are near zero (parameter set #7, K = -.0186; parameter set #15, K = -.0189; parameter set #10, K = -.0871).
A negative kurtosis value generally indicates that the curve is platykurtic (flat); however, the three curves whose values are near zero could be considered mesokurtic. The largest value is 1.7626 for parameter set #4. This is one set where the assumptions of the R index are met. Ideally, the 1000 R values should be concentrated in a narrow range about the true value.

There are 19 parameter sets in all where the distribution is negatively skewed. Only two parameter sets have positively skewed distributions. Parameter set #16 has a skewness value close to zero (Sk = .0242), which might indicate a non-skewed distribution. When the kurtosis value is also considered (K = -.4356), it appears that the distribution is slightly flat. However, the kurtosis value is not very large, so the interpretation of the two statistics could be that the distribution of R values for parameter set #16 is fairly normal. Parameter set #16 has a sample size of 200. Parameter set #15, also with a sample size of 200, has a small negative kurtosis value and a small negative skewness value. Again one might infer that the distribution is fairly normal. Perhaps, for parameter sets that meet the assumptions of the R index and have sample sizes of 200, the R values are distributed more normally. If the other five parameter sets with N = 200 (17, 18, 19, 20 and 21) are also considered, the skewness values are greater than the values for parameter sets #15 and #16. However, the skewness value for parameter set #21 (Sk = -.1492) is not very different from the value for #15 (Sk = -.1462). Also the kurtosis value is fairly small (K = -.1075). The assumptions for the R index are almost met in #21, except that π3 does not equal zero. The kurtosis values for these five sets are all small, with three positive values and two slightly negative. It is interesting to note that the highest kurtosis value (in absolute value) of the seven sets with N = 200 is that of set #16.
A comparison of #15 and #16 with the remaining five parameter sets with N = 200 seems to show that if the assumptions of R are met (or nearly met) the distribution is more nearly normal.

If the absolute deviations from the kurtosis value of zero are averaged for the parameter sets with N = 50 and for the parameter sets with N = 200, and these two values (.46 and .17, respectively) are compared, then further evidence is obtained that the distributions of R values for larger sample sizes are more nearly normal. A similar consideration of the absolute deviations from the skewness value of zero reveals that the average for parameter sets with N = 50 (.47) is larger than the average for parameter sets with N = 200 (.22). It seems then that for any given parameter set, as the sample size increases the distribution of R values approaches a normal distribution.

Now consider the ranges of the R values for the 21 parameter sets. For parameter sets with sample sizes of 50 (N = 14), the average range is .72. For parameter sets with sample sizes of 200 (N = 7), the average range is .36. The ranges, then, were decreased on the average by one-half when the sample sizes were increased. For those parameter sets that do not meet the assumptions and with sample sizes of 50 (N = 9), the average range is .74. For parameter sets with sample sizes of 200 (N = 5), the average range is .39. For those parameter sets that do meet the assumptions, the average range is .67 for sample sizes of 50 (N = 5) and .26 for sample sizes of 200 (N = 2). There is some reduction in the ranges when the assumptions are met; however, just the increase in sample size without meeting the assumptions has a marked effect on the stability of R.
Final evidence which might be considered in answering the question of adequacy is the correlation of the true R values with the mean of the estimated R values for each parameter set. This correlation is .759 (N = 21, p < .001). Even though this correlation is significant, it must be remembered that for any given parameter set there were many R values which differed greatly from the true R.

Table 5.5 Average Ranges for the R Estimates

                            All          Sample Size = 50   Sample Size = 200
Meet Assumptions           .55 (N=7)     .67 (N=5)          .26 (N=2)
Does Not Meet Assumptions  .62 (N=14)    .74 (N=9)          .39 (N=5)
All                        .60 (N=21)    .72 (N=14)         .36 (N=7)

Consideration of ranges, variances, skewness and kurtosis values reveals that the R technique more adequately estimates the true R when the sample size is larger, i.e., N = 200. In addition, the R technique more adequately estimates the true R when the assumptions of R are met. The differences in the adequacy are far more dramatic, however, when the sample size is increased than when the assumptions are met.

The C-V Technique Versus the R Technique

When the distributions of the estimates of the C-V and R values are compared, it appears that, in general, the C-V estimates are distributed more normally than the R estimates. The R values tend to be higher than the C-V values and there seems to be a ceiling effect, i.e., the R distributions are generally skewed negatively and in almost every case approach the upper bound (1.0). The C-V distributions, while skewed negatively in 14 cases, seem to span a middle range of values. Consider the summary statistics for the C-V and R distributions provided in Table 5.6 and the correlation matrix in Table 5.7.

Table 5.6 Summary Statistics Comparing R to C-V

       Average Absolute    Range of          Range of           Range of           Average
       Deviation From      Variances         Kurtosis Values    Skewness Values    Range of
       True Value                                                                  Values
C-V    .0891               .0013 to .0094    -.2957 to .1005    -.2693 to .0938    .46
R      .0965               .0015 to .0221    -.4356 to 1.7626   -.8651 to .1022    .60

Table 5.7 Correlations

            True C-V   Mean R   Mean C-V
True R      .820       .759     .608
True C-V               .855     .800
Mean R                          .891

One can infer from these statistics that the C-V technique estimates the true C-V values better than the R technique estimates the true R values. The variances of the distributions of the C-V values are smaller. The largest variance for the C-V values is .0094, while the largest variance for the R values is .0221. The range of the kurtosis values and the range of the skewness values for the C-V index are considerably smaller than the ranges for the R index. The average range of values for the parameter sets is smaller for the C-V index than for the R index. Finally, if the correlations of the mean index with the true value are considered, the C-V technique provides a closer estimate of the true C-V value (r = .80 for C-V compared to r = .76 for R).

It is interesting to note that the means of the estimates of the R index are more closely related to the true C-V value (.86) than they are to the true R values (.76), or than the means of the estimates of the C-V index are related to the true C-V values (.80). In interpretation of the correlations, it must be remembered that the means of the estimated values for a given parameter set (over 1000 values) are correlated with the true values. Means are more stable than the actual estimates. The other statistics (range, kurtosis, variance, and skewness) must be considered in the evaluation of the adequacy of the techniques. When all statistics are considered, the C-V technique seems to provide a more stable estimate of the true value than the R technique, and the distributions of the C-V values seem to be more normally shaped than the R values for any given parameter set.
Consideration of C-V and R Techniques By Parameter Set

The conclusion from the analyses cited in the previous sections is that the C-V technique provides a more stable estimate of the true value than does the R technique. Now consider the parameter sets individually. Perhaps one technique is a better estimator than the other technique under certain conditions. If so, what are these certain conditions?

Consider, first, the parameter sets where the assumptions for the index are met. Table 5.8 gives the summary statistics for R and C-V. It is apparent from these data that, over 1000 samples, the mean estimate for either index is better when the assumptions are met than when they are not. (See column one of Table 5.8.) One perplexing fact is that the variances for those parameter sets where the R assumptions are met span a larger range than do those parameter sets where the R assumptions are not met. However, if the size of the samples is also considered and only those parameter sets with N = 200 are compared, then the variances are less when the R assumptions are met. Interestingly, the same unexpected result occurs if the variances of the C-V estimates are considered. Here, for sample sizes of 50, the range of the variances is slightly greater when the assumptions are met than when they are not met. This is not true, however, for sample sizes of 200. Caution must be used in interpreting these results, since the number of parameter sets used is quite small (see column six of Table 5.8).

[Table 5.8, Summary Statistics for R and C-V with Consideration of Sample Size and Assumptions, is not legible in the original.]

The range of kurtosis values for R is greater when meeting the assumptions than when not. The opposite is true for the range of kurtosis values for C-V. Similarly, the range of skewness values for R is greater when the assumptions are met than when they are not, and the range of the skewness values for C-V is smaller when the assumptions are met than when they are not. Finally, the average range of the respective values is smaller for both indices when the assumptions are met.

Sample size has a marked effect on the results of the simulation for any parameter set. Noted above was the effect that sample size had on the range of the variances. Also, the mean of the estimated values is closer to the true value for both indices when the sample size is 200. However, the increase in sample size for both indices has a greater effect on the mean of the estimated values when the assumptions are met than when the assumptions are not met.
Of course, the increase in sample size also decreases the range of estimated values for both indices. For R, this average range is reduced by 61 percent for the parameter sets meeting the assumptions, but only by 47 percent for those not meeting the assumptions. For C-V, the reduction in the average range is 47 percent and 46 percent, respectively. The increase in sample size narrows the range of estimated values considerably. The ranges of skewness and kurtosis values are much narrower for the parameter sets where N = 200 than for the parameter sets where N = 50 for the R index. For C-V, the ranges are closer, although generally smaller for N = 200. There is one exception: the range of the kurtosis values when the C-V assumptions are met is greater for N = 200 than for N = 50. However, there were only two parameter sets included for these categories, so the statistics must be interpreted cautiously.

Two factors have been considered above: one, whether or not the assumptions of a particular index were met, and two, sample size, i.e., what happened to the distributions of estimated values when the sample size was increased from 50 to 200. The analysis of the data with respect to these two factors seems to indicate that the C-V method provides a better estimate when the assumptions are met, although the technique is still good under other assumptions. The R method seems to be unstable. The descriptive statistics indicate that the R method does not provide good estimates even under the best of circumstances. An increase in sample size helps the R method. The C-V technique, although a better technique with a larger sample size, remains stable with smaller sample sizes.

Now consider the individual parameter sets. Consider only the mean of the estimates, the variances, and the ranges for each index for each parameter set.
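The percent reductions quoted above follow directly from the average ranges in Tables 5.3 and 5.5. As a quick check (values transcribed from those tables, rounded to whole percents):

```python
def pct_reduction(avg_range_n50, avg_range_n200):
    """Percent reduction in average range when N grows from 50 to 200."""
    return round(100 * (avg_range_n50 - avg_range_n200) / avg_range_n50)

# R index (Table 5.5): assumptions met, then not met
print(pct_reduction(0.67, 0.26), pct_reduction(0.74, 0.39))  # 61 47
# C-V index (Table 5.3): assumptions met, then not met
print(pct_reduction(0.53, 0.28), pct_reduction(0.54, 0.29))  # 47 46
```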
Table 5.9 indicates for each parameter set whether the absolute deviation from the true value is smaller for R or C-V, the variance is smaller for R or C-V, and the range is smaller for R or C-V. For each column the letter R or C-V indicates that the statistic is smaller for that technique. In 29 percent of the parameter sets the R technique estimates the true value better than the C-V technique estimates the true value. In less than 10 percent of the cases the variance for R is smaller, and in 14 percent of the cases the range is smaller.

Table 5.9 Comparison of R and C-V by Parameter Set(1)

Parameter Set   Deviation From True Value   Variance   Range
 1.             C-V                         C-V        C-V
 2.             R                           C-V        C-V
 3.             C-V                         C-V        C-V
 4.             R                           C-V        C-V
 5.             C-V                         C-V        C-V
 6.             C-V                         C-V        C-V
 7.             R                           C-V        C-V
 8.             R                           R          R
 9.             R                           R          R
10.             C-V                         C-V        C-V
11.             C-V                         C-V        C-V
12.             C-V                         C-V        C-V
13.             C-V                         C-V        C-V
14.             C-V                         C-V        C-V
15.             R                           C-V        C-V
16.             C-V                         C-V        R
17.             C-V                         C-V        C-V
18.             C-V                         C-V        C-V
19.             C-V                         C-V        C-V
20.             C-V                         C-V        C-V
21.             C-V                         C-V        C-V

(1) For each column the letter R or C-V indicates that the statistic is smaller for that technique.

Consider the two parameter sets where the R technique appears to be the better technique (#8 and #9). In these parameter sets, it was assumed that 80 percent of the individuals would not know the answer at pretest but would know it at posttest (π2 = .80). (See Table 4.5 in Chapter IV.) In parameter set #8, it was also assumed that instruction would improve the chance of guessing (q21 = .25, q21' = .50), and that for the pretest there would be some chance that an individual knowing the answer would fail the item (q12 = .10). Parameter set #9 assumed only that there was the same chance of guessing for both pretest and posttest (q21 = q21' = .25). This parameter set meets the R assumptions. It is interesting to note that for the parameter sets where an R occurs in any column, the assumptions for the R index are met in six of these seven cases.
Other than the two factors, sample size and meeting the correct R assumptions, there seems to be no pattern for the estimates being better for one parameter set than for another. It does appear, however, that the more assumptions of the R technique that are not met, the less accurate the estimates. The C-V technique seems to provide reasonable estimates regardless of sample size or meeting assumptions.

Summary

Three questions were considered in the designing of the simulation. These were:

1. Do the C-V and R techniques adequately estimate the true values?
2. Does one technique estimate the true values better than the other?
3. Do the C-V and R techniques estimate some true values better than others?

The answers to these questions were discussed in this chapter. First, the adequacy of a technique (R or C-V) was determined by consideration of a number of descriptive statistics. It was found that for R, when the assumptions are met, the technique provides a more stable and accurate estimate. It was also found that when the sample size is increased from 50 to 200, the stability and accuracy increase greatly. A correlation coefficient of .759 between the true R value and the mean R value for 1000 estimates implies that the procedure provides a reasonable estimate of the true R.

The C-V technique seems to provide a reasonably accurate and stable estimate regardless of whether the assumptions are met. The estimates, however, are more stable with larger sample sizes, e.g., the average range is .54 for N = 50 and .29 for N = 200. A correlation coefficient of .80 between the true C-V and the mean C-V value for 1000 estimates implies that the procedure provides a reasonable estimate of the true C-V.

Second, the C-V technique seems to estimate the C-V true value better than the R technique estimates the R true value. The average absolute deviation from the respective true values is smaller for C-V than for R (.0891 and .0965, respectively).
In addition, the range of variances is considerably smaller for the C-V estimates than for the R estimates (.0013 to .0094 for C-V and .0015 to .0221 for R), and the average range of estimated values is smaller (.46 for C-V and .60 for R).

The third question was primarily answered in considering the question of adequacy and stability. For both techniques the estimates are better when the sample size is larger (N = 50 versus N = 200). In addition, the R approach is better when the assumptions are met. This is not true for the C-V approach. The C-V approach seems to provide a good estimate under almost any assumptions.

The next chapter describes the results of the comparison of the R and C-V approaches using actual data on 128 items. In addition, a third approach, the B-S method, is also used on 64 of the 128 items and the results compared to the R and C-V values.

CHAPTER VI

RESULTS OF THE COMPARISON OF THE THREE INDICES WITH ACTUAL DATA

The purpose of this part of the study was to determine the comparability of the three item analysis procedures, C-V, R and B-S. For this comparison, data were obtained from the Michigan Middle Cities Project. Sixteen objectives were chosen from two subject areas, Mathematics and Reading. In addition, two levels, Middle and Upper, were considered in the selection of the objectives. These levels refer to grades three and four, and five and six, respectively. Each objective was tested by four items on a pretest, four different items on a posttest and all eight items on a retention test. The retention test was given approximately 40 days after the posttest. There were also two treatment groups considered in the selection of the objectives. In treatment A, teachers were assigned objectives. In treatment B, teachers selected the objectives. Diagram 6.1 shows the complete design.

The major question considered was "Do the C-V, R and B-S item analysis procedures provide comparable results?" Three other questions were also considered:

1.
Are the three procedures more comparable for items in Mathematics than for items in Reading?
2. Does the comparability of the three procedures depend on the grade level? and,
3. Are the three procedures equally comparable for items given in treatment A as for items given in treatment B?

These last three questions are part of the major question and will be treated as such in the discussion of the results.

Subject       Level    Treatment   Objective Number   N    Items
Reading       Middle   A           142                31   1-8
                                   116                59   9-16
                       B           112                21   17-24
                                   120                20   25-32
Reading       Upper    A           145                66   33-40
                                   199                57   41-48
                       B           182                30   49-56
                                   166                18   57-64
Mathematics   Middle   A           108                52   65-72
                                   111                43   73-80
                       B           107                42   81-88
                                   109                37   89-96
Mathematics   Upper    A           198                22   97-104
                                   176                46   105-112
                       B           187                16   113-120
                                   167                17   121-128

Diagram 6.1 Design of Administration of Items

Comparability

C-V and R

Consider the testing procedure for each objective. Four items were given on the pretest, four different items were used on the posttest, and all eight items were included on the retention test. For computation of a C-V index or an R index it is necessary to have data on a given item at two times, preferably a pretest and a posttest. In this situation, it was necessary to compute the C-V and R indices from pretest-retention test data and from posttest-retention test data. The indices can be computed from pretest-posttest data on parallel items, but the usefulness and meaningfulness of these data for item selection and revision is questionable. There are 64 items using pretest-retention test data for which C-V and R can be computed and 64 different items using posttest-retention test data for which C-V and R can also be computed. These two sets of data were considered separately in the analyses. The results of the correlations of C-V and R are presented in Table 6.1.
Table 6.1 Correlations of C-V and R

              N    r(C-V, R)             N(#)   r(C-V, R)
                   (Pretest-Retention)          (Posttest-Retention)
All           64   .80**                 55     .76**
Math          32   .88**                 27     .81**
Reading       32   .87**                 28     .67**
Upper         32   .81**                 28     .62**
Middle        32   .80**                 27     .82**
Treatment A   32   .79**                 30     .69**
Treatment B   32   .82**                 25     .80**

**Significant at p < .01
(#)Some values of R did not exist because there were no individuals in the combined categories of f1 (fail-fail) and f2 (fail-pass). The computation of R involves f1 + f2 in the denominator, and if this is zero, R does not exist.

The correlations between C-V and R for the indices computed on pretest and retention test data range from .79 to .88. All these correlations are significantly different from zero (p < .01). Using Fisher's Z-transformation, pairwise comparisons of the correlations between Mathematics and Reading, Upper and Middle, and Treatment A and Treatment B showed no significant differences.

The correlations between C-V and R for the indices computed on posttest and retention test data range from .62 to .82. Again all these correlations are significantly different from zero (p < .01). Using Fisher's Z-transformation, pairwise comparisons of the correlations between Mathematics and Reading, Upper and Middle, and Treatment A and Treatment B were made. There were no significant differences.

For both sets of data (items given on the pretest and retention test and items given on the posttest and retention test), the analyses indicate that:

1. The C-V and R values are significantly related and the procedures would result in similar item selection;
2. The C-V and R values are not more related for Mathematics than for Reading;
3. The relationship between the C-V and R procedures does not depend on grade level; and,
4. There is no difference in the relationship between the C-V and R procedures when treatments are considered.

B-S and C-V

The B-S procedure requires that an item be given on a pretest and on a posttest.
In this situation, it was necessary to apply the B-S rules to pretest-retention test data only. There are 64 items for which a decision about item revision, using the B-S procedure, can be made. Using the rules on posttest-retention test data is not meaningful.

There is also one additional restriction. To apply some of the decision rules, it is necessary to select cut-off values. The analyses of the items using the B-S approach were based on an arbitrary cut-off value of .50 for the error rates. If the error rate was below .50, the error rate was considered low; if above .50, the error rate was considered high. The original intent was to select multiple cut-off values for the error rates, but the data indicated that choosing different cut-off values would not change the decision about the revision of the items. Only 18 items met the criterion of no significant positive difference between the theoretical error rates (TER) and the pretest error rates (BER). These items all had values of BER greater than .50. TER, since the items were three-option multiple choice items, is always .67. It would be meaningless to lower the cut-off for the error rates since the same 18 items would be chosen. To raise the cut-off would exclude more items, but since the stronger criterion of no significant positive difference between TER and BER is met for these 18 items, the increase in the cut-off does not seem particularly reasonable.

First the individual statistics TER, BER, PER, BDI and PDI were computed for the 64 items. Then the appropriate rules were applied and a decision was made about the quality of the item; that is, does the item need to be revised? Each item was assigned a "0" or a "1" depending on the outcome of the application of the rules. A "0" indicates revision is required, and a "1" indicates no revision is required. See Appendix IV for the statistics on the 64 items and the resulting application of the rules.
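The 0/1 screening just described can be sketched as follows. This is a simplified stand-in, not the full set of B-S decision rules (which also involve significance tests of TER against BER and the discrimination indices); only the .50 error-rate cut-off and the 0/1 coding come from the text, and the example error rates are hypothetical.

```python
# Theoretical error rate for a three-option multiple choice item: a pure
# guesser misses two times in three.
TER = 2 / 3

def bs_decision(ber, per, cutoff=0.50):
    """Return 1 (no revision needed) when the pretest error rate (BER) is
    high -- learners could not answer before instruction -- and the posttest
    error rate (PER) is low; otherwise return 0 (revision required).
    Simplified stand-in for the full B-S rule set."""
    high_before = ber >= cutoff
    low_after = per < cutoff
    return 1 if high_before and low_after else 0

# Hypothetical (BER, PER) pairs for four items
items = [(0.70, 0.20), (0.40, 0.10), (0.80, 0.60), (0.72, 0.35)]
print([bs_decision(ber, per) for ber, per in items])
```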
Application of the rules resulted in 18 items needing no revision.

Point-biserial correlations were computed between the resulting values from the B-S procedure and the C-V values. The correlations are presented in Table 6.2.

Table 6.2  Correlations Between B-S and C-V

              N    rp-bis
All           64   .70**
Mathematics   32   .69**
Reading       32   .50**
Upper         32   .70**
Middle        32   .68**
Treatment A   32   .45**
Treatment B   32   .84**

**Significant at the .01 level.

The correlations between C-V and B-S values range from .45 to .84. All these correlations are significantly different from zero (p < .01). Pairwise comparisons reveal the largest difference is between the point-biserials for treatment A and treatment B. These analyses indicate that application of the B-S or C-V procedure results in selection of many of the same items. In addition, the B-S and C-V procedures are slightly more comparable for Mathematics than for Reading; the relationship between the procedures does not depend on grade level; and the B-S and C-V procedures are considerably more comparable for treatment B than for treatment A.

B-S and R

The same restrictions apply to the comparisons of the B-S and R indices as to the comparisons of the B-S and C-V indices. Point-biserial correlations were computed between the B-S values and the R values. The correlations are presented in Table 6.3.

Table 6.3  Correlations Between B-S and R

              N    rp-bis
All           64   .36**
Mathematics   32   .39*
Reading       32   .24
Upper         32   .37*
Middle        32   .37*
Treatment A   32   .21
Treatment B   32   .52**

*Significant at the .05 level.
**Significant at the .01 level.

The correlations between R and B-S values range from .21 to .52, considerably smaller than the correlations between the C-V and B-S values. Only the correlations between all the R and B-S values and the R and B-S values for treatment B are significant at the .01 level. The correlations for Mathematics, Upper and Middle are significant at the .05 level.
The correlations are not significantly different from zero for Reading and treatment A. Pairwise comparisons show that the largest difference is between the correlations for treatment A and treatment B. These analyses indicate that the relationship between the results of the B-S and R procedures is not very strong, but many of the same items would be selected with either procedure. In addition, the B-S and R procedures are considerably more comparable for Mathematics than for Reading and for treatment B than for treatment A. The relationship does not appear to depend on grade level.

B-S and C-V and R

The correlations between the indices R, C-V and B-S for the 64 pretest-retention test items are all significantly different from zero (p < .01). The relationship between the R and B-S values is markedly different from the other two relationships. Table 6.4 summarizes the three correlations.

Table 6.4  Correlations for All Items (N=64)

        R       B-S
C-V    .80**   .70**
R              .36**

**Significant at the .01 level.

These significant correlations indicate that the three indices are related. In particular, the R and C-V indices are the most similar. The R index, however, does not appear to give results as similar to the B-S procedure as does the C-V index.

Consider the correlations between the indices for each subject area, grade level and treatment for the 64 items (pretest-retention test). These correlations are reported in Tables 6.5 A and B, 6.6 A and B, and 6.7 A and B.

Table 6.5 A  Correlations--Mathematics

        R       B-S
C-V    .88**   .69**
R              .39*

Table 6.5 B  Correlations--Reading

        R       B-S
C-V    .87**   .50**
R              .24

Table 6.6 A  Correlations--Middle

        R       B-S
C-V    .80**   .68**
R              .37*

Table 6.6 B  Correlations--Upper

        R       B-S
C-V    .81**   .70**
R              .37*

Table 6.7 A  Correlations--Treatment A

        R       B-S
C-V    .79**   .45**
R              .21

Table 6.7 B  Correlations--Treatment B

        R       B-S
C-V    .82**   .84**
R              .52**

*Significant at the .05 level.
**Significant at the .01 level.
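The point-biserial coefficients in these tables correlate a dichotomous variable (the 0/1 B-S decision) with a continuous one (the C-V or R value). A minimal sketch of the computation, with hypothetical item values:

```python
import math

def point_biserial(binary, values):
    """Point-biserial correlation between a 0/1 variable and a continuous
    one; algebraically identical to Pearson r on the same data."""
    n = len(values)
    ones = [v for b, v in zip(binary, values) if b == 1]
    zeros = [v for b, v in zip(binary, values) if b == 0]
    p, q = len(ones) / n, len(zeros) / n          # group proportions
    m1, m0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population SD
    return (m1 - m0) / s * math.sqrt(p * q)

# Hypothetical B-S decisions (1 = no revision needed) and C-V values per item
bs = [1, 1, 0, 0, 1, 0]
cv = [0.52, 0.48, 0.03, 0.06, 0.38, 0.19]
print(round(point_biserial(bs, cv), 2))
```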
Based on these correlations it appears that the three procedures are more comparable for items in Mathematics than for items in Reading, and for items given in treatment B than for items given in treatment A. Although all the procedures are significantly related for Mathematics, the relationship between the B-S procedure and the R index is markedly different from the relationships of R with C-V and of C-V with B-S. This same difference in the size of the relationships appears in all of the other comparisons, i.e., Reading, Middle, Upper, treatment A, and treatment B. The difference is less for the treatment B correlations than for the other comparisons.

An alternate method of analyzing the comparability of the three approaches would be to consider the agreement among the three methods. If a cut-off value for the C-V and R index is chosen as .50, i.e., those items with an R or C-V value equal to or above .50 are considered to be good items, then 32 items out of 64 would be selected based on the R values and 11 items would be selected based on the C-V values. Of the 32 items selected based on the R values, 13 were also selected using the B-S procedure. All 11 items selected based on the C-V values were selected using the B-S procedure. Similarly, all 11 items selected based on the C-V values were selected using the R procedure (see Table 6.8).

Table 6.9 represents the agreement among the three indices. There is complete agreement for 39 of the 64 items, or 61 percent. Of the items where there is 100 percent agreement, 21 of the 39 were Reading items (54 percent); 20 of the 39 were given in the Middle grades (51 percent); and 15 of the 39 items were used in treatment A (38 percent). The disagreement among procedures is more noticeable between treatments. Of the 64 items there is agreement between the C-V and B-S procedures for 57 items, or 89 percent.
There is considerably less agreement between the C-V and R and the R and B-S procedures, the percentage agreement being 69 percent and 64 percent respectively.

Summary

The purpose of this part of the study was to determine the comparability of three item analysis procedures: C-V, R and B-S. Sixteen objectives were chosen from two subject areas, Mathematics

Table 6.8  B-S, R and C-V Values for Items Given on the Pretest and Retention Test

Item Identification*   B-S    R       C-V
R116G1 RAM             0      .1      .03
R116G2 RAM             0      .48     .25
R116G3 RAM             0      .11     .02
R116G4 RAM             0      0       0
R120S1 RBM             0      0       0
R120S2 RBM             0      0       0
R120S3 RBM             0      1.0     -.25
R120S4 RBM             0      1.0     -.15
R142G1 RAM             0      1.0     .06
R142G2 RAM             1      1.0     .52
R142G3 RAM             0      .86     .19
R142G4 RAM             1      .88     .48
R112S1 RBM             0      .71     .24
R112S2 RBM             0      .75     .14
R112S3 RBM             1      .09     .05
R112S4 RBM             0      .33     .05
M109S1 MBM             0      .33     .03
M109S2 MBM             0      .27     .11
M109S3 MBM             0      .5      .19
M109S4 MBM             0      .21     .08
M108G1 MAM             0      .71     .38
M108G2 MAM             1      .61     .38
M108G3 MAM             1      .90     .73
M108G4 MAM             1      .83     .58
M107S1 MBM             0      .06     .024
M107S2 MBM             0      0       0
M107S3 MBM             0      .18     .071
M107S4 MBM             0      .06     .024
M111G1 MAM             0      .75     .28
M111G2 MAM             0      .65     .26
M111G3 MAM             0      .47     .19
M111G4 MAM             0      .53     .21
M187S1 MBU             1      .90     .56
M187S2 MBU             1      .91     .63
M187S3 MBU             1      1.0     .63
M187S4 MBU             1      1.0     .75
R182S1 RBU             0      .67     .2
R182S2 RBU             0      0       0
R182S3 RBU             0      .33     .07
R182S4 RBU             0      -.33    -.03
M167S1 MBU             1      .75     .53
M167S2 MBU             1      .73     .65
M167S3 MBU             1      .8      .71
M167S4 MBU             1      .71     .59
R166S1 RBU             0      1.0     .33
R166S2 RBU             0      .2      .06
R166S3 RBU             0      .67     .11
R166S4 RBU             0      .5      .17
M198G1 MAU             0      .5      .14
M198G2 MAU             0      .67     .18
M198G3 MAU             0      .7      .32
M198G4 MAU             0      .5      .14
M176G1 MAU             1      .23     .15
M176G2 MAU             1      .26     .26
M176G3 MAU             1      .04     .04
M176G4 MAU             1      .11     .11
R145G1 RAU             0      .27     .09
R145G2 RAU             0      .5      .20
R145G3 RAU             0      .3      .15
R145G4 RAU             0      .3      .11
R199G1 RAU             0      .4      -.14
R199G2 RAU             0      -.5     -.11
R199G3 RAU             0      -1.78   -.28
R199G4 RAU             0      -.125   -.05

*The last three letters of the Item Identification refer to subject area (M = Mathematics, R = Reading); treatment (A or B); and grade level (M = Middle, U
= Upper).

Table 6.9  Agreement of the Three Item Indices: 100% Agreement

              Items          Items
              Unacceptable   Acceptable   Total
All           28 (44%)       11 (17%)     39 (61%)
Mathematics    8 (20%)       10 (26%)     18 (46%)
Reading       20 (51%)        1 ( 3%)     21 (54%)
Middle        17 (43%)        3 ( 8%)     20 (51%)
Upper         11 (28%)        8 (21%)     19 (49%)
Treatment A   12 (30%)        3 ( 8%)     15 (38%)
Treatment B   16 (41%)        8 (21%)     24 (62%)

Table 6.9 A  Agreement of the Three Item Indices: 67% Agreement

              Items          Items
              Unacceptable   Acceptable   Total
All           23 (36%)        2 ( 3%)     25 (39%)
Mathematics   13 (52%)        1 ( 4%)     14 (56%)
Reading       10 (40%)        1 ( 4%)     11 (44%)
Middle        10 (40%)        2 ( 8%)     12 (48%)
Upper         13 (52%)        0 ( 0%)     13 (52%)
Treatment A   15 (60%)        2 ( 8%)     17 (68%)
Treatment B    8 (32%)        0 ( 0%)      8 (32%)

Table 6.9 B  Agreement of the Three Item Indices, by pairs of procedures (C-V and B-S, R and B-S, R and C-V) [the body of this table is illegible in the scan]

and Reading, two grade levels, Middle and Upper, and two treatments, assigned objectives (treatment A) and selected objectives (treatment B). A total of 64 items were analyzed using each of the three item analysis procedures. An additional 64 items were analyzed using only the C-V and R procedures. The major question to be answered was: do the C-V, R and B-S item analysis procedures provide comparable results? Three additional questions also were considered:

1. Are the three procedures more comparable for items in Mathematics than for items in Reading?;

2.
Does the comparability of the three procedures depend on the grade level?; and,

3. Are the three procedures more comparable for items given in treatment A than for items given in treatment B?

Correlation coefficients were computed between the indices for the 64 items given on a pretest and a retention test. The Pearson product moment correlation coefficient between the R and C-V indices was significantly different from zero (r = .80, p < .01). The point-biserial correlation coefficients between the B-S procedure and the C-V index and between the B-S procedure and the R index were also significantly different from zero (r = .70, p < .01 and r = .36, p < .01, respectively). These correlations indicate that the three indices are related and provide reasonably comparable results.

The separate analyses of the indices for each subject area, grade level and treatment indicated that the indices were more comparable for Mathematics than for Reading, with all correlations between the indices significant for Mathematics [r(R,C-V) = .88, p < .01; r(C-V,B-S) = .69, p < .01; r(R,B-S) = .39, p < .05] and only two out of the three correlations significant for Reading [r(R,C-V) = .87, p < .01; r(C-V,B-S) = .50, p < .01; r(R,B-S) = .24, not significant]. The indices were also more comparable for treatment B than for treatment A, with all correlations significant for treatment B [r(R,C-V) = .82, p < .01; r(C-V,B-S) = .84, p < .01; r(R,B-S) = .52, p < .01] and only two out of the three correlations significant for treatment A [r(R,C-V) = .79, p < .01; r(C-V,B-S) = .45, p < .01; r(R,B-S) = .21, not significant].

... > , < , or = between two numerals in the range 1-50.

Given a number sentence with the operation sign (+ or -) missing, the learner will complete it by writing in the correct sign.

Given a sequence of numerals, involving skip counting by two's up to 30, the learner will write the missing numeral.

Given a sequence of numerals, involving skip counting by five's and ten's up to 50, the learner will write the missing numeral.
Given an oral numeral, not to exceed 50, the learner will be able to write it.

Given an oral word problem requiring addition, with sums less than 18, the learner will find the sum.

Given a marked clock face, the learner will state time to the nearest indicated half hour.

Given pictures of a circle, square or triangle, the learner will identify the shaded portion that corresponds to 1/2 or 1/4.

57. Given three one-digit numerals, with the sum less than 21, the learner will find the sum.

58. Given number phrases less than 20, the learner will supply the appropriate symbol of equality or inequality, > , = , or < .

59. Given two two digit numerals, requiring no regrouping, the learner will find the sum.

61. Given a problem of the form (two digit - one digit with no regrouping), the learner will find the difference.

62a. Given hours, minutes and days, the learner will indicate the correct relationship between them.

63a. Given line segments and a ruler, the learner will measure the line segments to the nearest half centimeter.

64a. Given a 20 centimeter ruler, the learner will construct a line segment of specified length, designated to the nearest half centimeter.

65. Given an addition problem of the form 21 + 34 = 34 + __, the learner will give the missing addend.

66. Given any one, two or three digit numeral, the learner will write it in expanded notation.

67. Given two two digit numerals, requiring regrouping, the learner will find the sum.

68. Given a problem of the form, two digit minus one digit, the learner will find the difference, regrouping if necessary.

69a. Given pictures of money or play money, less than or equal to $1.00, the learner will compare the values between coins.

70a. Given pictures of money or play money, less than or equal to $20.00, the learner will write the given money value using the symbols of dollar sign and decimal.

71a. Given a clock face with hands, the learner will write time in time notation to half hour and quarter hour.
72a. Given cup, pint, quart and liter containers, the learner will determine experimentally the number of cups in a pint, pints in a quart, and approximate quarts in a liter.

73. Given oral word problems involving addition and subtraction with numbers less than 18, the learner will solve them.

73.5 Given subtraction problems in both horizontal and vertical forms, with minuends not to exceed 18, the learner, without regrouping, will find the missing subtrahend.

74. Given a problem of the form (three digit minus one, two or three digit), the learner will find the difference when no regrouping is required.

75. Given a problem of the form (two digit minus two digit), the learner will find the difference, regrouping if necessary.

76. Given a story problem read orally by the teacher, the learner will tell which operation he must use to solve the problem (addition or subtraction).

77. Given an oral word problem requiring subtraction, requiring regrouping, the learner will state and do what operation is necessary to find the difference.

79a. Given several objects divided into (1/2's, 1/3's, 1/4's or whole) by comparing to the whole unit.

80. Given number sequences in which some of the numbers are omitted, the learner will complete the number sequences up to 200.

81. Given two three digit numerals, the learner will apply the appropriate symbol between them ( > , < , = ).

82a. Given play money, the learner will make change from $1.00 for any amount up to $.99.

83a. Given drawings of lines, the learner will point out which ones are (relatively) horizontal and which are (relatively) vertical.

84. Given two three digit numbers, the learner will find the sum, regrouping if necessary.

85. Given a three digit minuend and a two or three digit subtrahend, the learner will find the difference, regrouping if necessary.

86. Given a number and the consecutive multiples of ten or 100 between which it falls, the learner will choose the nearer estimate.

87. Given column addition exercises involving three two digit addends, the learner will find the sum, regrouping if necessary.

88. Given a pair of numbers or number phrases less than 1,000, the learner will supply the appropriate symbol > , < , or = .

89. Given two addends less than 10,000, the learner will find the sum, regrouping if necessary.
Given column addition exercises involving three two digit addends, the learner will find the sum, regrouping if necessary. Given a pair of numbers or number phrases less than 1,000, the learner will supply the appropriate symbol > , <, or =. Given two addends less than 10,000, the learner will find the sum, regrouping if necessary. 90a. 91a. 92a. 93a. 94. 95. 96. 97. 98. 99. 100. 101. 102. 103. 163 Given a clock face with hands, the learner will read the time to the nearest minute. Given a length expressed in centimeters, the learner will express it as a number of centimeters plus a number of millimeters. Middle Level Given an object or line segment, the learner will, without the use of a ruler, choose the correct estimate from a set of answers of this form. 2 millimeters, 2 centimeters 2 meters, 2 decimenters Given the terms, centimeter, meter and decimeter, the learner will state the relationship between them. Given a repeated addition sentence, the learner will represent it as a multiplication sentence with its product. Given multiplication problems using one as a factor, the learner will find the product. Given any multiplication combinations, less than 5 x 5, the learner will write the product. Given multiplication problems using zero as a factor, the learner will find the products. Given sets of not more than 20 elements, the learner will divide them into equivalent sub-sets. Given a mathematical sentence of the form (3 x 4 = 4 x __), the learner will identify the missing factor. Given basic multiplication problems, the learner will find the products, using the distributive property of multiplication over addition. Given any multiplication combination up to 9 x 9, the learner will write the product. Given multiplication problems in which the factors are whole numbers less than ten, and one factor is missing, the learner will record the missing factor. Given the basic division facts through the nines, the learner will find the quotients. 104. 105. 106. 107. 
108. 109a. 110a. 111. 112a. 113a. 114. 115. 116a. 117a. 118. 164 Given a multiplication number sentence with two missing factors, the learner will supply any two basic factors to make the given multiplication number sentence true (ex: ___x ___= 16). Given any number as the dividend, and zero as the divisor, the learner will indicate that there is no solution to the problem. Given a word problem requiring multiplication, the learner will write the correct equation to go with the problem. Given a set of multiplication equations in which one factor is a multiple of 10,000 or 1,000, the learner will write related division equations. Given a multiplication problem of the form (one digit number x two digit number) the learner will find the product. Given a shaded region located on a piece of graph paper or some other grid, the learner will find the area by counting the num- ber of square units. Given pictures or models of geometric figures; cube, cylinder, sphere, the learner will identify them. Given two factors which are multiples of ten, the learner will find the product. Given a sentence involving the terms "in the morning, in the afternoon, in the evening," the learner will supply the appro- priate AM or PM notation. Given two times to the nearest half hour, the learner will find the length of the interval between them. Given two or three whole number addends less than 100,000 in horizontal or vertical form, the learner will find the sum, regrouping if necessary. Given subtraction problems with up to four digit minuends and subtrahends, the learner will find the differences, regrouping if necessary. Given Arabic numerals 1 through 39, the learner will convert them to Roman numerals. Given Roman numerals I through XXXIX, the learner will rewrite them to Arabic numerals. Given a numeral with up to four digits, the learner will rewrite the given numerals, using expanded notation. 119. 120. 121. 122. 123. 124. 125a. 126a. 127a. 129. 130. 131. 132. 133. 134. 135. 
165 Given a completed division problem, the learner will identify the divisor, dividend, quotient and remainder. Given a series of four numbers, the learner will compute the average. Given a multiplication problem with two two digit factors, the learner will find the product. Given a division problem, with a two digit divident and a one digit divisor, the learner will determine the quotient, with or without a remainder. Given multiplication problems involving a multiple of ten times a multiple of 100, the learner will find the products. Given a multiplication problem with a two digit factor and a three digit factor, the learner will find the product. Given the length (whole numbers less than 10), of the sides of a rectangular region, the learner will find the area. Given a line segment to measure and a 20 cm ruler with milli- meter markings, the learner will express its measure in whole centimeters or millimeters. Given a sequence of metric pre-fixes, the learner will arrange them in order from smallest to largest. Given a fraction orally, the learner will write the fraction. Given a proper fraction, the learner will identify the numerator and the denominator of the fraction. Given a denominator, the learner will supply the correct numer- ator to make the value of the fraction equal to one, without the use of aids. Given a proper fraction with a denominator less than nine, the learner will explain the meaning of each fraction by making a drawing or by using fractional cut-outs. Given a simple fraction, the learner will give at least two equivalent fractions. Given fractions with like denominators, the learner will add to the sum of less than one. Given any five fractions with like denominators, in random order, the learner will write them in numerical order. 136. 137. 138. 139. 140. 141. 142a. 143. 144a. 145a. 146. 147. 148. 149. 
166 Given a division problem with a three digit divident and a one digit divisor, the learner will determine the wuotient, with or without a remainder. Given two factors of up to three digits each, the learner will estimate the product by rounding both factors to the nearest ten and multiplying. Given the decimal fraction of no more than three places, the learner will name the place value of the digit. Given an addition or subtraction of decimal problem in vertical form with no more than five digits and no more than three deci- mal places, with each problem having the same number of decimal places, the learner will find the sum or difference and correctly place the decimal point. Given an expressed amount of money, the learner will multiple or divide the given amount by a whole number. Given numerals between ten and 5,000, the learner will round off numerals to the nearest 10's, 100's, or l,OOO's place. Given any numeral from 1,000 to 9,999,999, the learner will locate and separate the periods with commas. Given a story problem, with whole numbers and requiring only one operation (addition, subtraction, multiplication or simple division), the learner will choose the correct operation and do the computation. Given a six digit numeral in oral form, the learner will write the given six digit numeral. Given two times to the nearest minute, the learner will find the time interval. Given any four digit number, the learner will give the number that is 100 or 1,000 less than it is without using formal addition or subtraction. Given an exercise in multiplication, the learner will multiply a three or four digit factor by a two or three digit factor. Given a division problem with a four digit divident and a one digit divisor, the learner will determine the quotient with or without a remainder. Given division problems with multiples of 100 as dividends and two digit divisors, the learner will estimate the quotient by rounding off the divisors to the nearest ten. 150a. 151. 
152. 153. 154. 155. 157. 158. 159a. 160. 161a. 162. 163. 167 Given pictures or models of prisms, cones and pyramids, the learner will correctly identify them. Given a numeral expressed as a power with an exponent less than five, the learner will express it as an ordinary base ten numeral. Given an addition or subtraction decimal problem in horizontal or vertical form, with no more than three decimal places, the learner will find the sum. Given a number with no more than three decimal places, the learner will round to the nearest whole number, tenth or hundredth as requested. Given a number less than 100, the learner will identify the factors of the given number. Given a number, the learner will identify multiples of the given number. Given division problems with two digit dividends and two digit multiples of ten as divisors, the learner will find the quotients, with or without a remainder. Upper'Level Given division problems with three digit dividends and two digit multiples of ten as divisors, the learner will find the quotients, with or without a remainder. Given word problems, requiring division, the learner will give the equation and find the quotient. Given the measurement of each side of a polygon, the learner will find the perimeter of the given polygon. Given a list of familiar objects, the learner will choose the volume measure (cu, cm., cu, dm., cu.m.) that would be nearest in size. Given a division problem with a two digit divisor, a four digit dividend, with or without a remainder, the learner will find the quotient. Given a story problem, with whole numbers and requiring only one operation, (addition, subtraction, multiplication or division) the learner will choose the correct operation and do the computation. 164. 165. 166. 167. 168. 169. 170. 171. 172. 173. 174. 175. 176. 177. 178. 179. 168 Given a list of numbers, the learner will identify prime numbers less than 100 (by circling them). 
Given a pair of numbers, each less than 60, the learner will identify their greatest common factor. Given a fraction, the learner will reduce it to its simplest form. Given a number line segment (0, 1) with dots indicating division of the segment into equal segments, the learner will identify the fraction correspondong to a particular dot. Given an improper fraction, the learner will write it as a mixed number. Given a mixed number, the learner will write it as an improper fraction. Given a whole number and a mixed number, the learner will find their sum. (Given a whole number and a mixed number, the learner will find the difference. Given a decimal fraction, the learner will rename it as a common fraction. Given a common fraction whose decimal equivalent terminates in two places or less, the learner will write its decimal equivalent. Given a whole number and a fraction less than one, the learner will multiply to find the product. Given a multiplication problem with fractions less than one as factors, the learner will find the product in simplest form. Given a fractional number and a mixed number, the learner will find the product. Given multiplication problems having two mixed numerals, the learner will find the product. Given a common fraction whose decimal equivalent terminates in three places or less, the learner will rename the common frac- tion as a decimal fraction a = 5/10 = .5. Given any six digit numeral, the learner will rewrite it with expanded notation, first by using place value words, and then by using numerals. 180. 181. 182. 183. 184. 185. 186. 187. 187. 188. 189. 191. 192. 193. 194. 169 Given a pair of numbers, each less than 20, the learner will identify their least common multiple. Given two fractional numbers, that may or may not require renaming, the learner will find their sum. Given two mixed numbers that may or may not require renaming, the learning will find their sum. 
Given subtraction problems involving mixed numerals, the learner will subtract mixed numerals with renaming and find the differ- ence in simplest form. Given two unequal fractions, with denominators of 2, 3, 4, 6 or 8, the learner will tell which is greatest in value. Given a measurement involving two units in the same system, the learner will multiply the measurement by a whole number and regroup as necessary. Given a division problem with a dividend of no more than five digits and divisor with no more than three places, the learner will find the quotient, with or without remainder. The remainders will be written as fractions in simplest form. Given a list of fractional numbers, the learner will write the reciprocal of a number. Given two fractional numbers, less than one, the learner will find the quotient. Given a whole number divisor and a fraction, the learner will find the quotient. Given a whole number dividend and a fractional divisor, the learner will find the quotient. Given a fraction and a mixed number, the learner will find the quotient. Given two mixed numbers, the learner will find the quotient. Given a numeral from .001 through hundred millions, the learner will read and identify numerals, expanded numerals, or in word form. Given a list of numerals from .999 to 1,000,000, the learner will round off each numeral to the place value indicated in the heading. 195. 196. 197a. 198a. 199a. 200a. 201a. 202a. 203a. 204a. 205a. 206a. 207a. 208a. 170 Given a multiplication of decimal problems, with no more than five digits and no more than three decimal places, the learner will find the product. Given a decimal division problem in which the divisor and divi- dent have no more than five digits and no more than three decimal places, the learner will find the quotient. Given a set of equations, the learner will label, identify or compute as indicat-d, using the properties of addition and multiplication (dist., assoc., 1's and O's). 
Given a measurement such as 1.463 meters, the learner will express 1t as one meter + four decimeters + six centimeters + three millimeters. Given diagrams or models of points, lines and planes, the learner will associate each diagram with one of the words: point, line, plane. Given drawings of parallel lines and perpendicular lines, the learner will associate each diagram with the correct words (paral- lel, perpendicular). Given a circle and its related parts, the learner will identify the center, radius, diameter and circumference. Given diagrams of segments, lines, rays and angles, the learner will select and name each as requested. Given a set of pictured angles, the learner will select those which are right angles. Given the formula for finding the area of a triangle and the measures of the base and height of the triangle, the learner will find the area. Given an English or a metric table of equivalent measurements, the learner will convert from one to another within the same system. Given a circle with its radius or diameter, the learner will find its circumference. Given the formula for finding the area of a circle and the measurement of the radius or diameter of a circle, the learner will find its area. Given a protractor, the learner will read the measure of any given angle from O0 to 1800--within two degrees. 209a. 210a. 211a. 212. 213. 214. 215. 216. 217. 218. 219. 171 Given a drawing of a rectangular solid with its dimension (small whole numbers), the learner will compute the volume. Given a coordinates, the learner will locate the points on the grid. Given three pairs of coordinates and a grid, the learner will construct a line graph. Given a square subdivided into an area of 10 x 10 unit squares, some of which are shaded, the learner will state the indicated ratio and percent represented by the shaded area. Given a list of ratios, the learner will express an equivalent ratio of the given ratios. 
Given a list of one or two digit decimal numerals less than one, the learner will express them as percents. Given a ratio and the numerator or denominator of an equivalent ratio, the learner will write the missing numerator or denomin- ator of the equivalent ratio. Given a set of percents and two digit decimal numerals, the learner will write them as fractions in simplified form. Given a set of proportion problems, where the given terms and the answer are each whole numbers less than 100, the learner will find the solution. Given a percent problem, the learner will write the appropriate proportion needed to solve the problem in the form, 2 n -..... 11: 25 0.2.25. 8 10 ’8 TOO’ n 100 Given a set of problems, involving the three types of percent, the learner will write the appropriate proportion and solve the problem. APPENDIX VII Computer Program for the Simulation 172 173 "000 m0vm~ .h4.>~u~ .h4.>.u~ Duo Juqu H own on o o 00 u» 00 gnaw om Up 00 .Num Ob Up 00 .na .>ou.omwmuvm h~2.~u o. ubmwauo oouhmwm oflduc oumuz oflmuz onumz muzonflfi ocu U0 ooonozaw o.xnocomquoAXuocommwdc«xHoNonuvco«XNoNov muzohaz .c~a.muaomuuou~Q.&QNNO.KQN~O.¢QHNU.1QH«O.N quuzqm mu amm234 m4QUIHH>QOIH oON\An+>QUI-hQUI >QOIHQUL mhm Uh UU H+.J¥vOZH~J¥u04 mmn Oh U0.>QDLoon«XXVwoquQDLoPJoa!¥wm~u~ «M.MNJ¥ 00m 00 OON\A>QDL~vhQOI oON\~QDI-bQOLH OnINQUIu u+>QDIuN>QUI~ oON\Au+>&UI~vhQUL >QDIHQUL 00m Uh 00 u+AJ¥~mZH~JXVW£ «mm Oh U05>QDLoon~¥¥vQOI~vhQUL oON\~QUI~vhQOL~ O~INQUL~ mULouNXX mhm UQ OUaJxvmé ONAJXuOL “HoanX mow OD WDZHhZUU mummohwoafifi~ozoz§Ou<>v\diONODxa Oomla“mozuhquqmiwm<>¥wa4>v\¢zmflw3xa A.WUZVthJmioumfiua<>v\mZOHO¥m aamOZ.hv\mimNm¥m Amuzimuzvh<04&\nanDW+OZDm.IuwQZJmAAmO MDzthDU 00¢ szhUIszMhD+¢ZONQzQ uQZM»O+mzuNMZU uzwhOIQAMhOIlzwaN«Q2uhO ((uzolnfifivaQZuhO azwbmfiuazwhm+vzwflczw Hazwhm+m2mflmzw ozwhmidzMPmrozwhmuadzmhm 4vba0mfloum u<>uhuUWHmOm vh Q A .m K «moztmuzuh wazuhzuu omN mom+UmOszHOm023m AXWVJNAXVHQHQOW qum+omm23mucmw23m (~42 ooN wm