Michigan State University

This is to certify that the dissertation entitled

TEST CHARACTERISTICS AND THE BIAS AND SAMPLING VARIABILITY OF CRITERION-REFERENCED RELIABILITY COEFFICIENTS

presented by Loraine Son has been accepted towards fulfillment of the requirements for the Ph.D. degree in Psychology.

Date 2/2483
TEST CHARACTERISTICS AND THE BIAS AND SAMPLING VARIABILITY OF CRITERION-REFERENCED RELIABILITY COEFFICIENTS

by Loraine Son

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Department of Psychology

1983

ABSTRACT

TEST CHARACTERISTICS AND THE BIAS AND SAMPLING VARIABILITY OF CRITERION-REFERENCED RELIABILITY COEFFICIENTS

By Loraine Son

The present study examined the bias and sampling variability of the major single test administration criterion-referenced reliability coefficients given various test parameters. These coefficients were categorized by the type of loss function (squared error or threshold) upon which they were based and by whether they included a chance agreement correction. The extent of bias was studied as a function of different parallelism conditions (classic versus random), distribution shapes, and cut-off scores. Distributions not belonging to the beta-binomial family and the random parallelism condition were included in the study to examine the robustness of several coefficients to violations of their underlying assumptions. Each coefficient's sampling fluctuation was investigated for various test lengths and sample sizes.

Data from the Michigan Educational Assessment Program and from a psychology mid-term exam were used to generate item domains as well as test scores, and were altered to reflect the various parameters investigated. For each cell of the design, population coefficients were computed from either randomly or classically parallel alternate forms.
To determine the magnitude and direction of bias, the population values were compared to the mean of the corresponding single test administration sample estimates taken across many samples. The standard deviation of these sample estimates indicated the coefficient's sampling variability.

For distributions derived from homogeneous item domains, all the coefficients, except the kappa estimates, were robust to violation of the classic parallelism assumption. For the other distributions, the parallelism condition did affect the coefficients' biases, although not always in the expected direction. Generally, as the cut-off score approached a distribution's mean, the squared error coefficients became more biased for randomly parallel tests consisting of heterogeneous items. The cut-off score also significantly affected the bias of the threshold agreement coefficients. However, the results, generally, did not follow a pattern. The hypothesis that the p₀ and kappa estimates would be more biased for distributions which were not beta-binomial was unsupported. Sampling variability decreased as the test length and sample size increased. Based on these results, recommendations were made about which coefficient to use within each category given various test conditions.

ACKNOWLEDGEMENTS

I wish to express my most sincere and warmest thanks to my chairperson, Dr. Neal Schmitt, who introduced me to this area of study and whose suggestions guided me through the most difficult times. Without his sage advice as well as his calm, rational, and nurturant manner during these periods, the project may never have reached its final stages. His patience and support have been invaluable.

I also wish to express my appreciation to Dr. Raymond Frankmann, Dr. William Mehrens, and Dr. Frederic Wickert for their timely, needed advice and their willingness to respond with constructive feedback under the pressure of a short time frame.
Apart from my committee members, other individuals have contributed notably to this project; I am indebted to Bryan Coyle for his kindness in offering to apply his computer expertise when needed, and to Kathy Sigafoose as well as those who assisted her, Kathy Cooper and Janet Larrimore, for the hard work and time they devoted to typing and preparing this manuscript.

Finally, I wish to thank my parents for not only providing love, support, and patience through the more trying times of this project, but also for their understanding, kindness, devotion, and sense of humor which have guided me throughout my life. For these priceless contributions and with much love and respect, I dedicate this work to them.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
TEST CHARACTERISTICS AND THE BIAS AND SAMPLING VARIABILITY OF CRITERION-REFERENCED RELIABILITY COEFFICIENTS
    Criterion-Referenced Measurement and Tests
    Reliability
        Reliability Formulations Based Upon Squared Error Loss
        Reliability Formulations Based Upon Threshold Loss
        Synthesis
METHOD
    Data Base
    Procedure
        Test Characteristics
        Data Generation
        Determination of Bias
RESULTS
    Population Values
    Bias
    Sampling Variability
DISCUSSION
SUMMARY AND CONCLUSIONS
APPENDICES
    A1   Mean Bias and Standard Deviation of Livingston's K²(X,T) Across Samples of 25 Examinees
    A2   Mean Bias and Standard Deviation of Livingston's K²(X,T) Across Samples of 35 Examinees
    A3   Mean Bias and Standard Deviation of Livingston's K²(X,T) Across Samples of 50 Examinees
    A4   Mean Bias and Standard Deviation of Brennan and Kane's Φ(λ) Across Samples of 25 Examinees
    A5   Mean Bias and Standard Deviation of Brennan and Kane's Φ(λ) Across Samples of 35 Examinees
    A6   Mean Bias and Standard Deviation of Brennan and Kane's Φ(λ) Across Samples of 50 Examinees
    A7   Mean Bias and Standard Deviation of Brennan and Kane's Φ Across Samples of 25 Examinees
    A8   Mean Bias and Standard Deviation of Brennan and Kane's Φ Across Samples of 35 Examinees
    A9   Mean Bias and Standard Deviation of Brennan and Kane's Φ Across Samples of 50 Examinees
    A10  Mean Bias and Standard Deviation of Marshall's p₀ Across Samples of 25 Examinees
    A11  Mean Bias and Standard Deviation of Marshall's p₀ Across Samples of 35 Examinees
    A12  Mean Bias and Standard Deviation of Marshall's p₀ Across Samples of 50 Examinees
    A13  Mean Bias and Standard Deviation of Subkoviak's p₀ Across Samples of 25 Examinees
    A14  Mean Bias and Standard Deviation of Subkoviak's p₀ Across Samples of 35 Examinees
    A15  Mean Bias and Standard Deviation of Subkoviak's p₀ Across Samples of 50 Examinees
    A16  Mean Bias and Standard Deviation of Huynh's p₀ Across Samples of 25 Examinees
    A17  Mean Bias and Standard Deviation of Huynh's p₀ Across Samples of 35 Examinees
    A18  Mean Bias and Standard Deviation of Huynh's p₀ Across Samples of 50 Examinees
    A19  Mean Bias and Standard Deviation of Subkoviak's κ Across Samples of 25 Examinees
    A20  Mean Bias and Standard Deviation of Subkoviak's κ Across Samples of 35 Examinees
    A21  Mean Bias and Standard Deviation of Subkoviak's κ Across Samples of 50 Examinees
    A22  Mean Bias and Standard Deviation of Huynh's κ Across Samples of 25 Examinees
    A23  Mean Bias and Standard Deviation of Huynh's κ Across Samples of 35 Examinees
    A24  Mean Bias and Standard Deviation of Huynh's κ Across Samples of 50 Examinees
LIST OF REFERENCES

LIST OF TABLES

1.  Characteristics of Each Randomly Parallel Alternate Form
2.  Characteristics of Each Classically Parallel Alternate Form
3.  Classical Reliability of Randomly and Classically Parallel Alternate Forms for Each Distribution/Test Length Combination
4.  Alternate Form Population Values of Livingston's K²(X,T) for Each Cell of the Design
5.  Population Values of Brennan and Kane's Φ(λ) for Each Cell of the Design
6.  Population Values of Brennan and Kane's Φ for Each Cell of the Design
7.  Alternate Form Population Values of p₀ for Each Cell of the Design
8.  Alternate Form Population Values of Kappa for Each Cell of the Design
9.  Mean Bias (Across Cells) of Various Coefficients in Estimating the Reliability of Classically and Randomly Parallel Alternate Forms
10. Mean Bias Across Cells of Each Reliability Coefficient for Each Distribution
11. Mean Bias Across Cells of Each Coefficient for Each Cut-off Score
12. Mean Bias Across Cells of Each Coefficient for Every Parallelism/Distribution/Cut-off Score Combination
13. Mean Standard Deviation Across Cells of Each Coefficient for Each Test Length
14. Mean Standard Deviation Across Cells of Each Coefficient for Each Sample Size
15. Mean Standard Deviation Across Cells for Each Sample Size/Test Length Combination
16. Direction of Bias of Each Coefficient for Each Parallelism/Distribution/Cut-off Score Combination
17. Recommended Corrected/Uncorrected Squared Error and Threshold Agreement Coefficients for Each Parallelism/Distribution/Cut-off Score Combination

LIST OF FIGURES

1. Joint Distribution of True and Obtained Classifications
2. Mastery Testing Reliability Formulations
3. Advancement Scores for Each Combination of Test Length and Cut-off Level
4. Skewed Population Frequency Distribution of Domain Scores
5. J-shaped Population Frequency Distribution of Domain Scores
6. Bimodal Population Frequency Distribution of Domain Scores
7. Normal Population Frequency Distribution of Domain Scores
8. Formulas for Both Criterion-Referenced Reliability Population Coefficients Computed from Alternate Forms and Single Test Administration Sample Estimates

TEST CHARACTERISTICS AND THE BIAS AND SAMPLING VARIABILITY OF CRITERION-REFERENCED RELIABILITY COEFFICIENTS

Reliability denotes the consistency of measurement or the extent to which scores are reproducible over repeated testings on different occasions, or over different sets of parallel or randomly parallel items, and/or under other small variations in conditions (Anastasi, 1976). Although it is frequently stated that a test is reliable, reliability actually refers to the consistency of the score interpretation obtained from the test, not to the test, in and of itself.

In industrial-organizational psychology as well as in other applied sciences, norm-referenced interpretations of test scores have become the sine qua non of measurement. In norm-referenced measurement, an individual's score is given meaning by determining the individual's relative standing within a normative group (Popham & Husek, 1969).
Quite appropriately, those reliability coefficients associated with classical test theory and norm-referenced measurement (e.g., correlation between two test administrations, coefficient alpha) indicate the extent to which examinees' relative standings remain consistent.

However, norm-referenced interpretations of data do not satisfy all the measurement needs of psychologists. For example, in many situations, scores ultimately serve as a basis for making dichotomous decisions (e.g., accept/reject) or placing individuals into groups (e.g., successful/unsuccessful). In order to address these and other measurement needs, many educational psychologists and measurement experts have turned to criterion-referenced measurement and, in so doing, have had to create coefficients indicating the extent to which criterion-referenced interpretations of data are reliable.

Criterion-Referenced Measurement and Tests

Norm- and criterion-referenced measurements were distinguished by Glaser (1963) on the basis of the standard used to interpret scores. As previously noted, the former uses the test scores of members of a relevant group as the standard for judging an individual's performance. Consequently, the mean serves as the anchor point of these scales and raw scores are typically converted into standard scores, percentiles, stanines, or ranks (Eignor & Hambleton, 1979). On the other hand, criterion-referenced measurement uses defined levels of criterion behavior along an achievement, skill, or attainment continuum as the performance standard (Glaser, 1963; Glaser & Klaus, 1962). More specifically, the behaviors required to demonstrate competence at each proficiency level are identified and are compared to the behaviors exhibited by an individual on the testing instrument.
A typical criterion-referenced measure is the percentage of items answered correctly. This type of scale has two anchor points, one at each end of the scale, i.e., 0% and 100% (Hambleton & Eignor, 1979). In most circumstances, some evaluation is made as to whether or not an individual's score indicates mastery of a particular skill, objective, etc.; a minimally acceptable level of performance, the cut-off score, is established and an individual is classified as either a master or non-master according to whether his/her score is above or below this predetermined level of competence (Buck, 1975; Hambleton & Novick, 1973). In other applications, several cut-off scores may be used to divide the examinees into appropriate groups. Contrary to norm-referenced measures, criterion-referenced scores indicate what an individual can and cannot do "independent of reference to the performance of others" (Glaser, 1963, p. 520; Glaser & Klaus, 1962). In short, norm-referenced measures employ a relative standard, while criterion-referenced measures are based upon an absolute standard (Glaser, 1963).

Situations where criterion-referenced measurement would be more appropriate than norm-referenced measurement are easily discernible. The standards used indicate that both are appropriate for making various decisions about individuals (Popham & Husek, 1969; Wardrop, Anderson, Hively, Hastings, Anderson, & Muller, 1982). In education, criterion-referenced measurement gained prominence partly due to the necessity of diagnosing student needs and assessing performance in individualized instructional programs (Mehrens & Ebel, 1979).
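The percentage-correct scale and the single cut-off classification described above can be sketched in a few lines. This is only an illustration: the function names, the 0/1 item scoring, and the convention that a score exactly at the cut-off counts as mastery are assumptions made here, not details given in the text.

```python
def percent_correct(item_scores):
    """Percentage of items answered correctly; items are scored 0 or 1 (assumed)."""
    return 100.0 * sum(item_scores) / len(item_scores)


def classify(item_scores, cutoff_percent):
    """Classify an examinee against an absolute standard.

    Assumption: a score at or above the cut-off is treated as mastery.
    No other examinee's score enters the decision.
    """
    if percent_correct(item_scores) >= cutoff_percent:
        return "master"
    return "non-master"


# Hypothetical five-item test with an 80% cut-off.
first_examinee = [1, 1, 1, 0, 1]   # 4 of 5 correct
second_examinee = [1, 0, 1, 0, 1]  # 3 of 5 correct
first_decision = classify(first_examinee, 80)
second_decision = classify(second_examinee, 80)
```

The decision for each examinee depends only on that examinee's responses and the fixed cut-off, which is the sense in which the standard is absolute rather than relative.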
The objective of measurement in these instances was to determine what skills a student possessed or to simply assess whether a student was proficient in a particular area. Similar types of score interpretations have long existed in various aspects of industrial-organizational psychology such as performance appraisal, job placement, training performance, and personnel selection via the multiple cut-off and the multiple hurdle techniques (Glaser & Klaus, 1962; Goldstein, 1974). In all these areas, a frequent concern has been the determination of an individual's performance independent of others' performance, and decisions have been typically made about the individual's mastery of an objective. Such score interpretations have been particularly prevalent within a free quota system (Wardrop et al., 1982).

Criterion-referenced measurement is also appropriate for making decisions about treatments (e.g., training programs), while norm-referenced measurement is not as suitable (Mehrens & Ebel, 1979; Popham & Husek, 1969). The latter technique is designed to increase the within group variance, while having a small within group variance is desirable when evaluating the effects of different treatments (Popham & Husek, 1969). A typical criterion-referenced measure in this case is the proportion of individuals achieving mastery on a post-test.

As can be seen in the above examples, criterion-referenced measurement is not concerned with rank-ordering individuals, but with drawing conclusions about an individual's behavioral repertoire (Glaser, 1963). The psychometric implications of this score interpretation are far-reaching.
Several experts have suggested that tests built using classical methods do not provide adequate representation of the content needed to make generalizations about what an individual can do (Glaser & Nitko, 1971; Hambleton & Novick, 1973). In response, researchers have focused on the development of criterion-referenced tests.

Various definitions of criterion-referenced tests have been offered. Ivens (1970) proposed the following general definition: "A criterion-referenced test is one composed of items keyed to a set of behavioral objectives" (p. 2). In comparison, a very specific and restrictive definition was offered by Harris and Stewart (1971):

    A pure criterion-referenced test is one consisting of a sample of production tasks drawn from a well-defined population of performances, a sample that may be used to estimate the proportion of performances in that population at which the student can succeed (p. 1).

Similarly, Glaser and Nitko (1971) advanced the following definition: "A criterion-referenced test is one that is deliberately constructed to yield measurements that are directly interpretable in terms of specified performance standards" (p. 653). Expanding upon this definition, Glaser and Nitko (1971) stated:

    Performance standards are generally specified by defining a class or domain of tasks that should be performed by the individual. Measurements are taken on representative samples of tasks drawn from this domain, and such measurements are referenced directly to this domain for each individual measured (p. 653).

These definitions of criterion-referenced tests are sufficiently different that a particular test could be classified as either norm- or criterion-referenced, or could contain characteristics of each depending upon the definition adopted (Hambleton & Novick, 1973).
However, all these definitions imply that criterion-referenced tests are constructed by and dependent upon the existence of a well-specified content domain as well as procedures for generating samples of items from this domain (Hambleton & Novick, 1973).

Some measurement experts questioned the accuracy and relevance of distinguishing between norm- and criterion-referenced tests (Brennan, 1979; Hambleton & Novick, 1973; Mehrens & Ebel, 1979). On the other hand, Hambleton and Eignor (1979) contended that a distinction should be made since a methodology now exists for constructing the latter tests. Mehrens and Ebel (1979) observed that any test, whether it be criterion- or norm-referenced, represents a specified content domain. Moreover, a criterion-referenced test can be used to make norm-referenced measurements and, conversely, criterion-referenced measurements can be derived from norm-referenced tests, although neither of these usages may be very satisfactory (Hambleton & Novick, 1973). Given these facts, the primary distinction appears to be between norm- and criterion-referenced measurement (i.e., interpretation) rather than between different types of tests (Brennan, 1979; Ebel, 1971; Hambleton & Novick, 1973; Mehrens & Ebel, 1979).

Of course, choosing a particular type of measurement prior to test construction has different implications for the method used to determine the items to be included on a test. However, Brennan (1979) proposed that different methods of test construction and item analysis produce changes in the definition of the item domain relevant for each measurement type rather than effect changes in the measurements themselves. The point is that scores can be interpreted using either standard for most tests.
The legitimacy of such an interpretation is a different issue and depends upon the manner used to construct the tests as well as how restrictive a definition one adopts for a criterion-referenced test (Mehrens & Ebel, 1979; Wardrop et al., 1982).

In preparing test items for a criterion-referenced score interpretation, the overriding interest is how well the item samples the content domain or criterion behavior (Wardrop et al., 1982). The reason for such concern is the need for generalizing from specific test item responses to the whole domain of behaviors in order that inferences can be made about what skills the examinee possesses (Hambleton & Eignor, 1979). Although test development for norm-referenced measurement is frequently concerned with defining the domain of interest, criterion-referenced testing involves far more concern with this issue and with obtaining a representative sample of items from this domain (Hambleton & Novick, 1973; Wardrop et al., 1982). In short, the basic steps in constructing tests specifically for criterion-referenced measures are specifying the domain, writing items reflecting these specifications, and selecting items via a sampling procedure (random or stratified random sampling, representative sampling) which assures representativeness. Similarly, the primary approach for conducting an item analysis after test construction is to have content specialists judge whether each item appropriately measures some part of the content domain as well as whether the items adequately sample the domain (Buck, 1975; Hambleton & Eignor, 1979).

An objective of tests designed for norm-referenced measurement has been to maximize variability so that individuals can be reliably rank-ordered.
Norm-referenced measures, such as standard scores, depend upon the existence of variability since they are derived by comparing an individual's scores to the scores of a relevant group. Variability is partly achieved by using classical test development methods to analyze items. The assumption underlying classical methods is that a measurement procedure should provide the most discrimination possible among individuals on a particular characteristic. Consequently, items are largely analyzed and chosen based on their statistical characteristics, e.g., discrimination index, difficulty level, and item-total correlation.

On the other hand, the need for a "criterion-referenced test" to produce variability has been the topic of some debate. Some theorists have contended that variability is irrelevant to criterion-referenced measurement since these scores derive their meaning through a direct comparison with the performance criterion (Millman & Popham, 1974; Popham & Husek, 1969). In addition, many applications (e.g., a post-training test) exist in which the goal may be to have every examinee in the sample achieve mastery and, in so doing, to actually restrict the test score range. In contrast, Woodson (1974a) has argued that a "criterion-referenced test" must produce variability or else it is not informative or useful. The premise for Woodson's argument was that a test should be analyzed and developed on observations representative of those within the range of interest and, as a result, should discriminate between different observations of the characteristic. Using this approach, one would include pre- and post-training test scores in the range of possible observations used to calibrate an instrument (Woodson, 1974a). No variance may exist within the pre-training test nor within the post-training test, but the test should discriminate between these two testing observations.
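The classical item statistics named above, and the reason they lose their meaning when every examinee masters an item, can be illustrated with a small sketch. The function names, the 0/1 scoring, and the tiny response matrix are hypothetical; the item-total correlation is computed here against the uncorrected total score, which is one common convention and an assumption of this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns None when either variable has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined without variability
    return sxy / sqrt(sxx * syy)


def item_statistics(responses):
    """responses: one row per examinee, each row a list of 0/1 item scores."""
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        stats.append({
            "difficulty": sum(item) / len(item),     # proportion answering correctly
            "item_total_r": pearson(item, totals),   # uncorrected item-total correlation
        })
    return stats


# Hypothetical data: item 0 is answered correctly by every examinee.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
]
stats = item_statistics(responses)
```

In this sketch the item that everyone answers correctly has a difficulty of 1.0 and no computable item-total correlation, so a purely statistical selection rule would have no basis for retaining it even if it samples the content domain well.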
In contrast, Millman and Popham (1974) contended that the population of observations for a test designed to elicit criterion-referenced measures is "a domain of items and the responses of a single individual to them" (p. 137). Furthermore, they stated that if items were chosen on the basis of their ability to discriminate between observations, the test would not contain a sample of items truly representative of the content domain. The major difference between these two positions clearly lies in defining the appropriate group to be used for calibrating the scale (Woodson, 1974b). Proponents of both sides agree, however, that items should not be chosen so as to maximize test score variance (Woodson, 1974b). Therefore, the variability can be expected to be lower than for "norm-referenced tests". Moreover, in typical usage, the test score variance may be very limited or non-existent if a test is administered to a sample of examinees who have just completed an instructional program.

Reliability

The possible absence or diminution of score variability and, more importantly, the type of score interpretation associated with criterion-referenced measurement make classical reliability estimates inappropriate for indexing the consistency of these measures. In classical test theory, the reliability coefficient equals the squared correlation between true scores and obtained scores, i.e., the ratio of true to obtained score variance. All of the practical formulations (e.g., correlation between classically or randomly parallel tests, coefficient alpha, split-half reliability) for estimating this ratio require the computation of a correlation coefficient whose size is largely a function of the amount of variability in the sample. As is well known, the more heterogeneous the sample, the higher the reliability coefficient.
This fact is easily seen from the equation for reliability: $r_{tt} = s_T^2 / s_t^2 = 1 - (s_e^2 / s_t^2)$, where $r_{tt}$ equals the reliability and $s_T^2$, $s_e^2$, and $s_t^2$ denote the true score variance, the error variance, and the total score variance, respectively. For any given test, the error variance remains the same from sample to sample, regardless of the size of the total variance, because the size of the error only depends upon the test's inability to provide accurate measures of individual true scores (Magnusson, 1967). However, the total and true score variances increase when a more heterogeneous sample is given the test, resulting in a larger reliability coefficient. Conversely, when no true score variance exists, the reliability equals zero (unless $s_e^2 = 0$, in which case, reliability is undefined).

Due to this dependence upon score variability, a test used for criterion-referenced measurement might be highly consistent in a test-retest sense, and yet the classical reliability estimates might deem it to be unreliable because almost everyone has received the same score. A criterion-referenced measure might even have a negative internal consistency index and still be a reliable measure (Popham & Husek, 1969). In short, classical reliability estimates provide an unjustified pessimistic view of the consistency of criterion-referenced measurement due to the former's dependence upon variability (Buck, 1975). High classical reliability estimates can be used to support a claim of consistency, but low estimates do not indicate a lack of reliability (Popham & Husek, 1969).

As noted previously, criterion-referenced measurement most commonly involves mastery assessment where one cut-off score is used as the performance standard. Therefore, reliability for this type of measurement should assess the dependability of the mastery decision. Clearly, classical reliability estimates are insensitive to this type of consistency.
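The dependence of the classical coefficient on score variability can be checked with a short numerical sketch. It assumes that the total variance is the sum of the true score and error variances, which is what the identity in the equation for reliability implies; the function name and the variance values are hypothetical.

```python
def classical_reliability(true_var, error_var):
    """Ratio of true score variance to total score variance.

    Assumption: total variance = true score variance + error variance.
    Returns None when the total variance is zero (reliability undefined).
    """
    total_var = true_var + error_var
    if total_var == 0:
        return None
    return true_var / total_var


# The same test (error variance fixed at 4.0) given to three hypothetical samples.
error_var = 4.0
heterogeneous = classical_reliability(16.0, error_var)  # wide range of true scores
homogeneous = classical_reliability(1.0, error_var)     # narrow range of true scores
no_true_var = classical_reliability(0.0, error_var)     # everyone at the same true score
undefined = classical_reliability(0.0, 0.0)             # no variance of any kind
```

With the error variance held constant, the coefficient falls as the sample becomes more homogeneous and reaches zero when all examinees share the same true score, even though the precision of the instrument has not changed.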
Since reliability is based on the relationships between true, observed, and error scores, this viewpoint can be presented more clearly by determining the impact of mastery score interpretation upon these variables and their relationships. Marshall (1978) provided an excellent discussion in this area, and much of the following material was derived from his presentation.

In classical test theory, the relationship among obtained score ($X$), true score ($T$), and error ($E$) is expressed by the well-known equation $X = T + E$. The distributions of true and error scores are continuous, while $X$ has a polytomous or many-valued discrete distribution (Marshall, 1978). Theoretically, obtained scores could have a continuous distribution, but measurement instruments do not provide the necessary discriminations (Marshall, 1978).

The effect of mastery testing upon this basic equation can be easily seen if testing is viewed within a decision-theoretic framework (Hambleton & Novick, 1973). In mastery testing, one wants to decide whether an examinee's true performance level is above or below a threshold or cut-off score; mastery testing can be viewed as a classification problem (Hambleton & Novick, 1973). Therefore, the comparable equation for mastery testing is $D = C + M$, where $D$, $C$, and $M$ represent the obtained classification, the true classification, and the instance as well as the direction of misclassification, respectively (Marshall, 1978). This model differs from its classical test theory counterpart in that all the variables in the equation are discrete as well as dichotomous given the absolute value of the misclassification error (Marshall, 1978). Viewed in this way, mastery testing results in a Platonic true score model (Marshall, 1978).

Using a Platonic instead of a classical true score model for mastery testing has implications for the determination of reliability.
First, according to Marshall (1978), statistics such as a mean or a variance are "theoretically not meaningful" for the former model since observed and true scores in the Platonic true score model cannot be attributed with more than ordinal properties (p. 4). (This point is debatable; one could argue that these scores have interval properties when only two mastery levels exist since there is one interval equal to itself.) The absolute value of the misclassification error can also be assumed to be ordinal (Marshall, 1978). Second, measurement error is defined differently for the two models. In classical test theory, one is concerned with the size of the error. However, in the Platonic true score model, error can only be defined in terms of the existence of misclassification, not its size, i.e., the examinee is either correctly or incorrectly classified (Marshall, 1978). Moreover, these two types of measurement error need not be highly correlated (Marshall, 1978).

Given these facts, the issue in assessing reliability for mastery testing is whether an examinee is assigned to the same mastery state on parallel tests or on a retest (Hambleton & Eignor, 1979; Hambleton & Novick, 1973). Swaminathan, Hambleton, and Algina (1974) defined mastery testing reliability as "the measure of agreement between the decisions made in repeated test administrations" (p. 264). Consequently, given the Platonic true score model, the appropriate loss function for reliability estimation is threshold loss, where loss is either zero or one depending upon whether the two testing procedures assign the examinee to the same or to different mastery states, respectively (Hambleton & Novick, 1973; Marshall, 1978).
The correlational reliability estimates use a squared error loss function and are, therefore, inappropriate (Hambleton & Novick, 1973).

Some correlational statistics, the phi and the tetrachoric coefficients, do use a threshold loss function and, therefore, might seem appropriate for assessing reliability in the Platonic model. Marshall (1978) examined the ability of these coefficients to accurately measure the squared correlation between the obtained and true classifications (i.e., classical reliability) given two mastery states. The phi coefficient was found to be deficient on both theoretical and statistical grounds. As is well known, the phi coefficient is only appropriate for true dichotomies. One can easily argue that the mastery/non-mastery dichotomy is artificial since the underlying variable is continuous and the setting of the cut-off score is somewhat arbitrary (Glass, 1978; Marshall, 1978). The statistical problem is that phi can be negative when a negative value does not correctly reflect the relationship between the true and obtained classifications. More specifically, if either the true positive or true negative classification is zero and the false positive and false negative classifications are non-zero, the phi coefficient will be negative (Marshall, 1978). This would mean, for instance, that even though there were only a few true non-masters (5%, say), if they are all misclassified then phi is negative, even though 90% or more of the examinees are correctly classified as masters (Marshall, 1978, p. 7).
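The negative-phi case can be checked directly. The sketch below assumes a 2 x 2 table of true by obtained classifications with hypothetical cell counts chosen to match the example above (5% true non-masters, all misclassified; 90% of examinees correctly classified as masters); the function name and cell labels are illustrative.

```python
from math import sqrt


def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table of true by obtained classification.

    a = true master, classified master
    b = true master, classified non-master
    c = true non-master, classified master
    d = true non-master, classified non-master
    """
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        return None  # phi is undefined when a margin has no variability
    return (a * d - b * c) / denominator


# Hypothetical counts per 100 examinees: every true non-master is misclassified.
phi = phi_coefficient(a=90, b=5, c=5, d=0)
print(phi)  # negative, although 90% of examinees are correctly classified

# No variability in either classification: phi cannot be computed.
print(phi_coefficient(a=100, b=0, c=0, d=0))
```

The second call also shows the undefined case that arises when all examinees share the same true and obtained status.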
One other problem with phi occurs when no variability exists with respect to the true mastery status and/or the obtained classification. In this instance, phi is undefined.

In contrast to phi, the tetrachoric correlation coefficient is appropriate when the two variables are artificially dichotomized. However, this coefficient assumes that the two variables have a bivariate normal distribution while mastery test score distributions are often bimodal (Marshall & Serlin, 1979).

Since dichotomous variables have ordinal data properties, Marshall (1978) also considered the Spearman rank order correlation coefficient. However, this statistic is inappropriate when a substantial number of tied ranks exist (Marshall, 1978). This problem cannot be solved by computing the Pearson $r$ using the tied ranks as data points since the resultant formula is algebraically equivalent to the phi coefficient and, consequently, is subject to the problems previously discussed (Marshall, 1978).

Since none of the correlational approaches proved satisfactory, Marshall (1978) examined the ratio of the true classification variance to the obtained classification variance:

$$\frac{\pi(1-\pi)}{p(1-p)}$$

where $\pi$ and $p$ are the true and obtained proportions of mastery classification, respectively. This formula can also be expressed in terms of the cell frequencies shown in Figure 1, i.e.,

$$\frac{\pi(1-\pi)}{p(1-p)} = \frac{(A+B)(C+D)}{(A+C)(B+D)}$$

(Marshall, 1978). Marshall also found several statistical problems with this formula.
First, if $A = D$ or $B = C$, the ratio equals one, regardless of the frequencies contained in the other two cells. Second, the ratio can be greater than one if […]

[…] KR-21 (Brennan, 1979). Correspondingly, Schmitt and Schmitt (1977) found that the average KR-20 over 147 criterion-referenced tests was equal to .53 while the average $K^2(X, T_X)$ was .67, and the difference between these coefficients increased as the distance between the mean and the cut-off increased. Likewise, Downing and Mehrens (1978) found that the mean value of $K^2(X, T_X)$ taken over 33 achievement tests was greater than the mean values of KR-20 and KR-21.

Livingston (1972b) presented two reasons why the value of $K^2(X, T_X)$ and $\Phi(\lambda)$ should change as the distance between the mean and the cut-off changes: (1) if an individual's obtained score is farther away from the cut-off, his true and obtained scores are more likely to lie on the same side of the cut-off. "Then, if two groups of scores have equal variance and equal reliability in the norm-referenced sense, the group of scores whose mean is farther away from the criterion score must have the greater criterion-referenced reliability" (p. 18); and (2) a change in the cut-off leads to a different interpretation of scores. Similarly, Brennan and Kane (1977b) viewed an increase in the distance between the mean and the cut-off as an increase in the ability to detect the signal.

Others have stated that these coefficients' sensitivity to the relative position of the cut-off is either inappropriate or undesirable (Harris, 1972a, 1973; Shavelson, Block, & Ravitch, 1972). Shavelson et al. (1972) believed that the cut-off score's effect on the size of $K^2(X, T_X)$ means the latter does not directly reflect the measurement's repeatability. (This argument could also pertain to $\Phi(\lambda)$.)
However, given the desire to interpret scores in relation to $C_X$, the difference between the mean and $C_X$ reflects true variance and, therefore, $K^2(X, T_X)$ does reflect the measurement's consistency (Livingston, 1972c).

Harris (1972a) proved that $K^2(X, T_X)$ equals the norm-referenced reliability coefficient computed on pooled data from two populations having equal $\sigma_T^2$, equal $\sigma_X^2$, and means equidistant above and below the cut-off. According to Harris (1972a), $K^2(X, T_X)$ is deficient because ceiling and floor effects do not always allow one to postulate the existence of two means equidistant from $C_X$. Therefore, the higher reliabilities obtained with $K^2(X, T_X)$ are simply due to implicitly increasing the range of talent (Harris, 1972a). In rebuttal, Livingston (1972a) stated, "Criterion-referenced test score interpretations do not require that the criterion score be conceptualized as the mean of some distribution" (p. 9). Simply stated, one must reject the notion that the first moment of a distribution has to be the mean (Lovett, 1977).

Livingston's coefficient was also criticized because the standard error of measurement remains constant even though $K^2(X, T_X)$ increases as the cut-off score moves further from the mean, i.e., the use of the higher $K^2(X, T_X)$ as opposed to a classical coefficient does not lead to a more dependable estimate of where a particular examinee truly falls relative to $C_X$ (Harris, 1972a; Shavelson et al., 1972). This criticism also applies to $\Phi(\lambda)$ (Brennan & Kane, 1977a). However, reliability refers to the dependability of a group of scores, not a single score (Livingston, 1972a). When a mastery decision must be made for every group member, the larger value of $K^2(X, T_X)$ implies a more reliable overall estimate of each member's mastery state (Livingston, 1972a). The situation is analogous to the effect that an increase in variance has upon a classical coefficient.
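The point at issue in this exchange can be made concrete. The sketch below assumes the standard form of Livingston's (1972) coefficient, $K^2(X, T_X) = [r\,\sigma_X^2 + (\mu_X - C_X)^2] / [\sigma_X^2 + (\mu_X - C_X)^2]$, which is not reproduced in this section, together with the usual standard error of measurement $\sigma_X \sqrt{1 - r}$. The function names and all numerical values are hypothetical.

```python
from math import sqrt


def livingston_k2(reliability, variance, mean, cutoff):
    """Livingston's K^2 in its standard form (assumed here, see lead-in):
    classical reliability adjusted for the squared mean-to-cut-off distance."""
    squared_distance = (mean - cutoff) ** 2
    return (reliability * variance + squared_distance) / (variance + squared_distance)


def standard_error_of_measurement(reliability, variance):
    """Classical SEM; it does not involve the cut-off score at all."""
    return sqrt(variance * (1.0 - reliability))


# Hypothetical test: classical reliability .60, variance 9, mean 20.
r, var, mean = 0.60, 9.0, 20.0
for cutoff in (20.0, 18.0, 16.0, 14.0):
    k2 = livingston_k2(r, var, mean, cutoff)
    sem = standard_error_of_measurement(r, var)
    print(f"cut-off = {cutoff:4.1f}  K2 = {k2:.3f}  SEM = {sem:.3f}")
```

When the cut-off equals the mean the coefficient reduces to the classical value, and it grows as the cut-off moves away while the standard error of measurement stays fixed, which is exactly the behavior that Harris and Shavelson et al. criticized and Livingston defended.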
Moreover, the standard error of measurement and the squared error criterion-referenced reliability coefficients provide different information: the former measures the variability of an individual's scores independent of the cut-off score, while the latter indicates the consistency of scores relative to the cut-off (Berk, 1980).

In another critique, Harris (1973) showed that the squared standard errors of estimate associated with a linear prediction of true score and of the observed score on a parallel test increase when $K^2(X, T_X)$ is substituted for $r_{11}$ in the regression equations. Livingston (1973) considered this substitution inappropriate because $r_{11}$, not $K^2(X, T_X)$, is the least squares linear regression coefficient. Replacing $r_{11}$ by $K^2(X, T_X)$ removes the regressor for the mean and clearly results in an increased residual variance (Livingston, 1973).

Finally, the appropriateness of $K^2(X, T_X)$ and $\Phi(\lambda)$ was questioned because these coefficients increase as the cut-off moves from the mean toward either mode of a symmetric bimodal distribution (Marshall, 1976; Marshall & Serlin, 1979; Subkoviak, 1976). Intuitively, one would expect the opposite to be true, i.e., $K^2(X, T_X)$ and $\Phi(\lambda)$ should be greatest when the mean equals the cut-off since the mean is the point of lowest score concentration and, therefore, the point at which more people should be reliably assigned to mastery states (Marshall, 1976; Subkoviak, 1976). Clearly, this counterintuitive relationship also applies to a unimodal skewed distribution (Marshall & Serlin, 1979). In short, the magnitude of $K^2(X, T_X)$ and $\Phi(\lambda)$ is sensitive to the distance between the mean and the cut-off, but not to the cut-off's relative position to the mode or to heavy score density areas (Marshall & Serlin, 1979).

This criticism is unwarranted given that $K^2(X, T_X)$ and $\Phi(\lambda)$ define reliability as the average squared deviation from the cut-off.
Like any mean, this average is heavily affected by outliers present in skewed distributions. As the cut-off approaches the mode of such distributions, the outliers become even more influential, resulting in an increased average squared deviation. An analogous process occurs for bimodal distributions. In summary, when the cut-off approaches heavy score density areas, more individuals are likely to be misclassified but the reliability in terms of the average squared deviation increases and is appropriately reflected by the magnitude of $K^2(X, T_X)$ and $\Phi(\lambda)$.

Reliability Formulations Based Upon Threshold Loss

Carver. Carver (1970) introduced two methods for assessing the reliability of mastery measurement. One method consisted of administering the same test to two comparable groups and comparing the percentages of examinees achieving mastery in each group. In the other procedure, the percentages of examinees achieving mastery on two parallel tests are compared. Both procedures are subject to the same limitation: the two percentages compared can be equal even if the measure unreliably classifies every individual (Subkoviak, 1978b). For example, according to the second procedure, perfect reliability can be obtained when 40% of the examinees are classified as masters based on the first test administration and a different 40% are classified as masters on the second administration (Subkoviak, 1978b). Another problem with these procedures is that they do not allow consistent non-mastery decisions to contribute to the reliability measure.

Hambleton and Novick. Hambleton and Novick (1973) suggested that reliability be expressed as the proportion of times a consistent mastery decision is made with two parallel measurement procedures. (Hambleton and Novick (1973) do not use the proportion correct score as an examinee's true score estimate nor as the whole basis for mastery classification.
First, they recommend using a Bayesian estimation procedure to determine the probabilities of an examinee being a master and a non-master. Then, based upon the criterion of minimizing threshold loss, these probabilities and the estimated losses caused by making erroneous decisions are used to classify an examinee.) Given $m$ mastery states, their index can be expressed as:

$$p_o = \sum_{i=1}^{m} p_{ii}$$

where $p_{ii}$ is the proportion of people classified in the $i$th mastery state on both test administrations (Hambleton & Eignor, 1979). This coefficient is frequently called the coefficient of agreement. Although they certainly were not referring to mastery testing at the time of their writing, Goodman and Kruskal (1954) had suggested using this index as a reliability measure for two polytomies consisting of the same classes.

The upper limit of $p_o$ is, of course, one. Its size is partly a function of the magnitude of the cut-off score relative to the examinees' ability level. For example, $p_o$ will be high when the cut-off score is very low and the examinees have just completed a training program relevant to the tested skill (Millman, 1974). In other words, $p_o$ does not take into account the proportion of agreement expected merely by chance (Kane & Brennan, 1977; Swaminathan et al., 1974). This fact has led to criticism of this index since, as long as the base rate for one category is high, $p_o$ can be high even if the measurement procedure does not contribute to correct classification.

Goodman and Kruskal; Koslowsky and Bailit. In 1954, Goodman and Kruskal advanced an alternative to $p_o$. They recommended using their index when no relevant continuum underlying the classification scheme existed and when the classifications did not have ordinal properties (Goodman & Kruskal, 1954). One can easily argue that mastery measurement satisfies neither of these conditions.
However, using their measure is possible when only two classifications exist since the ordinal properties are largely irrelevant and since interest lies in evaluating mastery, not an examinee's score on the underlying continuum. The proposed reliability measure is:

$$\lambda_r = \frac{\sum p_{ii} - \left[\tfrac{1}{2}\,(p_{M\cdot} + p_{\cdot M})\right]}{1 - \left[\tfrac{1}{2}\,(p_{M\cdot} + p_{\cdot M})\right]}$$

where $\sum p_{ii}$ is defined as previously, and $p_{M\cdot}$ and $p_{\cdot M}$ represent the marginal proportions corresponding to the modal category for rows and columns, respectively. The numerator equals the decrease in the probability of misclassification occurring when an examinee's mastery status is known on one test as opposed to when no information is available (Goodman & Kruskal, 1954). In the latter case, the best guess of the examinee's status is the modal class (Goodman & Kruskal, 1954). The denominator equals the probability of misclassification given no information, and the coefficient equals the proportionate decrease in the probability of misclassification as one moves from the no information situation to a situation where the individual's status is known on one test administration (Goodman & Kruskal, 1954).

Koslowsky and Bailit (1975) expanded upon this formula to determine the reliability of a series of items. This extended index can be used to assess the reliability of a series of mastery decisions. Their measure simply equals the average of $\lambda_r$ taken over all the mastery decisions ($J$):

$$\bar{\lambda}_r = \frac{1}{J} \sum_{j=1}^{J} \left[ \frac{\sum p_{ii} - \left[\tfrac{1}{2}\,(p_{M\cdot} + p_{\cdot M})\right]}{1 - \left[\tfrac{1}{2}\,(p_{M\cdot} + p_{\cdot M})\right]} \right]_j$$

A problem with $\lambda_r$ and $\bar{\lambda}_r$ is that they are indeterminate when all examinees are masters (non-masters) and both test administrations classify them as such. Clearly, the measurement is perfectly reliable in this case. Koslowsky and Bailit (1975) suggested automatically assigning a value of 1 to $\lambda_r$ when this situation occurs.
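A small numerical sketch may help fix the two indices just described. It assumes a square table of joint proportions (rows for the first administration, columns for the second) and reads the modal category as the single class with the largest combined row and column marginal; that reading is an assumption, and in the two-state example below the same class is modal in both margins, so either reading gives the same value. The function name and the proportions are hypothetical.

```python
def agreement_and_lambda(table):
    """Return (p_o, lambda_r) for a square table of joint proportions.

    table[i][j] = proportion of examinees placed in state i on the first
    administration and state j on the second.
    """
    k = len(table)
    p_o = sum(table[i][i] for i in range(k))
    row = [sum(table[i]) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Assumption: one modal class, chosen on the combined marginals.
    modal = max(range(k), key=lambda i: row[i] + col[i])
    half_modal = 0.5 * (row[modal] + col[modal])
    if abs(1.0 - half_modal) < 1e-12:
        # Everyone falls in one state on both occasions: lambda_r is indeterminate.
        return p_o, None
    return p_o, (p_o - half_modal) / (1.0 - half_modal)


# Hypothetical proportions: state 0 = master, state 1 = non-master.
table = [[0.70, 0.10],
         [0.05, 0.15]]
p_o, lambda_r = agreement_and_lambda(table)
print(p_o, lambda_r)

# The indeterminate case discussed above: all examinees consistent masters.
print(agreement_and_lambda([[1.0, 0.0], [0.0, 0.0]]))
```

The second call returns no value for the lambda index even though agreement is perfect, which is the situation for which Koslowsky and Bailit proposed assigning a value of 1.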
Cohen (1960) questioned the appropriateness of $\lambda_r$ as a reliability index since using the modal category as the "best guess" in the no information situation is more logical within the context of prediction rather than reliability.

Swaminathan, Hambleton, and Algina. To eliminate the influence of chance agreement found with $p_o$, Swaminathan et al. (1974) proposed using Cohen's coefficient kappa, $\kappa$. This coefficient is defined as:

$$\kappa = \frac{p_o - p_c}{1 - p_c}$$

where $p_c$ is the proportion of agreement expected by chance alone, or

$$p_c = \sum_{i=1}^{m} p_{i\cdot}\, p_{\cdot i}$$

(Cohen, 1960). The symbols $p_{i\cdot}$ and $p_{\cdot i}$ represent the marginal proportions in a joint classification of the same decision categories on two test administrations, or the proportion of examinees assigned to a mastery state, $i$, on the first and second test administrations, respectively (Swaminathan et al., 1974). Therefore, $p_c$ is actually a function of the group composition and is the proportion of agreement one would obtain regardless of whether or not the two administrations were statistically independent (Hambleton & Eignor, 1979).

The numerator of $\kappa$ equals the difference between the obtained and the chance proportions of agreement while the denominator equals the maximum value this difference can assume (Millman, 1974). Therefore, $\kappa$ measures the proportion of agreement obtained over and above that expected by chance alone and is, in a sense, independent of the proportion of masters and non-masters in a particular group (Hambleton & Eignor, 1979; Swaminathan et al., 1974).

A limitation of $\kappa$, as well as of $p_o$, is that their computation requires two test administrations. Since obtaining data on a parallel test or a retest is not always feasible, an index of classification consistency estimated from a single test administration is definitely needed.

Subkoviak. Subkoviak (1976) offered a single test administration estimate of $p_o$. He first defined the coefficient of agreement for person $i$, $p_o^{(i)}$,
as the probability of $i$ being placed in the same mastery state on two parallel tests:

$$p_o^{(i)} = P(X_i \ge C_X,\; X'_i \ge C_X) + P(X_i < C_X,\; X'_i < C_X)$$

where $X$ and $X'$ represent the two test administrations. The first term on the right of the equation denotes the joint probability of person $i$ being consistently classified as a master, and the second term represents the joint probability of a consistent non-mastery decision. Subkoviak then defined the coefficient of agreement ($p_o$) for a group of $N$ examinees as the mean of the individual $p_o^{(i)}$:

$$p_o = \sum_{i=1}^{N} p_o^{(i)} \Big/ N$$

To obtain estimates of $p_o^{(i)}$ from a single test administration, Subkoviak assumed: (1) scores on the two tests were independent for a fixed examinee; and (2) given an individual's true score, the conditional obtained score distributions on both tests were identically binomial. These assumptions led to the following equation for $p_o^{(i)}$:

$$p_o^{(i)} = \left[P(X_i \ge C_X)\right]^2 + \left[1 - P(X_i \ge C_X)\right]^2$$

where

$$P(X_i \ge C_X) = \sum_{X_i = C_X}^{n} \binom{n}{X_i} P_i^{X_i} (1 - P_i)^{n - X_i}$$

In the latter equation, $P_i$ denotes an individual's true probability of obtaining a correct item response, $n$ equals the number of test items, and $X_i$ represents individual $i$'s obtained score. Once $P(X_i \ge C_X)$ has been calculated from the data obtained on one test administration, both $p_o^{(i)}$ and $p_o$ can be easily computed.

The key to determining $P(X_i \ge C_X)$ is estimating $P_i$. One could choose the maximum likelihood estimate, which equals $X_i / n$ (Subkoviak, 1976). However, the standard error of this estimate is $\sqrt{P_i (1 - P_i) / n}$, which is relatively large when $n \le 40$ (Subkoviak, 1976). Due to this limitation, Subkoviak (1976) recommended using a regression estimate of $P_i$ when the observed scores approximately follow a negative hypergeometric unimodal distribution. Specifically, he proposed the following equation:

$$\hat{P}_i = \left[\alpha_{21}\,(X_i / n)\right] + \left[\;\cdots\;\right]$$

[…]

This formulation was derived by defining the base rate for mastery classification as the average probability (taken across examinees) of being designated a master.

Marshall and Haertel.
Marshall and Haertel (1975) also proposed a single administration estimate of $p_o$, known as coefficient beta ($\beta$). Their coefficient equals "the mean of all possible split-half coefficients of agreement" and is, consequently, analogous to coefficient alpha (Marshall & Haertel, 1975, p. 3).

To derive $\beta$, scores on a hypothetical $2n$-item test must first be simulated from examinees' scores on an $n$-item test. Using the binomial error model, this simulation is accomplished via the following equation:

$$f_W = \sum_{x=0}^{n} f_x \binom{2n}{W} (x/n)^{W} (1 - x/n)^{2n - W}$$

where $f_x$ denotes the frequency of score $x$ on the $n$-item test, and $f_W$ equals the frequency of score $W$ on a $2n$-item test. Using these simulated scores, Marshall and Haertel define $\beta$ for an $n$-item test as:

$$\beta = \frac{1}{\nu} \sum_{q=1}^{\nu} p_o^{(q)}$$

where $p_o^{(q)}$ is the proportion of agreement between two split-half tests of $n$ items each, and $\nu$ is the number of possible splits which can be obtained from the $2n$-item test. The latter quantity equals $\binom{2n}{n}$. Marshall and Haertel's computational formula for $\beta$ is:

$$\beta = \frac{1}{N} \left[ \sum_{W=0}^{C_X - 1} f_W + \sum_{W=C_X}^{2C_X - 2} f_W\, Q_W(W - C_X + 1,\; C_X - 1) + \sum_{W=2C_X}^{n + C_X - 1} f_W\, Q_W(C_X,\; W - C_X) + \sum_{W=n + C_X}^{2n} f_W \right]$$

where

$N$ = number of examinees
$W$ = examinee's score on a $2n$-item test
$n$ = number of test items
$f_W$ = frequency of score $W$
$C_X$ = cut-off score on an $n$-item test
$Q_W(a, b) = \sum_{j=a}^{b} \binom{W}{j} \binom{2n - W}{n - j} \Big/ \binom{2n}{n}$, or the proportion of splits resulting in a half-test score of from $a$ to $b$ inclusive, given a total score of $W$.

As can be seen, $\beta$ is the mean of its additive parts and, therefore, each examinee's score makes a specific contribution to beta's magnitude (Marshall, 1976). The further the score departs from the value $2C_X - 1$, the more it contributes to the size of $\beta$ (Marshall, 1976). A score equalling $2C_X - 1$ makes a zero contribution; at this particular value, the examinee must always be classified as a master on one half of the test and a non-master on the other half (Marshall, 1976).
Similar to Subkoviak's model, the validity of using the binomial error model in Marshall and Haertel's formula is questionable. However, results of a study investigating the bias of various estimates showed that $\beta$ produced quite accurate estimates of $p_o$ when items were not homogeneous, particularly for longer tests ($n = 30$, $n = 50$) (Subkoviak, 1978a).

One drawback of this model, as noted in a personal communication from Marshall (1980), is the use of the proportion correct score as the true score estimate in computing $f_W$. As previously mentioned, the standard error of this estimate is reasonably large when $n \le 40$. Apparently, Marshall no longer recommends this procedure (Marshall & Serlin, 1979). A regression or Bayesian estimate can easily be incorporated into the procedure. In one study, Marshall and Serlin (1979) actually used a predictive Bayesian beta model as well as other models to obtain the frequency distribution for a $2n$-item test.

Huynh. Huynh (1976) developed a single administration estimate of $p_o$ and kappa based upon Keats and Lord's beta-binomial test score model. Like Subkoviak's and Marshall and Haertel's formulations, this model assumes an examinee's scores given his/her true score follow a binomial distribution (Huynh, 1976; Keats & Lord, 1962). According to Huynh (1976), assuming similarity of item difficulty and item content (i.e., item exchangeability) is reasonable for criterion-referenced measurement because all items should measure a single trait. Moreover, his $p_o$ appears robust with respect to violation of the former assumption (Subkoviak, 1978a). Specifically, violation of this assumption resulted in slightly conservative estimates of reliability for a 10-item test and had little effect on longer tests (Subkoviak, 1978a).

The Keats-Lord model also assumes true scores follow a beta distribution.
The beta distribution family includes a wide range of shapes although multi-humped distributions are not included (except for a U-shaped function where the modes occur at 0 and $n$). The parameters of the beta distribution, $\alpha$ and $\beta$, can be computed from the mean and standard deviation of a large sample score distribution:

$$\alpha = \left(-1 + \frac{1}{\alpha_{21}}\right)\mu, \qquad \beta = -\alpha + \frac{n}{\alpha_{21}} - n$$

where

$$\alpha_{21} = \text{KR-21} = \frac{n}{n-1}\left[1 - \frac{\mu(n - \mu)}{n\sigma^2}\right]$$

(Huynh, 1976). Under the beta-binomial model, the observed score distribution has a negative hypergeometric distribution with the following density:

$$f(x) = \binom{n}{x} \frac{B(\alpha + x,\; n + \beta - x)}{B(\alpha, \beta)}$$

where $B$ denotes the beta function (Huynh, 1976). Huynh (1976) has provided computational formulas for evaluating $f(x)$.

Estimating $p_o$ and kappa also requires determining the joint distribution of equivalent test forms, $f(x, y)$. Assuming local independence with respect to the true score, $f(x, y)$ can be simulated. This distribution follows a bivariate negative hypergeometric or beta-binomial distribution with the following density:

$$f(x, y) = \binom{n}{x} \binom{n}{y} \frac{B(\alpha + x + y,\; 2n + \beta - x - y)}{B(\alpha, \beta)}$$

(Huynh, 1976). Huynh (1976) also presented computational formulas for $f(x, y)$. Given a particular cut-off score, these formulas can be used to calculate the proportion of examinees who would be placed in the mastery category on both test forms ($p_{11}$), the proportion who would be consistently classified as non-masters ($p_{00}$), and the proportion who would be given mastery status by only one form ($p_1$). These proportions are defined in the following manner:

$$p_{11} = \sum_{x, y \ge C_X} f(x, y), \qquad p_{00} = \sum_{x, y < C_X} f(x, y), \qquad p_1 = \sum_{x \ge C_X} f(x)$$

(Huynh, 1976). Given the assumption that the marginal distribution is the same for each form, Huynh (1976) defined $p_o$ and kappa as:

$$p_o = p_{11} + p_{00}, \qquad \kappa = \frac{p_{11} - p_1^2}{p_1 - p_1^2}$$

When the cut-off score is small, the following formula for $\kappa$ is far more convenient:

$$\kappa = \frac{p_{00} - p_0^2}{p_0 - p_0^2}$$

where $p_0$ is the proportion of examinees classified as non-masters by only one test form (Huynh, 1976).
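The beta-binomial computations above can be sketched in a few lines. This is an illustration under stated assumptions, not Huynh's own computational formulas: the densities are evaluated by brute force through log-gamma functions, the function names are hypothetical, and the test length, mean, variance, and cut-off are made-up values.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_params(n, mean, var):
    """Beta parameters from the score mean and variance via KR-21."""
    kr21 = n / (n - 1) * (1.0 - mean * (n - mean) / (n * var))
    alpha = (-1.0 + 1.0 / kr21) * mean
    beta = -alpha + n / kr21 - n
    return alpha, beta


def joint_density(x, y, n, alpha, beta):
    """Bivariate beta-binomial density f(x, y) for two n-item forms."""
    log_ratio = log_beta(alpha + x + y, 2 * n + beta - x - y) - log_beta(alpha, beta)
    return comb(n, x) * comb(n, y) * exp(log_ratio)


def huynh_indices(n, cutoff, alpha, beta):
    """Single-administration estimates of p_o and kappa."""
    scores = range(n + 1)
    masters = range(cutoff, n + 1)
    p11 = sum(joint_density(x, y, n, alpha, beta) for x in masters for y in masters)
    p00 = sum(joint_density(x, y, n, alpha, beta)
              for x in range(cutoff) for y in range(cutoff))
    # Marginal proportion of masters on one form (sum of f(x, y) over y).
    p1 = sum(joint_density(x, y, n, alpha, beta) for x in masters for y in scores)
    p_o = p11 + p00
    kappa = (p11 - p1 ** 2) / (p1 - p1 ** 2)
    return p_o, kappa


# Hypothetical 10-item test: mean 7, variance 6, cut-off 8 items correct.
alpha, beta = beta_params(10, 7.0, 6.0)
print(huynh_indices(10, 8, alpha, beta))
```

Summing the joint density directly is only practical for short tests; for longer tests the normal approximation discussed next, or Huynh's recursive formulas, would be the more usual route.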
When the number of test items is moderately large (e.g., $n > 10$), Huynh (1976) suggested using a normal approximation procedure to estimate kappa. In this procedure, an arcsine transformation is applied to the data, resulting in an approximately normal score distribution. Univariate and bivariate normal distribution tables are then used to estimate the probabilities needed for computing $\kappa$.

Peng and Subkoviak (1980) found that, in the vast majority of the simulated distributions, a simple normal approximation procedure using Yates' correction resulted in less proportionate error in estimating $\kappa$ than did Huynh's normal approximation procedure. Peng varied the beta distribution parameters, the cut-off score, and the test length. The upper limit of the latter variable was 30. Using real data, Peng (1979) corroborated his findings. The superiority of the simple normal procedure was more pronounced for short tests and/or moderate cut-off scores (between 65% and 85%). Similar results were obtained when the two normal approximation procedures were used to estimate $p_o$ (Peng, 1979; Peng & Subkoviak, 1980).

Characteristics of Threshold Loss Indices. As can be seen, the most appropriate threshold loss coefficients are divided into two categories: (1) $p_o$ coefficients; and (2) kappa coefficients. Because the former indices do not take account of chance agreement while the latter ones do, various population and test characteristics affect $p_o$ and kappa differently. Since research has shown these factors affect dual and single administration coefficients similarly, the following discussion applies to both unless otherwise stated.

First, under the assumption of exchangeability, the theoretical lower limit of $p_o$ is the proportion of agreement expected by chance, while kappa's limit is zero (Huynh, 1978; Subkoviak, 1978b). In general, however, the lower limit of kappa, computed from two test administrations, depends upon the marginal distributions (Cohen, 1960).
The upper limit of both coefficients is +1.00.

Second, as the cut-off approaches the extremes, $p_o$ generally approaches one (Marshall, 1976; Marshall & Haertel, 1975; Subkoviak, 1976, 1977). This trend is particularly evident for symmetric unimodal distributions (Marshall & Haertel, 1975). On the other hand, kappa generally approaches its lowest value as the cut-off moves toward the distribution extremes (Huynh, 1976; Subkoviak, 1977). This difference can be partly explained by the fact that the probability of chance consistency generally tends toward one as the cut-off approaches the extremes (Huynh, 1976). Therefore, $p_o$ also approaches one, while kappa decreases because not much opportunity exists for increasing agreement above chance (Huynh, 1976).

Third, the magnitude of $p_o$ has been found to increase as the distance between the cut-off and areas of heavy score density (e.g., the mode) increases (Eignor & Hambleton, 1979; Marshall, 1976; Subkoviak, 1976, 1977). Given $r_{11} < 1.00$, examinees scoring close to the cut-off on the first test administration could easily obtain a score on the opposite side of the cut-off on the second administration. On the other hand, those further away from the cut-off would more likely be placed in the same mastery state in both testing sessions. Therefore, the greater the number of scores further away from the cut-off, the higher the $p_o$. Exceptions to this relationship have been found for the single administration coefficients (Marshall & Serlin, 1979). Marshall and Serlin (1979) examined the behavior of these coefficients given five different distributions: (1) bell-shaped; (2) highly negatively skewed unimodal; (3) bimodal with a stronger mode at the higher end; (4) symmetric bimodal with modes widely separated; and (5) symmetric bimodal with modes close together.
With the exception of the fifth distribution, the size of Subkoviak's p̂_o generally reflected the distance between the cut-off and the mode for both unimodal and bimodal distributions. Fortunately, the fifth distribution is atypical in mastery testing (Marshall & Serlin, 1979). Huynh's p̂_o reflected the cut-off's position for unimodal distributions and bimodal distributions with extreme modes, but not for bimodal distributions not belonging to the beta-binomial family. For Marshall and Haertel's index, five different test score models were used to simulate scores on a 2n-item test from scores on an n-item test. The adequacy of their p̂_o in reflecting the cut-off's relative position depended upon the model used to generate scores. One of the best models was a binomial regression model comparable to that used in Subkoviak's index. This model produced results similar to those obtained with Subkoviak's p̂_o. An averaged double binomial model introduced by Marshall and Serlin also reflected the location of the mode(s) for both unimodal and bimodal distributions.

In contrast, given the assumption of exchangeability, Huynh (1978) mathematically proved that kappa is an inverted U function of the cut-off when the data are normally distributed. This relationship was also empirically supported for normally distributed data as well as for various beta-binomial and some bimodal distributions (Eignor & Hambleton, 1979; Huynh, 1976, 1978; Marshall & Serlin, 1979; Subkoviak, 1977). Apparently, the location of the cut-off relative to the score density affects kappa in a manner opposite to its effect on p_o, i.e., kappa is greater when the cut-off is located near heavy score density areas. Intuitively, one might expect kappa to behave similarly to p_o. The difference appears to be due once again to the influence of chance agreement. Specifically, in many distributions, p_c decreases as the cut-off approaches heavy score density areas, leading to a decrease in p_o.
However, kappa increases because more opportunity exists for agreement above that expected by chance.

Generally, the cut-off score appears to affect the magnitude of p_o and kappa in two ways, i.e., through its position relative to the extremes and to the heavy score density areas. Conceivably, these two influences could interact, producing some unpredictable results. For example, what would happen to the size of p_o and kappa if the cut-off and the mode were both equal to n? Marshall (1976) used this interactional effect to explain the unpredictable relationships found between the cut-off and his coefficient. This effect probably also explains some unforeseen trends Eignor and Hambleton (1979) found with kappa.

Fourth, p_o does not require score variability to attain its upper limit but kappa does (Kane & Brennan, 1977). However, both coefficients increase as the variance increases (Huynh, 1976; Marshall, 1976; Swaminathan et al., 1974). A large variance implies extreme scores and, consequently, better differentiation between masters and non-masters (Marshall, 1976).

Fifth, although all the aforementioned variables affect p_o and kappa differently, the test length and the classical reliability coefficients affect them similarly. Specifically, as the number of test items increases, p_o and kappa increase (Eignor & Hambleton, 1979; Huynh, 1976, 1978; Marshall, 1976; Marshall & Haertel, 1975; Subkoviak, 1978b; Swaminathan et al., 1974). Increasing the test length probably results in a more accurate true score estimate and, consequently, a more reliable estimate of an examinee's mastery state. Correspondingly, as the classical reliability coefficient increases so should p_o and kappa. Marshall (1976) found the mean of his coefficient taken over various cut-off scores was highly correlated with KR-21 across several distributions (r = .93).
Given parallel tests, dual administration kappa was mathematically and empirically shown to increase as the classical reliability coefficient increased for a normal distribution and a beta-binomial model, respectively (Huynh, 1978). In addition, Downing and Mehrens (1978) found that Huynh's single administration kappa coefficient correlated .96 and .98 with KR-20 and KR-21, respectively. On the other hand, Algina and Noe's results (1978) did not support a relationship between Subkoviak's p̂_o and a classical coefficient.

Synthesis

In the foregoing discussion, which coefficient to use in a particular mastery testing situation was not delineated. The present section addresses this issue by synthesizing the previous material and determining the major distinctions among the various coefficients.

Using the concept of agreement functions, Kane and Brennan (1977) provided a single consistent framework for viewing the reliability coefficients. As explained by Kane and Brennan (1977), an agreement function denotes the extent of agreement between the interpretation of examinees' scores on randomly parallel tests. For mastery measurement, coefficients are based upon either a squared-error (with respect to the cut-off) or a threshold agreement function corresponding to the squared-error and threshold loss functions previously discussed. Kane and Brennan showed that the indices equal either the proportion of maximum agreement achieved by the measurement procedure or the proportion of maximum agreement achieved over and above that expected by chance. Maximum agreement is the expected agreement between a testing procedure and itself, while the agreement produced by the measurement procedure is the expected value of the agreement function. Figure 2 presents the major single test administration reliability coefficients within their appropriate categories, formed by crossing type of agreement function with the presence of a chance agreement correction.
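As an illustration of the threshold agreement coefficients discussed above, the following Python sketch computes p_o, the chance agreement p_c, and kappa from two test administrations. It assumes Cohen's (1960) definition, kappa = (p_o - p_c)/(1 - p_c), and treats a score at or above the cut-off as mastery. The function name and the example scores are hypothetical and are not drawn from any of the studies cited here.

```python
def threshold_agreement(scores_a, scores_b, cutoff):
    """Return (p_o, p_c, kappa) for mastery decisions on two administrations.

    Assumption: an examinee is classified a master when score >= cutoff.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(scores_a)
    master_a = [s >= cutoff for s in scores_a]
    master_b = [s >= cutoff for s in scores_b]

    # Observed proportion of consistent mastery classifications.
    p_o = sum(a == b for a, b in zip(master_a, master_b)) / n

    # Chance agreement from the marginal proportions of masters.
    prop_a = sum(master_a) / n
    prop_b = sum(master_b) / n
    p_c = prop_a * prop_b + (1 - prop_a) * (1 - prop_b)

    # Kappa is undefined when chance agreement is already perfect.
    kappa = (p_o - p_c) / (1 - p_c) if p_c < 1.0 else float("nan")
    return p_o, p_c, kappa


# Hypothetical scores for ten examinees on two 10-item tests.
first = [9, 8, 7, 4, 3, 9, 6, 8, 2, 7]
second = [8, 9, 6, 5, 3, 9, 7, 7, 3, 8]
print(threshold_agreement(first, second, cutoff=7))
print(threshold_agreement(first, second, cutoff=1))
```

With the extreme cut-off of 1 every examinee is classified a master on both occasions, so p_o and p_c both equal one and kappa cannot be computed. This mirrors the earlier observation that chance consistency tends toward one as the cut-off approaches the extremes.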
                                   Chance Agreement
Type of Agreement
Function             Uncorrected                     Corrected

Squared Error        Livingston's K²(X, T_x)         Brennan & Kane's Φ
                     Brennan & Kane's Φ(λ)

Threshold            Subkoviak's p̂_o                 Subkoviak's kappa
                     Marshall & Haertel's p̂_o        Huynh's kappa
                     Huynh's p̂_o

Figure 2.--Mastery Testing Reliability Formulations

One must first decide whether to use squared error or threshold agreement coefficients (Kane & Brennan, 1977). Since the former coefficients are concerned with the extent of deviation from the cut-off, their size reflects the magnitude of errors (Brennan & Kane, 1977a). In other words, they do not consider all inconsistent classifications or misclassifications to be equally serious, but assume that misclassifying an examinee whose true ability level is far from the cut-off is much more serious than misclassifying someone whose true ability is close to the cut-off (Brennan & Kane, 1977a). This advantage is particularly compelling since cut-off scores are, to some extent, arbitrarily determined and, therefore, a sharp distinction between masters and non-masters seldom exists (Brennan & Kane, 1977a; Glass, 1978). Furthermore, different procedures for setting cut-off scores result in different cut-offs (Brennan & Lockwood, 1979). However, a drawback of these coefficients is their sensitivity to all errors, even those not resulting in inconsistent mastery decisions (Brennan & Kane, 1977a). On the other hand, threshold agreement indices do not reflect the magnitude of errors but are only sensitive to errors resulting in misclassification (Brennan & Kane, 1977a). The disadvantage of these coefficients is that they consider all misclassifications to be equally serious (Brennan & Kane, 1977a).

Clearly, neither the squared error nor the threshold agreement coefficients are optimal in every situation.
Kane and Brennan (1977) suggested the following course of action:

     The threshold agreement coefficient is appropriate whenever the only distinction that can be made usefully is a qualitative distinction between masters and non-masters. If, however, different degrees of mastery and non-mastery exist to an appreciable extent, the threshold agreement function is not appropriate because it ignores such differences (p. 40).

Since reliability is relative to the score interpretation, the appropriate agreement function should be dictated by the way the scores will be used (Popham & Husek, 1969; Subkoviak, 1978b). If the degree of mastery or non-mastery is of interest, coefficients incorporating a squared-error agreement function are more suitable (Subkoviak, 1978b). This situation occurs when different actions or programs are to be initiated based on how far from the cut-off an examinee scores and/or when distance from the cut-off leads to unequal misclassification losses (Brennan & Kane, 1977a; Popham & Husek, 1969). When only two courses of action are possible and misclassification losses are considered equal, threshold agreement coefficients should be applied (Brennan & Kane, 1977a). Likewise, if there exist more than two mastery categories and no differential misclassification loss related to distance, threshold agreement indices can be used. However, Kane and Brennan (1977) stated that threshold agreement coefficients are inappropriate when more than two mastery classifications exist and these categories are ordered. Addressing the ordered case, Goodman and Kruskal (1954) proposed two other measures which account for how different an individual's mastery classification on two test administrations is. No single administration index of these ordered coefficients has been formally developed. However, it seems the single administration threshold agreement indices could easily be adapted to this purpose.
The next decision one must face is whether or not to use a coefficient accounting for chance agreement. Differentiating between corrected and uncorrected coefficients is important because they provide different kinds of information about reliability (Kane & Brennan, 1977; Subkoviak, 1978b). The uncorrected squared-error and threshold agreement indices indicate the reliability of the deviation scores and the mastery classifications, respectively, i.e., the consistency of the score interpretation (Kane & Brennan, 1977). Both chance agreement and the consistency contributed by the testing procedure affect the value of these coefficients (Kane & Brennan, 1977). In comparison, corrected coefficients measure only the latter source of consistency, i.e., the contribution of the testing procedure to the reliability of scores over and above that expected by chance (Kane & Brennan, 1977). Clearly, the choice between corrected and uncorrected coefficients depends upon whether one wants to determine the consistency of scores regardless of the causes of this consistency (i.e., test procedure, group composition, group's mean ability) or the reliability of the testing procedure irrespective of the group's characteristic ability or mastery level (Subkoviak, 1977).

In discussing threshold loss indices, Livingston and Wingersky (1979) and Berk (1980) do not recommend using the corrected coefficient, kappa, in situations where an absolute cut-off has been established because the correction for chance takes the marginal frequencies as given. As stated by Livingston and Wingersky (1979):

     Applying such a correction to a pass/fail contingency table is equivalent to assuming that the proportion of examinees passing the test could not have been anything but what it happened to be (p. 250).

However, the present author fails to see how this fact differentiates kappa from any other reliability estimate which uses sample statistics (e.g., the sample mean) as estimates of population values.
The corrected indices, coefficient kappa and Φ, could be criticized because they approach or equal zero when little or no true mastery score variability exists (i.e., when everyone is placed in the same mastery state or receives the same domain score, respectively) even though the scores may be perfectly reliable (Berk, 1980). However, this criticism is unwarranted. These coefficients' low values in the presence of small variability do not indicate that the mastery scores are unreliable, but simply that the testing procedure does not add much more reliability to the scores above that achieved by chance processes (Kane & Brennan, 1977). In other words, a testing procedure resulting in some sort of criterion-referenced score interpretation must produce variability in terms of those scores if the procedure is going to contribute to reliability (Kane & Brennan, 1977). On the other hand, the uncorrected coefficients can be large even when no true score variability exists because of the score consistency contributed by chance processes.

These observations provide a new perspective on Popham and Husek's disagreement with Woodson over the variability issue (Kane & Brennan, 1977). To reiterate, Popham and Husek contended that variability is not a necessary characteristic of a good criterion-referenced test, while Woodson argued that a test with no variability provides no information. It appears that Popham and Husek's argument applies to the score interpretation, while Woodson's argument applies to the test's contribution to this interpretation (Kane & Brennan, 1977).

As previously discussed, the four types of coefficients depicted in Figure 2 react differently to the relative position of the cut-off. Obviously, the cut-off's location does not affect the corrected squared error coefficient. However, the uncorrected squared error indices are sensitive to the distance between the mean and the cut-off; they increase as this distance increases.
The p_o and kappa indices are generally not expected to be sensitive to this distance unless the mean reflects heavy score density areas. On the other hand, squared error indices are not sensitive to the distance between the cut-off and the mode or heavy score density areas, while uncorrected threshold indices are hypothesized to increase as this distance increases. In contrast, the corrected threshold indices appear to be greater when the cut-off is located in heavy score density areas. For example, when scores are normally distributed, a U function characterizes the relationship between the cut-off score and p_o, while kappa is an inverted U function of the cut-off. Similar to p_o, the uncorrected squared error indices are also a U function of the cut-off score given a normal distribution since the mean equals the mode (Marshall, 1976). However, when the distribution is skewed and/or bimodal, uncorrected squared error coefficients will increase while uncorrected threshold indices will decrease as the cut-off moves from the mean toward the mode(s). Correspondingly, for bimodal distributions, Marshall (1976) found that the magnitude of his p̂_o and K²(X, T_x) did not fluctuate similarly as the cut-off score varied. This observation is particularly relevant in mastery measurement since the score distribution on any given test administration is often bimodal and, in some cases, is expected to be skewed (e.g., after an instructional program) (Marshall, 1976; Marshall & Serlin, 1979).

Although the list of applicable coefficients can be reduced by choosing an appropriate agreement function and deciding whether or not to correct for chance processes, one must still select among alternative formulas in many cases. The choice of an appropriate index in these instances depends upon the number of feasible test administrations, the satisfaction of the assumptions underlying a particular index, the coefficient's robustness to violations of these assumptions, the coefficient's bias in estimating the dual administration population index, and the degree of sampling fluctuation exhibited by the coefficient. In most situations, two test administrations are not possible and, therefore, the applicable coefficients are typically those requiring only one test administration.

If one has decided to use an uncorrected squared error coefficient, one can choose Livingston's K²(X, T_x) and/or Brennan and Kane's Φ(λ). A major difference between these indices is that K²(X, T_x) is based upon classical test theory, while Φ(λ) is derived from generalizability theory (Brennan, 1979). The latter theory has two distinct advantages over the former. First, generalizability theory provides the opportunity to examine the reliability of data derived from different types of experimental designs, e.g., nested design (Brennan, 1978). This theory also allows one to take account of whether the various effects are fixed or random (Brennan, 1978). Second, generalizability theory can differentiate norm- from criterion-referenced measurement by distinguishing between different error variances, while classical test theory cannot (Brennan, 1979). Specifically, Brennan and Kane's approach indicates that σ²_pI is the appropriate error term for norm-referenced measurement, while σ²_pI + σ²_I is the proper error variance in criterion-referenced measurement (Brennan, 1979). Clearly, the classically parallel test assumption obviates the existence of σ²_I. Generalizability theory assumes tests are randomly parallel.
Brennan (1979) finds the classically parallel test assumption unreasonable for criterion-referenced testing since the test construction method does not require content specialists to include only items with the same difficulty level in the domain. If, as expected, the items in a domain have various difficulty levels, it would be very unlikely for all tests constructed from this domain to be classically parallel (Brennan, 1979). Furthermore, since K²(X, T_x) equals Φ(λ) when test means are equal, K²(X, T_x) is really a special case of Φ(λ). For these reasons, the more general Φ(λ) appears preferable to K²(X, T_x) (Brennan, 1979). Unfortunately, no empirical research concerning the bias and sampling fluctuation of these coefficients exists.

When considering uncorrected threshold loss indices, the appropriateness of several alternative p_o formulas must be evaluated. If feasible, p_o can, of course, be estimated from two test administrations. The dual administration p_o is unbiased and its standard error equals [p_o(1 - p_o)/N]^(1/2) (Huynh & Saunders, 1979). Generally, formulas for evaluating the standard error of the single administration p_o estimates have not been developed and very little empirical evidence pertaining to their bias and standard error has been produced. However, assuming a beta-binomial score distribution, Huynh (1978) showed that his p̂_o index is asymptotically unbiased and also presented a formula for the asymptotic standard error of this estimate. In addition, Huynh and Saunders (1979) found that Huynh's p̂_o generally underestimated the dual administration p_o for large data sets not conforming to the beta-binomial model as well as for small and moderate sized samples (N = 20, 40, 60). In the former case, the average amount of bias was -2.3% across various test lengths and cut-off scores. For the small and moderate sized samples, the average degree of bias was -2.6% across various test lengths.
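To give a sense of scale for the standard error of the dual administration p_o mentioned above, the following Python sketch evaluates the usual binomial expression, the square root of p_o(1 - p_o)/N, where N is the number of examinees. The function name is hypothetical, and the value of .80 for p_o is chosen only for illustration. The sample sizes are the small and moderate ones reported by Huynh and Saunders (1979).

```python
import math


def se_dual_po(p_o, n_examinees):
    """Standard error of a dual administration p_o: sqrt(p_o * (1 - p_o) / N)."""
    if not 0.0 <= p_o <= 1.0:
        raise ValueError("p_o must lie between 0 and 1")
    if n_examinees <= 0:
        raise ValueError("n_examinees must be positive")
    return math.sqrt(p_o * (1.0 - p_o) / n_examinees)


# Illustrative value of p_o; not taken from any cited study.
for n in (20, 40, 60):
    print(n, round(se_dual_po(0.80, n), 4))
```

Because the standard error shrinks only with the square root of N, tripling the sample from 20 to 60 examinees reduces it by a factor of about 1.7, which is one reason the smaller sampling variability of the single administration estimates is of practical interest.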
Assuming a large sample size, Huynh and Saunders (1979) also compared the standard error of the dual administration p_o to that of Huynh's estimate for various beta-binomial distributions, test lengths, and cut-off scores. The mean and KR-21 of the distributions were chosen to reflect one of the following shapes: (1) U-shaped with the higher mode at the upper end of the distribution; (2) symmetric; (3) unimodal with the mode lying between 0 and n; and (4) J-shaped. In every instance, the standard error of Huynh's estimate was lower than its dual administration counterpart. On the average, the standard error of the former was 59.3% of the latter. The uniformly smaller standard error of Huynh's p̂_o was also found for large sample distributions significantly different from the beta-binomial in several instances and for small to moderate sized samples (Huynh & Saunders, 1979). Over all the situations considered, the standard error of Huynh's p̂_o was 50.4% and 51.4% of that of the dual administration estimate, respectively.

Only one study correctly compared the bias and standard error of all the p_o estimates (including the dual administration p_o) (Subkoviak, 1978a). In this study, Subkoviak's coefficient was based upon a compound binomial instead of a binomial model, and the proportion correct score was used as the true score estimate in Marshall's p̂_o. Each estimate was computed for 50 random samples of 30 students each and compared to the dual administration p_o obtained in the population (N = 1586). Comparisons were made for three test lengths (10, 30, 50) and four cut-off scores (.5n, .6n, .7n, .8n). The mean and standard deviation of each estimate across the 50 samples provided the necessary data for judging the estimate's bias and standard error. All the estimates became more accurate as the test length and the distance between the mean and the cut-off increased. Moreover, the estimates' standard errors decreased.
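The way bias and standard error are judged from replicated samples in a design of this kind can be sketched in a few lines of Python. The sketch assumes that bias is taken as the mean of the sample estimates minus the population value and that the standard error is the standard deviation of the estimates across samples. Whether the cited study divided by the number of samples or by one less is not stated, so the n - 1 form used here is an assumption. The function name and the five example estimates are hypothetical.

```python
import statistics


def bias_and_standard_error(estimates, population_value):
    """Summarize replicated estimates of a coefficient.

    bias           = mean of the estimates minus the population value
    standard error = standard deviation of the estimates across samples
                     (n - 1 denominator; an assumption, see lead-in)
    """
    if len(estimates) < 2:
        raise ValueError("at least two replicated estimates are required")
    bias = statistics.mean(estimates) - population_value
    standard_error = statistics.stdev(estimates)
    return bias, standard_error


# Hypothetical estimates of p_o from five samples; population p_o = .81.
bias, se = bias_and_standard_error([0.78, 0.82, 0.80, 0.76, 0.84], 0.81)
print(round(bias, 4), round(se, 4))
```

A negative bias indicates that the estimate tends to fall below the population coefficient, as was reported for Huynh's estimate with short tests.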
The influence of the distance between the mean and the cut-off can be partly explained by the fact that estimates become more accurate and less variable as the population parameter becomes more extreme (Subkoviak, 1978). Corresponding to Huynh and Saunders' results (1979), the dual administration estimate was unbiased but had the largest standard error regardless of the test length and cut-off score. Huynh's p̂_o underestimated p_o for short tests and was, generally, less variable than the other indices for 30- and 50-item tests. Marshall and Haertel's p̂_o was biased upward when the cut-off was near the mean and biased downward when the cut-off was in the tails of the distribution. This effect was more pronounced for shorter tests. Conversely, for short tests, Subkoviak's p̂_o underestimated p_o when the cut-off was near the mean and overestimated p_o for more extreme cut-offs. This finding was similar to that found by Algina and Noe (1978). The opposite reaction of Marshall's and Subkoviak's indices may have been due to the use of different true score estimates. Specifically, Marshall and Haertel's use of the proportion correct score produces an overestimate of the true score variance, while Subkoviak's regression true score estimate results in an underestimate of this variance (Algina & Noe, 1978). It should be noted that Subkoviak's p̂_o showed no consistent pattern for longer tests. Finally, Marshall and Haertel's index was the least variable but the most biased for n = 10. Except in this latter case, none of the four coefficients was substantially biased.

In evaluating which single administration p_o estimate to apply, the assumptions underlying each of them should be examined. All assume the distribution of an examinee's test scores given his/her true score is binomial. Recognizing that the equal item difficulty assumption might be unrealistic, Subkoviak (1976) proposed using the compound binomial instead of the binomial model.
However, whether or not this more complicated procedure improves estimation of p_o is highly questionable. Use of the compound binomial in Subkoviak's p̂_o generally produced results similar to those obtained using the binomial model (Marshall & Serlin, 1979). Furthermore, Huynh and Saunders (1979) found the standard deviation of item difficulties was not related to the degree of bias associated with Huynh's p̂_o and Huynh's kappa estimate, and Subkoviak (1978a) provided evidence that all three coefficients are robust with respect to violation of the equal item difficulty assumption.

Another assumption implicit in all three single administration coefficients is classic parallelism (Kane & Brennan, 1977). The validity of this assumption in criterion-referenced testing has already been questioned. When tests are not classically parallel, these coefficients will probably overestimate p_o. To the author's knowledge, no empirical evidence addressing this question exists. Those few studies examining the bias of one or more of these estimates included only parallel tests (for example, Subkoviak, 1978a).

Huynh and Saunders (1979) noted that Subkoviak's procedure and Huynh's procedure assume the score distribution is beta-binomial. Therefore, they should have similar patterns of bias and standard error. Huynh and Saunders (1979) concluded that such was the case in Subkoviak's investigation (1978a). Although not explicitly stated, Subkoviak's study of bias appears to have been performed on data following a normal distribution. The normal distribution is not a member of the beta-binomial family, although this family does include a "normally" shaped distribution (Gross & Shulman, 1980). The bias and standard error of these estimates have not been investigated for distributions more typically found in criterion-referenced measurement, i.e., skewed and bimodal (Marshall, 1976; Marshall & Serlin, 1979).
Examining these coefficients given the latter distribution would be particularly interesting because the beta-binomial family does not include bimodal distributions, except for U-shaped and J-shaped functions (Gross & Shulman, 1980). Neither of these distributions is expected to occur in the real world (Marshall & Serlin, 1979). Subkoviak (1976, 1978a) has stated that using a single regression equation to estimate the true score in his procedure is inappropriate given a bimodal distribution and has recommended using Huynh's procedure. However, Marshall and Serlin (1979) found that the magnitude of Huynh's p̂_o did not reflect the location of the modes for bimodal distributions, while Subkoviak's p̂_o reflected the mode(s) for both unimodal and bimodal distributions. Although not explicitly stated, the researchers appear to have used a single regression equation to obtain Subkoviak's true score estimate for the bimodal as well as the unimodal distributions.

Gross and Shulman (1980) investigated the robustness of the beta-binomial model; they compared empirical values of p_o obtained from two test administrations to the theoretical values of p_o derived from the beta-binomial model when its underlying assumptions were violated. They found that the theoretical and empirical values were in close agreement. However, the authors did not indicate the shape of the score distribution nor how severely the assumptions were violated.

One of the most enlightening findings concerning the p_o estimates evolved from Marshall and Serlin's study (1979). They used five versions of Marshall and Haertel's p̂_o varying in terms of the model used to simulate scores on a 2n-item test. They found that Huynh's p̂_o and Subkoviak's p̂_o were empirically equivalent to Marshall and Haertel's estimate when the assumptions of the former indices were applied to the latter coefficient.
Specifically, when the Keats and Lord beta-binomial model was used to simulate scores on a 2n-item test for Marshall's p̂_o, this index was equal to Huynh's p̂_o in each of 300 cases. Similarly, when a binomial regression model was used to simulate scores, Marshall's and Subkoviak's indices were equal. In summary, Marshall's p̂_o appears to be a general index subsuming the other two coefficients and is equal to them when the data are postulated to meet certain assumptions (Marshall & Serlin, 1979). Therefore, a choice among the three coefficients seems reduced to choosing among various test models rather than among three entirely different coefficients (Marshall & Serlin, 1979). Clearly, much more empirical research is needed to choose which test model results in the least bias and standard error given a particular type of distribution.

Finally, if the situation demands a corrected threshold agreement index, one can use a dual test administration kappa estimate, Subkoviak's model, and/or Huynh's procedure. The dual administration kappa estimate is asymptotically unbiased (Huynh & Saunders, 1979). However, Huynh and Saunders (1979) found a small negative bias for both small (N = 20, 40) and moderate (N = 60) sized samples. They also presented a formula for computing this estimate's asymptotic standard error.

Given a beta-binomial distribution, Huynh (1978) showed that his single administration kappa formula is also asymptotically unbiased and presented a formula for its asymptotic standard error. For several large data sets, some of which were significantly different from a beta-binomial distribution, Huynh and Saunders (1979) found that this estimate tended to underestimate the population dual administration kappa. Across various test lengths and cut-off scores, the average percent of bias was -7.8. The same trend was found for small and moderate sized samples; across various test lengths, the average percent of bias was -11.0.
Huynh and Saunders (1979) also compared the standard error of Huynh's kappa to that of the dual administration estimate. Over various beta-binomial distributions, test lengths, and cut-off scores, the standard error of Huynh's kappa was consistently lower. On the average, it was 53.2% of the standard error of the dual administration kappa. The uniformly smaller standard error of Huynh's estimate was also found for large data sets with distributions significantly different from the beta-binomial in several instances as well as for small and moderate sized samples. On the average, the standard error of Huynh's estimate was 50.2% and 56.9% of the standard error of the dual administration coefficient, respectively.

The bias of Subkoviak's kappa has not been investigated, and no studies have compared the bias and standard error of Subkoviak's and Huynh's kappa estimates. The same issues raised under the discussion of the bias of the p_o estimates are also relevant for kappa formulations. Specifically, these coefficients' biases and standard errors need to be evaluated for various score distributions, including a bimodal, and for situations where the classic parallelism assumption is violated.

Obviously, the lack of empirical research does not allow definitive recommendations as to which coefficient to use within each cell of Figure 2 given a particular situation. In order to address some of the uninvestigated issues raised in this discussion, the current study was conducted to assess the influence of various test characteristics upon the bias and standard error associated with each major single test administration coefficient when estimating the appropriate dual test administration population coefficient.
Specifically, the effects of the following variables were examined:

     (1) violation of the classic parallelism assumption
     (2) shape of the test score distribution
     (3) test length
     (4) cut-off score
     (5) number of examinees in the sample

Those coefficients whose derivation is based upon the assumption of classically parallel tests were expected to be more biased when this assumption was violated (i.e., when the tests were randomly parallel). The shape of the test score distribution (particularly a bimodal distribution) was hypothesized to influence the bias of the threshold agreement indices because of their implicit or explicit distributional assumptions. The location of the cut-off was not expected to affect the extent of bias. Finally, a decrease in standard error was predicted as test length and sample size increased.

METHOD

Data Base

Several populations reflecting different distributional shapes were generated from data obtained from one of two sources. The first data base came from the responses of a sample of Michigan public school fourth graders to various criterion-referenced tests administered by the Michigan Educational Assessment Program (MEAP). MEAP annually collects data on fourth, seventh, and tenth grade students' attainment of various reading and mathematics objectives which address several of the minimal skills beginning students in these grades should have. Using a replicated, systematic sampling procedure, MEAP annually selects approximately 5000 students in each grade and computes each test's technical characteristics from their data (Michigan Department of Education, 1977). (In applying this sampling plan, the Michigan Department of Education (1977) randomly chooses ten numbers identifying the first member of each of ten systematic samples. A spacing factor is computed and added to each of these numbers to identify the next member of each set.
The spacing factor is repeatedly added to the previous set of numbers until the requisite sample size has been attained.) The data obtained from a sampling of 5,000 fourth grade students in the fall of 1979 served as the major population data base in this study. The second data source or population was the responses of 589 college students to a mid-term exam given in their introductory psychology course. This exam was a "norm-referenced test" and produced a distribution not commonly found with criterion-referenced tests.

Procedure

Test Characteristics

Distribution shape. Four distributions were incorporated into the study: (1) severely negatively skewed; (2) J-shaped; (3) bimodal with a bigger mode at the upper end and a lower mode not equal to zero; and (4) normal. The first distribution was believed to typify that found when a criterion-referenced test is given after an instructional or a training program (Marshall, 1976). Correspondingly, this distribution was found in the MEAP data.

The J-shaped distribution was included in the study because it was also represented in the MEAP data.

According to Marshall and Serlin (1979), a bimodal distribution is also frequently found in mastery testing situations. Setting the lower mode unequal to zero was intended to reflect the probability that a non-master would guess the correct answer to one or more questions. Marshall and Serlin (1979) contended that this distribution is much more likely to occur in mastery testing than a J- or U-shaped distribution, especially when guessing is a viable factor. In some cases, the MEAP data (considering each grade) did follow a bimodal distribution with the lower mode equal to one. However, the bimodal did not occur more often than the J-shaped distribution.

Finally, a normal distribution was included to explore the appropriateness of the reliability formulas for typical norm-referenced tests.
Note that the bimodal and the normal distributions are not members of the beta distribution family.

Test length. Test lengths of 5, 10, 15, and 20 items were examined. These test lengths typify those found for criterion-referenced tests and/or are representative of those needed to produce a high probability of accurately assigning respondents to a mastery state (Algina & Noe, 1978; Klein & Kosecoff, 1973; Marshall, 1976; Novick & Lewis, 1974). Furthermore, Berk (1980) recommended using between five and ten items per objective for most classroom decisions and between 10 and 20 items for school, system, and state level decisions.

Cut-off score. Three cut-off scores, 70%, 80%, and 90%, were employed because they are representative of those occurring in mastery measurement and/or those recommended for usage (Block, 1972; Marshall, 1976; Novick & Lewis, 1974). Block's research (1972) has shown that, to adequately effect cognitive learning and concurrently maintain interest in learning, the cut-off should be set between 80 and 85 percent. Marshall (1976) stated that one would typically use between 60 and 90 percent, and Novick and Lewis (1979) noted that the range seems to be between 70 and 85 percent in Individually Prescribed Instruction.

Given the previously specified test lengths and the integer value of test scores, specifying three test scores (advancement scores) equalling the chosen cut-off levels was not always possible. Therefore, reliabilities were only computed for those combinations of test length and cut-off score for which a test score resulting in a percentage equal to or slightly greater than the given cut-off could be specified. Figure 3 presents these combinations and each advancement score with its associated cut-off level.

Number of examinees. Sample sizes of 25, 35, and 50 were randomly selected from the population.
The first two values were chosen because they were believed to typify classroom sizes and to be illustrative of the number of people participating in various organizational training programs. A sample size of 50 was used because it has been recommended that estimation of α and β in Huynh's formulas be accomplished with N ≥ 40 for very short tests and N ≥ 2n for longer tests (Subkoviak, 1978). Finally, these three sample sizes were believed to be divergent enough to study the effects of sample size.

                         Cut-off Level
Test Length     70%            80%            90%
     5                       4/5 (80%)
    10       7/10 (70%)     8/10 (80%)     9/10 (90%)
    15      11/15 (73%)    12/15 (80%)    14/15 (93%)
    20      14/20 (70%)    16/20 (80%)    18/20 (90%)

Figure 3.--Advancement Scores for Each Combination of Test Length and Cut-off Level

Data Generation

Item Domain. The study required a domain of items from which randomly and classically parallel tests of various lengths could be drawn. Specifically, a content domain consisting of at least 40 items was needed to construct alternate forms of all possible test lengths included in this study. Since all the MEAP criterion-referenced tests consisted of five items, items had to be taken from at least eight tests, measuring different objectives, to form the domain. MEAP groups the mathematics and reading objectives into major skill areas. For example, the program includes 15 mathematics objectives tapping various aspects of numeration skill. A content analysis indicated that eight tests from the numeration skill area appeared to measure similar objectives. These 40 items were intercorrelated and subjected to a principal components analysis. The mean item intercorrelation within objectives was .36. The mean intercorrelation between items on different objectives, computed by systematically sampling correlations within the 40 x 40 correlation matrix, was .16. The principal components analysis yielded a general factor accounting for 21.4% of the variance. Ten factors had eigenvalues greater than or equal to one.
A varimax rotation indicated that, generally, items within a particular test loaded highest on the same factor and each factor was defined by the items on one particular test. In summary, the set of 40 items was more heterogeneous than what one might find for a very narrowly defined objective. However, the KR-20 was .89, indicating a fairly high internal consistency. Therefore, the researcher decided to use these items to construct the domain.

Forty students did not reach the questions in one or more of the eight MEAP tests comprising the domain and were, therefore, eliminated from the data base. Based upon 5,000 students, the p values of the 40 items ranged from .69 to .96. The mean and standard deviation of domain scores were 35.79 and 5.34, respectively.

The second data base, the psychology mid-term exam, consisted of 46 items. To increase this item domain's internal consistency, six items with low item-total correlations were eliminated. The resultant KR-20 was .68. The p values ranged from .19 to .96, and the mean and standard deviation of domain scores were 27.74 and 4.54, respectively.

Score Distributions. The reason for using two data sources in this study was to provide a population representative of each distribution under investigation. The negatively skewed, J-shaped, and bimodal distributions were based upon the MEAP data, while the normal distribution was represented by the psychology mid-term domain scores.

Similar to the majority of MEAP's criterion-referenced tests, the eight numeration tests produced negatively skewed distributions. Not surprisingly, the frequency distribution of total scores on the 40-item domain was also negatively skewed. Figure 4 presents the graph of this population distribution.

To generate the J distribution, the domain scores were inverted and merged with the original scores. The resulting distribution closely resembled a U.
Then, a new population was formed by randomly sampling 3,500 students from the original distribution (upper half of the "U") and 1,500 students from the inverted distribution (lower half of the "U"). As can be seen in Figure 5, the graph of this population closely follows a J-shape.

Figure 4.--Skewed Population Frequency Distribution of Domain Scores (frequency plotted against domain score)

Figure 5.--J-shaped Population Frequency Distribution of Domain Scores (frequency plotted against domain score)

The bimodal distribution was formed by altering the scores of a random sample of people from the negatively skewed distribution on a random sample of items. Specifically, the researcher first sampled 6% of those with scores greater than or equal to 30 and changed their scores from right to wrong on a sample of 30 items. If a student had already answered a particular question wrong, the item response was not altered. The reason for changing 30 items was to assure that the lower mode would equal the number of items expected to be answered correctly merely by guessing. This same procedure was repeated two more times with replacement of items and people occurring between each sampling procedure. If a student was selected in more than one sampling procedure, he/she was deleted from the second and/or third sample. These three samples were combined with the unaltered scores in the original distribution, producing the population frequency distribution depicted in Figure 6.

Finally, the psychology mid-term scores were duplicated five times to create enough examinees for the sampling process. The resultant domain scores of 2,945 examinees produced the approximately normal distribution shown in Figure 7. The skewness and kurtosis moments were -.30 and .15, respectively.
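As a minimal sketch of how skewness and kurtosis moments of this kind can be obtained, the following assumes moment-based definitions in which kurtosis is expressed as an excess over the normal value of three (so that a normal distribution scores zero). The function name `shape_moments` and the toy scores are hypothetical and are not taken from the study's computer package.

```python
def shape_moments(scores):
    """Return (skewness, excess kurtosis) from central moments.

    Assumes population (divide-by-N) moments; a statistical package
    may apply small-sample corrections that this sketch omits.
    """
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n  # variance
    m3 = sum((x - mean) ** 3 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # zero for a normal distribution
    return skewness, excess_kurtosis


# Hypothetical, symmetric toy scores (not study data)
toy_scores = [1, 2, 2, 3, 3, 3, 4, 4, 5]
toy_skew, toy_kurt = shape_moments(toy_scores)
```

Under these definitions, a negative skewness corresponds to a longer lower tail and a positive excess kurtosis to a more peaked distribution than the normal.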
(In the computer package used in this study, the kurtosis of a normal distribution was zero instead of three.) These statistics indicated that the distribution was slightly negatively skewed and somewhat more peaked than a normal distribution. However, the departure did not appear to be practically significant.

Figure 6.--Bimodal Population Frequency Distribution of Domain Scores (frequency plotted against domain score)

Figure 7.--Normal Population Frequency Distribution of Domain Scores (frequency plotted against domain score)

Alternate forms. Following the construction of an item domain (and the distribution manipulations), alternate parallel and randomly parallel forms were constructed for each test length. Randomly parallel five-item tests were formed by randomly sampling items from the domain without replacement. Consequently, alternate test forms did not have any items in common. The items from both forms were not replaced in the domain when longer tests were constructed. For each alternate form, tests of 10, 15, and 20 items were built by using those items found on the next shorter test and randomly sampling (without replacement) the necessary number of additional items from those remaining in the domain. For the MEAP data, the same tests were used for the skewed and J distributions. However, since these particular tests did not produce bimodal distributions, the test construction process was repeated for the bimodal score domain. The sampling procedure was also repeated for the psychology exam data.

Alternate classically parallel forms were constructed by pairing items based on their p values and item-total correlations. One item from each pair was placed in each form.
The five pairs having the most equivalent items within each pair were used to construct the five-item tests. In forming longer tests, the next closest pairs were chosen and added to those on the next shorter test. Since the p values and/or the item-total correlations were expected to change when altering the distribution shape, this process was repeated for each distribution.

Determination of Bias

For every combination of test length, cut-off score, distribution shape, and type of parallelism, population values of p_o, kappa, and Livingston's k²(X, T_X) were computed from two test administrations. Brennan and Kane's Φ(λ) and Φ were also computed in every condition except those involving classically parallel alternate forms because all items in the domain would have to have equal p values to meet this assumption. Moreover, if all items had equal p values, Φ(λ) would simply equal Livingston's k²(X, T_X) and Φ would equal the generalizability coefficient (Brennan, 1978, 1979). (Note also that the value of Φ does not change as the cut-off score is altered.) In all, 320 population values were computed. Formulas for each population coefficient can be found in Figure 8.

Thirty independent random samples of 25, 35, and 50 cases were drawn with replacement from each of the four population distributions. Within each cell of the design, an estimate of each population coefficient was computed for each of the 30 samples using the appropriate single test administration coefficients. These estimates were obtained for only one alternate form. The mean of the estimates within each cell was compared to the population value to determine the magnitude and direction of bias. The standard deviation of these estimates indicated each coefficient's sampling error.
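The bias and sampling error computation described above can be sketched as follows. This is a minimal illustration rather than the study's actual program: the function name `bias_and_sampling_error` and the example numbers are hypothetical, and the use of the n - 1 denominator for the standard deviation is an assumption, since the text does not state which denominator was used.

```python
from statistics import mean, stdev


def bias_and_sampling_error(estimates, population_value):
    """Summarize one design cell.

    bias: mean of the sample estimates minus the population value
          (negative means the estimate tends to be too low).
    sampling_error: standard deviation of the sample estimates
          (n - 1 denominator assumed here).
    """
    bias = mean(estimates) - population_value
    sampling_error = stdev(estimates)
    return bias, sampling_error


# Hypothetical estimates from three samples in one cell (not study data)
example_bias, example_se = bias_and_sampling_error([0.70, 0.75, 0.80], 0.74)
```

In the study each cell would hold up to 30 estimates rather than the three shown here.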
The total design contained 240 cells (four distribution shapes, four test lengths, three cut-off scores for tests of 10, 15, and 20 items, one cut-off score for a five-item test, three sample sizes, and either classically or randomly parallel alternate forms).

One problem was encountered in sampling examinees; KR-20 and KR-21 for some samples were negative or equal to zero. Although negative reliability coefficients can be equated to zero and many of the coefficients can be computed when KR-20 equals zero, Huynh's p̂_o and kappa estimates cannot. For each case in which KR-20 or KR-21 was negative or zero, the random sampling process was repeated until other samples with positive coefficients were found.

Figure 8.--Formulas for the Criterion-Referenced Reliability Population Coefficients Computed from Alternate Forms and for the Single Test Administration Sample Estimates (columns: Coefficient, Alternate Form Population Formulas, Sample Estimate Formulas)
Figure 8 (cont'd.)

Note: The symbols used in Figure 8 denote the cut-off score expressed as the number of items answered correctly; the cut-off score expressed as the proportion of items answered correctly; the grand mean in the population of persons and the domain or universe of items; the variance over persons of their universe scores; the variance of the item mean scores; the variance due to the interaction of persons and items plus experimental error; the mean over a random sample of persons and items; the number of persons in the sample; the number of test items; the proportion of people placed in mastery states on one test administration; and the proportion of people consistently placed in mastery states in both test administrations.

Within each distribution, the same samples were used to compute estimates of the coefficients for every combination of test length and cut-off score. Furthermore, the same samples were used for estimating the reliability of randomly and classically parallel tests.

As mentioned previously, when a test had a zero or a negative KR-20 or KR-21 in a particular sample, the sample was eliminated and another one was chosen. However, only the internal consistency of five-item tests comprised of randomly chosen items was examined in determining which samples to delete. Since the classically parallel forms consisted of different items and since the same set of samples was used in both parallelism conditions, some samples retained in the set had a negative or zero KR-21 for the classically parallel form.
This problem occurred only for the normal distribution and was probably due to the relatively low internal consistency of the items in this domain. Moreover, within this distribution, the KR-21 for longer tests within both parallelism conditions was negative or zero for some of the retained samples. In those cells where this difficulty surfaced, the sample(s) was dropped from the cell. Therefore, within some cells, the mean and standard deviation were based on fewer than 30 samples. However, every cell contained at least 20 samples.

Estimation Formulas. Figure 8 presents the single test administration formulas used to estimate each population alternate form coefficient. A few formulas require some explanation. In estimating Φ(λ), Brennan and Kane (1977a) noted that (X̄ − λ)² is not an unbiased estimate of (μ − λ)². They presented an unbiased estimate of this term:

\[
(\bar{X} - \lambda)^2 - \left( \frac{\hat{\sigma}^2(p)}{N} + \frac{\hat{\sigma}^2(i)}{n} + \frac{\hat{\sigma}^2(pi,e)}{Nn} \right)
\]

where \hat{\sigma}^2(p), \hat{\sigma}^2(i), and \hat{\sigma}^2(pi,e) are the estimated variance components for persons, items, and the person-by-item interaction plus error, N is the number of persons in the sample, and n is the number of test items.

In addition, previous discussion of Brennan and Kane's indices assumed the item domain was infinite. However, the domain in this study is a finite universe. To account for this design factor, Brennan (1978) provided formulas for Φ(λ) and Φ in which a finite universe correction factor is applied to the variance components comprising these coefficients. These latter formulas, which also incorporate an unbiased estimate of (μ − λ)², were used in this study and appear in Figure 8.

For Huynh's, Subkoviak's, and Marshall's indices, the researcher assumed that an individual's test scores followed a binomial distribution given his/her true score, rather than a compound binomial model. Studies cited previously have indicated that using the binomial model for heterogeneous item difficulty values does not substantially affect the accuracy of these coefficients. Moreover, the binomial model has produced results similar to those found using the compound binomial for Subkoviak's p̂_o (Marshall & Serlin, 1979).
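The binomial error model assumed above can be illustrated with a short sketch. It treats each examinee's true proportion-correct as known, which is a simplification: the single administration coefficients must estimate it from observed scores, and the exact formulas of Figure 8 are not reproduced here. The helper names `prob_at_or_above` and `agreement_and_kappa`, and the example values, are hypothetical.

```python
from math import comb


def prob_at_or_above(pi, n_items, cutoff):
    """P(X >= cutoff) when X ~ Binomial(n_items, pi)."""
    return sum(
        comb(n_items, x) * pi ** x * (1 - pi) ** (n_items - x)
        for x in range(cutoff, n_items + 1)
    )


def agreement_and_kappa(true_props, n_items, cutoff):
    """Threshold agreement p_o and kappa over two independent administrations.

    true_props: assumed true proportion-correct value for each examinee.
    """
    p_master = [prob_at_or_above(pi, n_items, cutoff) for pi in true_props]
    # An examinee is classified consistently as a master twice
    # or as a non-master twice.
    p_o = sum(p ** 2 + (1 - p) ** 2 for p in p_master) / len(p_master)
    # Chance agreement from the (equal) marginal mastery proportion.
    p_bar = sum(p_master) / len(p_master)
    p_chance = p_bar ** 2 + (1 - p_bar) ** 2
    kappa = (p_o - p_chance) / (1 - p_chance)  # undefined if p_chance == 1
    return p_o, kappa


# Hypothetical examinees on a 10-item test with an advancement score of 8
p_o_example, kappa_example = agreement_and_kappa(
    [0.95, 0.90, 0.85, 0.50], n_items=10, cutoff=8
)
```

The sketch uses a single success probability per examinee for every item, which is what distinguishes the binomial model from the compound binomial model mentioned in the text.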
For Marshall's p̂_o, scores on a 2n-item test were simulated via a binomial regression model. Specifically, a linear regression was used to predict true score from obtained score and the predicted true score was used in a binomial error model to estimate the frequency distribution of a 2n-item test. As noted previously, Marshall and Serlin (1979) used five different models for simulating scores. The binomial regression model was chosen over the others because the relative size of Marshall's p̂_o using this model better reflected the distance between the cut-off and the mode(s) for distributions similar to those used herein.

RESULTS

Population Values

Tables 1 and 2 present the population distributional characteristics associated with each randomly and classically parallel alternate form, respectively. For the bimodal distribution within the randomly parallel condition, one five-item form had only one mode and one ten-item form had three modes. As can be seen from the skewness and kurtosis moments, the normal distributions departed from their theoretical shape.

For each condition, Tables 3 to 8 present the alternate form population values of the classical reliability coefficient (ρ11), Livingston's k²(X, T_X), Brennan and Kane's Φ(λ), Brennan and Kane's Φ, p_o, and kappa. To compute the kappa coefficient for classically parallel tests, the average of the corresponding marginal probabilities was used to determine the probability of chance agreement. Similarly, for k²(X, T_X), the average of the classically parallel tests' means and variances were used as the values of μ_X (μ_Y) and σ²_X (σ²_Y), respectively.

As can be seen in Table 3, ρ11 increased as test length increased. In general, given a particular distribution and test length, ρ11 was higher in the classically parallel condition than in the randomly parallel condition. The exceptions occurred for shorter tests.
Comparing the results for each distribution, it becomes clear that tests derived from more internally consistent domains had higher alternate form reliabilities than those from domains in which the item intercorrelations were not as high.

Table 1.--Characteristics of Each Randomly Parallel Alternate Form (columns: Distribution, Test Length, Mean, Mode, Standard Deviation, Skewness, Kurtosis)

Table 2.--Characteristics of Each Classically Parallel Alternate Form (columns: Distribution, Test Length, Mean, Mode, Standard Deviation, Skewness, Kurtosis)
Table 3.--Classical Reliability of Randomly and Classically Parallel Alternate Forms for Each Distribution/Test Length Combination.

          Randomly Parallel Alternate Forms          Classically Parallel Alternate Forms
Test
Length    Skewed    J     Bimodal   Normal           Skewed        J     Bimodal   Normal
   5       .60     .94     .77    [illegible]      [illegible]    .93     .80       .16
  10       .62     .94     .89    [illegible]      [illegible]    .97     .90    [illegible]
  15       .74     .96     .93    [illegible]      [illegible]    .98     .94    [illegible]
  20       .77     .97     .95    [illegible]      [illegible]    .98     .95    [illegible]

Table 4.--Alternate Form Population Values of Livingston's k²(X, T_X) for Each Cell of the Design.

                    Randomly Parallel Alternate Forms      Classically Parallel Alternate Forms
Test     Cut-off
Length   Score      Skewed    J     Bimodal   Normal       Skewed    J     Bimodal   Normal
   5        4        .736    .945    .734     .336          .774    .940    .800     .181
  10        7        .847    .939    .899     .190          .904    .975    .917     .413
  10        8        .709    .945    .888     .558          .809    .977    .899     .523
  10        9        .591    .954    .905     .775          .692    .981    .909     .735
  15       11        .876    .962    .927     .309          .921    .977    .946     .466
  15       12        .809    .965    .925     .563          .869    .979    .942     .590
  15       14        .756    .973    .942     .843          .782    .984    .954     .832
  20       14        .911    .966    .941     .416          .942    .980    .956     .518
  20       16        .824    .969    .936     .654          .883    .982    .951     .710
  20       18        .761    .975    .947     .842          .830    .985    .958     .862

Table 5.--Population Values of Brennan and Kane's Φ(λ) for Each Cell of the Design.

                    Randomly Parallel Alternate Forms
Test     Cut-off
Length   Score      Skewed    J     Bimodal   Normal
   5        4        .632    .915    .789     .378
  10        7        .877    .951    .893     .393
  10        8        .774    .956    .882     .548
  10        9        .697    .964    .897     .736
  15       11        .894    .967    .921     .521
  15       12        .837    .970    .918     .645
  15       14        .790    .977    .935     .841
  20       14        .935    .975    .943     .564
  20       16        .873    .977    .937     .708
  20       18        .822    .981    .946     .848

Table 6.--Population Values of Brennan and Kane's Φ for Each Cell of the Design.
                    Randomly Parallel Alternate Forms
Test     Cut-off
Length   Score      Skewed    J     Bimodal   Normal
   5        4        .535    .905    .789     .244
  10        7        .697    .950    .882     .392
  10        8        .697    .950    .882     .392
  10        9        .697    .950    .882     .392
  15       11        .775    .966    .918     .492
  15       12        .775    .966    .918     .492
  15       14        .775    .966    .918     .492
  20       14        .821    .974    .937     .563
  20       16        .821    .974    .937     .563
  20       18        .821    .974    .937     .563

Table 7.--Alternate Form Population Values of p_o for Each Cell of the Design.

                    Randomly Parallel Alternate Forms      Classically Parallel Alternate Forms
Test     Cut-off
Length   Score      Skewed    J     Bimodal   Normal       Skewed    J     Bimodal   Normal
   5        4        .927    .943    .852     .499          .935    .928    .925     .613
  10        7        .931    .945    .938     .497          .957    .977    .966     .667
  10        8        .873    .908    .891     .621          .917    .964    .929     .657
  10        9        .783    .847    .828     .815          .826    .904    .829     .733
  15       11        .926    .943    .931     .569          .963    .967    .953     .643
  15       12        .896    .925    .897     .615          .934    .950    .933     .650
  15       14        .760    .835    .782     .893          .766    .851    .791     .879
  20       14        .940    .953    .944     .645          .956    .965    .953     .676
  20       16        .898    .929    .908     .683          .927    .948    .927     .749
  20       18        .798    .862    .833     .888          .830    .885    .859     .900

Table 8.--Alternate Form Population Values of Kappa for Each Cell of the Design.

                    Randomly Parallel Alternate Forms      Classically Parallel Alternate Forms
Test     Cut-off
Length   Score      Skewed    J     Bimodal   Normal       Skewed    J     Bimodal   Normal
   5        4        .517    .875    .635     .106          .591    .846    .792     .135
  10        7        .505    .879    .824     .132          .611    .948    .897     .221
  10        8        .435    .807    .735     .162          .572    .922    .809     .310
  10        9        .387    .692    .637     .120          .461    .803    .622     .214
  15       11        .579    .878    .815     .215          .705    .928    .864     .270
  15       12        .579    .844    .752     .154          .636    .893    .824     .279
  15       14        .462    .668    .566     .107          .439    .703    .576     .305
  20       14        .593    .896    .842     .296          .690    .924    .865     .335
  20       16        .589    .851    .777     .253          .696    .892    .819     .372
  20       18        .500    .725    .655     .233          .580    .771    .705     .283

Several characteristics of the criterion-referenced coefficients deserve attention. First, not surprisingly, those computed using classically parallel tests were generally greater than their counterparts in the randomly parallel condition. (This comparison was, of course, only relevant for k²(X, T_X), p_o, and kappa.)
The exceptions appeared to be related to the size of ρ11 and the location of the cut-off. For example, within the J distribution, Table 3 indicates that ρ11 of the 5-item randomly parallel tests was slightly greater than its classically parallel counterpart. Likewise, k²(X, T_X), p_o, and kappa were also higher in the randomly parallel condition. In other cases, k²(X, T_X) was higher in the randomly parallel condition even though ρ11 was lower. In these instances, the means of the randomly parallel tests were further from the cut-off than the means of the classically parallel tests. As Shavelson et al. (1972) noted, the difference between the cut-off and the mean can influence k²(X, T_X) more than ρ11 does. For p_o and kappa, the relationship of the cut-off to heavy score density areas and to the size of the chance agreement probability appeared to account for the other exceptions.

Second, Φ(λ) and Φ increased as test length increased. Except for a few instances in the randomly parallel condition, k²(X, T_X) was also an increasing function of test length. Contrary to previous findings, p_o and kappa did not follow this trend even though ρ11 increased (Eignor & Hambleton, 1979; Subkoviak, 1978). This latter result indicates that the size of the error (expressed as a proportion) found in classical reliability may not correlate with the proportion of error found in reliability coefficients based on the Platonic true score model.

Third, given a particular test length, p_o increased as the cut-off moved away from heavy score density areas. For the skewed, J, and bimodal distributions, these areas were in the upper extremes of the distribution. Although p_o has been known to increase as the cut-off approaches the extremes, the score density appears to have had more influence on the size of p_o in this study.
Except for the normal distribution, the changes in the value of kappa as a function of the cut-off generally followed the same pattern as p_o. One might expect kappa to become higher as the cut-off approaches denser areas because the probability of chance agreement decreases. However, the author believes that due to the large size of these dense areas in the skewed, J, and bimodal distributions, p_o was reduced enough to outweigh this factor. For the normal distribution, the strength of the heavy score density areas and the size of the chance agreement probability also appeared to interact, producing some unusual patterns of kappa coefficients. Finally, as expected, k²(X, T_X) and Φ(λ) increased as the distance between the cut-off and the mean increased.

Bias

Appendices A1 to A24 present the mean bias and standard deviation of each single test administration coefficient for each cell of the design. A negative value indicates underestimation, and a positive value means that the single test administration coefficient overestimated its population value. Except in two instances, the results for Subkoviak's and Marshall's p_o estimates were equal, confirming Marshall and Serlin's findings (1979). For the two exceptions, one for bias and one for standard deviation, the results differed by only .001, indicating that the differences may simply be due to rounding error. Therefore, to avoid redundancy, only one of these coefficients, Subkoviak's p̂_o, is mentioned and discussed below. The reader should assume that this discussion applies equally to Marshall's p̂_o.

To investigate each hypothesis, the mean of these statistics across appropriate cells was computed. In doing so, each cell's mean and standard deviation was weighted by the number of samples upon which it was based. Throughout the ensuing discussion, the use of the term "significance" means practical significance, rather than statistical significance.
For this study, any mean biases and standard deviations greater than or equal to .025 and differences between mean biases and standard deviations greater than or equal to this value were considered practically significant.

The relative ability of the coefficients to estimate their respective population reliability coefficients for randomly versus classically parallel tests was examined for k²(X, T_X), p_o, and kappa since their single test administration estimates assume classic parallelism. Collapsing across number of examinees, distribution type, test length, and cut-off score, Table 9 contains the mean bias of each estimate for both types of parallelism. Contrary to expectation, the absolute mean bias in the randomly parallel condition was less than or equal to that in the classically parallel condition for every coefficient. However, the only significant difference between the two conditions was for Subkoviak's κ̂.

Taking direction into account, violation of the classic parallelism assumption significantly altered the mean bias of k²(X, T_X) and the kappa estimates, while the p̂_o estimates were fairly robust.

Table 9.--Mean Bias (Across Cells) of Various Coefficients in Estimating the Reliability of Classically and Randomly Parallel Alternate Forms.

                              Type of Parallelism
Coefficient                   Random      Classic
Livingston's k²(X, T_X)        .019        -.019
Subkoviak's p̂_o               -.010        -.030
Huynh's p̂_o                    .011        -.012
Subkoviak's κ̂                 -.023        -.111
Huynh's κ̂                      .039        -.050

In the classically parallel case, all indices underestimated the population coefficient with the kappa estimates and Subkoviak's p̂_o doing so significantly. Given randomly parallel tests, only Subkoviak's coefficients were underestimates. The others overestimated their corresponding parameters with Huynh's κ̂ being a significant overestimate. In previous research, Huynh's coefficients have always been underestimates (Huynh & Saunders, 1979; Subkoviak, 1978). However, these studies used equivalent tests.
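The practical significance criterion of .025 can be applied mechanically to the Table 9 values, as the sketch below illustrates. The mean biases are transcribed from Table 9; the dictionary keys, the function name `is_significant`, and the rounding to three decimals (to match the precision of the table) are choices made for this illustration only.

```python
# Mean biases from Table 9: (randomly parallel, classically parallel)
TABLE_9 = {
    "Livingston k2": (0.019, -0.019),
    "Subkoviak p_o": (-0.010, -0.030),
    "Huynh p_o": (0.011, -0.012),
    "Subkoviak kappa": (-0.023, -0.111),
    "Huynh kappa": (0.039, -0.050),
}
THRESHOLD = 0.025  # practical-significance criterion stated in the text


def is_significant(value, threshold=THRESHOLD):
    """True when |value|, rounded to three decimals, meets the threshold."""
    return round(abs(value), 3) >= threshold


# Coefficients with a practically significant mean bias in each condition
sig_random = [c for c, (r, _) in TABLE_9.items() if is_significant(r)]
sig_classic = [c for c, (_, k) in TABLE_9.items() if is_significant(k)]

# Differences between conditions: absolute bias versus signed bias
sig_abs_change = [
    c for c, (r, k) in TABLE_9.items() if is_significant(abs(r) - abs(k))
]
sig_signed_change = [
    c for c, (r, k) in TABLE_9.items() if is_significant(r - k)
]
```

With these values, the absolute-bias comparison flags only Subkoviak's kappa estimate, whereas the signed comparison flags Livingston's coefficient and both kappa estimates, which agrees with the pattern reported in the text.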
The present findings support the past research, but also indicate that past results do not generalize to the randomly parallel condition.

The second hypothesis was that the po and kappa estimates would be more biased for those distributions not belonging to the beta-binomial family (i.e., bimodal and normal). Even though no hypotheses were generated for the influence of the distribution upon k²(X,Tx), Φ(λ), and Φ, Table 10 presents the mean bias of every coefficient for each distribution.

Table 10.--Mean Bias Across Cells of Each Reliability Coefficient for Each Distribution.

                                        Distribution
Coefficient                  Skewed    J-Shaped    Bimodal    Normal
Livingston's k²(X,Tx)         .004      -.005       -.025      .028
Brennan & Kane's Φ(λ)ᵃ        .057       .008        .010      .061
Brennan & Kane's Φᵃ           .086       .009        .011      .141
Subkoviak's p̂o              -.020      -.011       -.028     -.019
Huynh's p̂o                  -.011       .018       -.011      .000
Subkoviak's κ̂               -.076      -.032       -.069     -.092
Huynh's κ̂                   -.018       .034       -.025     -.014

ᵃMeans for these coefficients were based only on cells within the randomly parallel condition.

Based upon the absolute value of the mean bias, the pattern of results for Subkoviak's coefficients conformed somewhat to that predicted. Specifically, Subkoviak's p̂o and κ̂ were least biased for the J distribution and most biased given the bimodal and the normal distributions, respectively. In the case of Subkoviak's κ̂, the differences between the J and the other distributions were significant. Contrary to expectation, Subkoviak's p̂o was almost equally biased for the skewed and normal distributions, and Subkoviak's κ̂ was slightly more biased for the skewed than for the bimodal. Generally, the absolute mean bias of Huynh's coefficients followed a pattern opposite to that predicted; Huynh's p̂o and κ̂ were least biased for the normal distribution and most biased for the J distribution. However, as expected, Huynh's κ̂ was less accurate for the bimodal than for the skewed distribution.
For Huynh's p̂o, the biases associated with these two distributions were equal and in the same direction. In no case did any distribution significantly change the absolute mean bias of Huynh's coefficients.

Considering both magnitude and direction, Subkoviak's p̂o consistently underestimated the population value with significant bias occurring for the bimodal distribution. Note, however, that the mean bias of this coefficient was not significantly altered by changes in the distribution's shape, regardless of whether or not the type of distribution violated the underlying assumptions. On the other hand, altering the distribution changed the direction of bias for Huynh's p̂o, leading to significant differences between the results for the J distribution and those found for the skewed and bimodal distributions. Specifically, Huynh's p̂o was unbiased for the normal distribution, slightly negatively biased for the skewed and bimodal, and positively biased for the J distribution. In no case were these degrees of bias significant.

For both kappa estimates, the bias associated with the J distribution was significantly different from that found in the other conditions. Specifically, Subkoviak's κ̂ underestimated kappa much more for the other distributions, although the extent of bias was significant throughout. In the case of Huynh's κ̂, the J distribution significantly affected the direction of bias as it had done for Huynh's p̂o; Huynh's κ̂ was positively biased for the J distribution and negatively biased for the others. In addition, the biases associated with the J and bimodal distributions were significant.

Table 10 indicates that the bias of k²(X,Tx) was significantly affected by the type of distribution in terms of both magnitude and direction. The biases for the bimodal and normal distributions differed significantly, with significant underestimation associated with the former and an approximately equal, but positive, bias corresponding to the latter distribution.
The mean biases for the skewed and J distributions were close to zero and were significantly different from those found for the bimodal and normal distributions, respectively.

Φ(λ) and Φ followed the same pattern. They consistently overestimated their parameters and were significantly less accurate for the normal and skewed distributions than for the others. As a matter of fact, the extent of bias associated with the normal and skewed distributions was quite high and significant, but was very slight for the other distributions. For Φ, the normal distribution's mean bias was also significantly greater than that found for the skewed distribution. Finally, Φ was more biased than Φ(λ), although the differences for the J and bimodal distributions were negligible.

Moving the location of the cut-off score was expected to have no influence on the coefficients' accuracy. The mean bias associated with each cut-off score can be found in Table 11. These means were based on the results for 10-, 15-, and 20-item tests. Five-item tests were not included because only one cut-off score was examined for this test length, i.e., the design of the study was not completely crossed. Although the cut-off scores associated with the 15-item tests were not exactly equal to those of the other two test lengths, the researcher felt the slight deviations would not significantly affect the results.

Table 11.--Mean Bias Across Cells of Each Coefficient for Each Cut-off Score.

                                   Cut-Off Score
Coefficient                  70%       80%       90%
Livingston's k²(X,Tx)        .016     -.004     -.003
Brennan & Kane's Φ(λ)ᵃ       .038      .032      .035
Subkoviak's p̂o             -.012     -.029     -.012
Huynh's p̂o                 -.010     -.014      .027
Subkoviak's κ̂              -.071     -.069     -.030
Huynh's κ̂                  -.052     -.016      .075

ᵃMeans for these coefficients were based only on cells within the randomly parallel condition.

As can be seen, the expectation was confirmed for k²(X,Tx), Φ(λ), and Subkoviak's p̂o since changes in the cut-off score did not significantly alter these coefficients' accuracy.
However, the biases of Huynh's estimates and Subkoviak's κ̂ for the 90% cut-off were significantly different from those found for the other two cut-offs. Specifically, Huynh's κ̂ significantly overestimated kappa for the 90% cut-off, but significantly and moderately underestimated this parameter for cut-offs of 70% and 80%, respectively. Huynh's p̂o followed a similar pattern, although the bias associated with the 70% cut score was not significant. Subkoviak's κ̂ significantly underestimated kappa, regardless of cut-off score, but did so significantly less for the 90% cut score. Finally, for Huynh's κ̂, setting the cut-off score at 70% led to significantly more underestimation than did the 80% cut-off.

Since the main effects hypotheses concerning bias were generally unsupported, a three-way interaction effect among the relevant variables (i.e., type of parallelism, distribution, and cut-off score) was examined. The bias of each 5-item test was again excluded because only one cut-off score was examined for this test length. The results of this analysis can be seen in Table 12 and are discussed below for each coefficient separately.

Livingston's k²(X,Tx). For the J and bimodal distributions, neither violating the classic parallelism assumption nor moving the cut-off score significantly altered this coefficient's accuracy. On the other hand, the absolute mean biases belonging to the skewed and normal distributions were significantly greater in the randomly parallel condition than in the classically parallel case for cut-off scores located nearest to the distributions' population means. Accounting for both magnitude and direction, altering parallelism conditions significantly changed the bias of k²(X,Tx) for every cut-off score within the skewed distribution and for the 70% cut-off within the normal distribution.
In the former case, the differences increased as the cut-off approached the population mean since the mean bias became more negative in the classically parallel case and more positive in the randomly parallel condition. As a matter of fact, varying the cut-off score significantly altered the bias in the randomly parallel condition. Significant differences as a function of cut-off score were also evident in the classically parallel condition when the results for the 70% and 90% cut-offs were compared. For the normal distribution, altering the cut-off did not appreciably affect the mean bias in the classically parallel condition. However, given random parallelism, the mean bias associated with the 70% cut-off was very large and significantly different from that found for the other two cut-offs. Specifically, the estimate greatly overestimated its parameter when the cut-off equalled 70%, but fairly accurately estimated k²(X,Tx) for the 80% cut-off, and moderately underestimated k²(X,Tx) given the 90% cut-off.

Table 12.--Mean Bias Across Cells of Each Coefficient for Every Parallelism/Distribution/Cut-off Score Combination.

70% Type of Bi- Coefficient Parallelism Skewed J modal Normal A2 Random .018 .004 -.016 .171 Livingston's g (X,;£) Brennan & Kane's $01) Random .027 .008 .013 .107 Subkoviak's go A Random -.O17 -.006 -.037 .090 Huynh's Bo A Random .000 -.028 -.071 -.037 Subkoviak's _Ig Classic -.205 -.040 -.084 -.106 A Ratldom 0016 ”0012 ”0078 0037 Huynh's 5_ (3138810 -0175 -003” -012“ -0037

Table 12 (cont'd.)

80% Type of Bi- Coefficient Parallelism Skewed J modal Normal A2 Random .050 .002 -.017 -.003 Livingston's,§ (X’Ix) " Classic -.028 -.008 -.031 . .000 Brennan & Kane's 9%)) Random .054 .007 .015 .053 Subkoviak's‘p‘£2 Classic -.035 -.027 -.037 -.062 A Random -0009 0016 -0012 0002 Huynh's 29_ Classic -.038 -.003 -.032 -.036 A Random .026 -.022 -.048 -.025 Subkoviak's‘g A Random .062 .032 -.007 .056 Huynh's K Classic -.109 -.008 -.065 -.093

Table 12 (cont'd.)

90% Type of Bi- Coefficient Parallelism Skewed J modal Normal Livingston's §?(§,Ix) ‘ Classic -.042 -.007 -.030 -.002 Brennan & Kane's dMA) Random .083 .005 .014 .038 Subkoviak's p0 Huynh's Bo " Classic .007 .061 .048 -.021 Subkoviak's 3 Classic -.057 -.027 -.008 -.190 A Random .150 .151 .108 .024 Huynh's 5

Although no hypothesis was made concerning the influence of distributional shape, this variable did have an impact. In the randomly parallel condition, the effects varied across cut-off score due to the changes induced by this variable within the normal and skewed distributions. The J distribution resulted in the least bias. As a matter of fact, its bias was close to zero, regardless of cut-off score. Although the bias was not significant, the estimate consistently underestimated its parameter for the bimodal distribution. The relationship between the J and bimodal distributions' results remained fairly consistent across cut-off score. The greatest degree of bias was associated with the normal distribution for the 70% cut-off and with the skewed distribution given the other two cut-offs. In these instances, the bias differed significantly from that corresponding to the other distributions with each one significantly overestimating its population value. The only other significant difference was found between the bimodal and skewed distributions for a 70% cut-off score. The estimate underestimated k²(X,Tx) in the former case and overestimated it in the latter case.

For the classic parallelism condition, the pattern of mean bias created by changing the distribution was similar across cut-off score.
The mean biases for the normal and J distributions were almost zero in every case with the former distribution resulting in no bias for the 80% cut-off. The mean biases associated with the skewed and bimodal distributions were quite similar. In both cases, k²(X,Tx) was consistently underestimated with significant bias occurring for the 80% and 90% cut-offs. The bimodal distribution led to significant underestimation for the 70% cut-off, as well. The biases produced by these distributions were significantly different from those found for the normal distribution, regardless of cut-off score. For the 90% cut-off, the skewed distribution also displayed significantly more bias than the J distribution did. Once again, the relationship between the mean biases of the J and bimodal distributions was fairly consistent across cut-off score.

Brennan and Kane's Φ(λ). Since Φ(λ) was not computed for classically parallel tests in this study, Table 12 contains the mean bias of this coefficient for every distribution/cut-off score combination. The results almost paralleled those found for k²(X,Tx) in the randomly parallel condition. As a matter of fact, in terms of absolute value, the patterns of results for k²(X,Tx) and Φ(λ) were, with one exception, nearly identical. In many cases, the actual degrees of bias were very similar. Note, however, that Φ(λ), on the average, consistently overestimated its parametric value, while k²(X,Tx) did not. The following discussion elaborates upon the similarities between these coefficients.

Altering the cut-off score within the J and bimodal distributions hardly changed this coefficient's accuracy. However, as the cut-off approached the population means of the other distributions, the mean biases increased. For the skewed distribution, each increase was significant. When the distribution was normal, the mean bias associated with the 70% cut-off score was significantly greater than that found for the other two cut-offs.
Changes in the frequency distribution also altered the results. The mean biases corresponding to the J distribution were consistently close to zero. The bimodal distribution created slightly more inaccuracy. Because neither of these distributions was affected by cut-off score, the relationship between them remained fairly constant across cut-off score. When the cut-off was 70%, the normal distribution's mean bias was extremely large and significantly different from that found for the other distributions. Contrary to the pattern established by k²(X,Tx), the biases of the skewed and normal distributions were, on the average, comparable, significant, and significantly different from that found for the other two distributions when the cut-off equalled 80%. Finally, given a 90% cut-off, the skewed distribution produced a large mean bias which was significantly greater than the biases of the other distributions. In this situation, the mean bias associated with the normal distribution was also significant as well as significantly greater than the J distribution's mean bias.

Subkoviak's p̂o. In terms of absolute value, violating the classic parallelism assumption did not significantly affect the accuracy of Subkoviak's p̂o for the J and bimodal distributions. If one considers direction, however, type of parallelism did significantly alter the J distribution's results when the cut-off was 90%; on the average, the random parallelism situation led to overestimation, while its classically parallel counterpart produced a fairly accurate estimate.

Contrary to the hypothesis, the skewed distribution's absolute mean bias was greater when the classic parallelism assumption was valid. As the cut-off approached this distribution's population mode (mean), the differences in the mean bias of the classically and randomly parallel conditions increased since the bias became more negative in the former case and less negative in the latter case.
In the latter condition, the bias was even slightly positive for the 90% cut-off. However, whether or not direction was taken into account, violating the classic parallelism assumption significantly altered only the biases corresponding to the 80% and 90% cut-offs. Although neither type of parallelism was consistently associated with less bias, significant differences also occurred for the normal distribution. For the 80% cut-off, the absolute mean bias was greater in the classically parallel situation, while the opposite was true for the 90% cut-off. Taking direction into account, violating the classic parallelism assumption significantly altered the results for all cut-off scores within this distribution. For the 80% and 90% cut-off scores, p̂o consistently underestimated its parametric value. The bias corresponding to the 70% cut-off was negative in the classically parallel condition but positive in the randomly parallel situation.

Keeping type of parallelism constant, changing the cut-off score did not significantly alter the mean biases within the beta-binomial distributions, except in the case of randomly parallel J distributed tests. In this instance, the mean bias for the 90% cut-off was positive, while the mean biases for the 70% and 80% cut-offs were slightly negative. When the distribution was either bimodal or normal, moving the cut-off did lead to significant differences. Specifically, given classic parallelism, the 80% cut-off produced much more underestimation than did the 90% cut-off. In addition, for the normal distribution, the 70% cut-off resulted in significantly more negative bias than did the 90% cut-off. The accuracy of the reliability estimates for randomly parallel tests was also significantly affected. When the distribution was bimodal, Subkoviak's p̂o largely underestimated po for a 70% cut-off but fairly accurately estimated the population value given the 90% cut-off.
For the normal distribution, Subkoviak's coefficient largely overestimated po when the cut-off was 70% while largely underestimating po for higher cut-off scores. Also, the 90% cut-off produced significantly more underestimation than did the 80% cut-off.

For the 70% and 80% cut-off scores, the patterns of results formed by changing the distributional shape were similar and partially supported the hypothesis that the normal and bimodal distributions would produce more bias than the beta-binomial distributions. Given random parallelism, the biases corresponding to the J and skewed distributions were low, negative, and approximately equal. The bimodal distribution produced significant underestimation, but the results were not significantly different from those found for the J and skewed distributions. Although differing in direction, the normal distribution was significantly biased for both cut-off scores and, for the cut-off closest to its population mode (mean), significantly more biased than the other distributions. When the cut-off equalled 80%, the bias found for the normal distribution was also much worse than that found for the beta-binomial distributions, but the differences did not attain significance. Given classic parallelism, the J distribution once again produced the least bias with significant underestimation occurring for the 80% cut-off. The skewed and bimodal distributions resulted in slightly more negative bias which, therefore, also reached significance for the 80% cut-off. For the latter distribution, the underestimation also exceeded .025 in magnitude when the cut-off was 70%. When the distribution was normal, significant underestimation occurred for both cut-off scores. In support of the hypothesis, the differences between the normal distribution's results and those of the beta-binomial distributions attained significance.
The mean biases associated with the normal and bimodal distributions also differed significantly for the 80% cut-off.

The pattern of results for the 90% cut-off was somewhat different. Unexpectedly, the bimodal distribution, on the average, produced fairly accurate estimates in both parallelism conditions. For the randomly parallel situation, the negative bias associated with the normal distribution was significant and, in terms of absolute value, significantly greater than that found for the other distributions. However, such was not the case for the classically parallel condition. As a matter of fact, the skewed distribution claimed the greatest bias which was significantly negative as well as significantly greater than that found for the J and bimodal distributions.

Huynh's p̂o. Similar to the results found for Subkoviak's p̂o, violating the classic parallelism assumption did not significantly affect the bias within the J and bimodal distributions. When the distribution was skewed, type of parallelism did significantly alter the results for the 80% and 90% cut-offs. These differences were significant whether or not direction was taken into account. The direction and degrees of bias corresponding to the 80% cut-off were almost exactly the same as those found for Subkoviak's p̂o with the classic parallelism situation resulting in more underestimation. However, for the 90% cut-off, Huynh's p̂o was significantly more biased in the randomly parallel condition, while Subkoviak's estimate produced more bias in the classically parallel condition. In fact, Huynh's coefficient significantly overestimated the parameter in the randomly parallel condition but provided a fairly accurate estimate in the classically parallel condition. Finally, whether or not one considers direction, the biases associated with each parallelism condition within the normal distribution were significantly different from each other, regardless of cut-off score.
Once again, both po estimates followed a similar pattern. For cut-offs of 70% and 90%, the absolute mean bias associated with random parallelism was greater than that found in the classically parallel condition, while the opposite held true for the 80% cut-off. The mean bias was consistently negative in the classic parallelism condition. However, in the random parallelism situation, the mean bias was highly positive, virtually zero, and highly negative for the 70%, 80%, and 90% cut-offs, respectively.

Changes in the cut-off score significantly impacted the mean bias within every distribution. For those distributions with their population mode (mean) close to 90%, the mean bias corresponding to this extreme cut-off differed significantly from that found for the other cut-off scores. Generally, the mean biases for the 90% cut-off were significantly positive, while the mean biases associated with the other cut-off scores ranged from slightly positive to significantly negative. One exception to this trend occurred when estimating the reliability of classically parallel tests having skewed distributions. In this case, the mean bias for the 90% cut-off was close to zero. Among the three distributions, the bimodal produced the only significant difference between the 70% and 80% cut-offs; given random parallelism, significantly more negative bias was found for the former than for the 80% cut-off. On the other hand, within the normal distribution, altering the cut-off had no major effect in the classic parallelism condition. In the case of random parallelism, the results for the various cut-off scores were all significantly different from each other. As noted previously, the 70% cut-off led to significant overestimation as it had done for most of the other coefficients, while the 80% and 90% cut-off scores resulted in a fairly accurate estimate and a significant negative bias, respectively.
The hypothesis that the normal and bimodal distributions would produce more bias than the beta-binomial distributions was generally unsupported. Although the type of distribution significantly affected the direction and/or extent of bias, no consistent pattern could be found either across cut-off score or parallelism condition.

Subkoviak's κ̂. Contrary to previous results, type of parallelism significantly affected the accuracy of Subkoviak's κ̂ within every distribution. In terms of absolute value, the J and bimodal distributions were sensitive to this variable when the cut-off was 80%; κ̂ was more negatively biased in the classically parallel condition. When direction was considered, parallelism produced an additional significant effect for the J distribution. Specifically, when the cut-off was 90%, the biases associated with the two types of parallelism were equal but opposite in direction.

As usual, parallelism significantly affected the results associated with the skewed and normal distributions. In terms of absolute value, the differences were significant for all cut-off scores within the normal distribution and for the 70% and 80% cut-off scores within the skewed distribution. In all these cases, the classic parallelism condition produced more bias than did its randomly parallel counterpart. When direction was considered, all respective comparisons within these two distributions were significant. For classically parallel tests having skewed distributions, κ̂ consistently underestimated its parameter. However, for randomly parallel skewed tests, κ̂ was, on the average, unbiased when the cut-off was 70% and positively biased given the other cut-offs. When the tests were normally distributed, κ̂ underestimated the population value, regardless of cut-off score and type of parallelism.

Cut-off score also had a pervasive effect.
For the three distributions having their population modes (means) near 90%, the mean biases associated with this extreme cut-off were, in general, significantly less negative than that found for the other two cut-offs. Two very distinct deviations from this trend occurred for the skewed and J distributions within the randomly parallel condition. For the former distribution, the mean bias was either zero or positive and increased significantly as the cut-off approached 90%. For the J distribution, moving the cut-off affected the bias' direction but not its magnitude in that κ̂ overestimated kappa for the 90% cut-off and almost equally underestimated this parameter for the other cut-offs. In addition, given classic parallelism, the 80% cut-off produced significantly greater underestimation than did the 70% cut-off for the J distribution, while the opposite occurred for the skewed distribution.

The pattern of results formed by moving the cut-off was quite different for the normal distribution. In the randomly and classically parallel situations, the 90% cut-off produced more underestimation than did the 80% and 70% cut-offs, respectively. In addition, given classic parallelism, the 80% cut-off resulted in significantly more negative bias than did the 70% cut-off.

Generally, the hypothesis that κ̂ would be more accurate for beta-binomial distributions was not supported. Once again, the pattern of results formed by altering the distribution varied across cut-off score. However, in the classically parallel condition, κ̂ significantly underestimated kappa for every distribution and cut-off score, except one; for the bimodal distribution, the bias found when the cut-off was 90% was close to zero. In the randomly parallel condition, the degree of bias was generally significant but varied in direction. However, for the skewed and bimodal distributions, the mean bias was zero or close to zero for the 70% and 90% cut-off scores, respectively.

Huynh's κ̂.
With relatively few exceptions, the bias of Huynh's κ̂ was significantly affected when any of the variables in the study changed values. Moreover, the results did not follow any pattern, making interpretation very difficult. Therefore, the following observations are not as specific as those presented for the other coefficients.

Across all distributions and cut-off scores, the absolute mean biases within the parallelism conditions were comparable in only four cases. For the 90% cut-off within the J distribution, κ̂ produced significantly more overestimation in the randomly parallel condition. On the other hand, when the distribution was bimodal, the bias was significantly more negative in the classically, as opposed to the randomly, parallel condition when the cut-off was either 70% or 80%. In terms of absolute value, the skewed distribution also produced significantly more bias for the classically parallel situation given cut-off scores of 70% and 80%. However, when the cut-off equalled 90%, the randomly parallel condition produced significantly more bias. Finally, the absolute mean bias for classically parallel normally distributed tests was greater than that found for their randomly parallel counterparts for cut-off scores of 80% and 90%. Violating the classic parallelism assumption also affected the direction as well as the magnitude of the bias, especially for the skewed and normal distributions. For both these distributions, the bias of κ̂ was positive in the random parallelism condition and, generally, negative in the classic parallelism condition.

For the three distributions having their modes (means) near 90%, the bias associated with this extreme cut-off differed significantly from that found for the other cut-offs. Specifically, the mean biases for the 90% cut-off were significantly positive, while the mean biases corresponding to the other cut-offs ranged from significantly positive to significantly negative.
In all three distributions, the results for the 70% and 80% cut-offs also differed significantly, regardless of parallelism. As the cut-off moved from 70% to 90%, the bias moved toward overestimation.

Changing the cut-off did not affect the normal distribution as drastically. In the randomly parallel condition, the bias for the 80% cut-off was significantly more positive than that found for the 90% cut-off. When the tests were classically parallel, the 70% cut-off produced significantly less negative bias than did the 80% and 90% cut-off scores.

The hypothesis that κ̂ would be more accurate for the skewed and J distributions than for the other two distributions was generally not supported. Once again, the pattern of results was inconsistent across cut-off score. In general, the degrees of bias were significant. However, for the 80% cut-off, the biases associated with the J and bimodal distributions were close to zero for the classically and randomly parallel conditions, respectively.

Sampling Variability

The variability of each coefficient across samples was predicted to be inversely related to the test length and the sample size. For each coefficient, Tables 13 and 14 present the weighted mean standard deviation across cells associated with each test length and sample size, respectively. Clearly, the results support the hypothesis, i.e., the sampling variability decreased as the test length increased as well as when the sample size increased. As can be seen, Subkoviak's coefficients were more variable than Huynh's coefficients for every test length and cut-off score. Φ appeared to be more unstable than Φ(λ). Table 15 contains the mean standard deviation for every test length/sample size combination. No significant interaction effects were evident.

Table 13.--Mean Standard Deviation Across Cells of Each Coefficient for Each Test Length.

                                   Test Length
Coefficient                  10        15        20
Livingston's k²(X,Tx)        .049      .039      .032
Brennan & Kane's Φ(λ)ᵃ       .042      .028      .019
Brennan & Kane's Φᵃ          .050      .037      .026
Subkoviak's p̂o              .035      .033      .030
Huynh's p̂o                  .029      .025      .022
Subkoviak's κ̂               .089      .085      .078
Huynh's κ̂                   .073      .064      .058

ᵃThe mean standard deviations for these coefficients were based only on cells within the randomly parallel condition.

Table 14.--Mean Standard Deviation Across Cells of Each Coefficient for Each Sample Size.

                                   Sample Size
Coefficient                  25        35        50
Livingston's k²(X,Tx)        .050      .043      .042
Brennan & Kane's Φ(λ)ᵃ       .040      .036      .031
Brennan & Kane's Φᵃ          .050      .043      .037
Subkoviak's p̂o              .038      .033      .030
Huynh's p̂o                  .030      .025      .024
Subkoviak's κ̂               .099      .085      .077
Huynh's κ̂                   .078      .067      .063

ᵃThe mean standard deviations for these coefficients were based only on cells within the randomly parallel condition.

Table 15.--Mean Standard Deviation Across Cells for Each Sample Size/Test Length Combination.

                             Test           Sample Size
Coefficient                  Length    25        35        50
Livingston's k²(X,Tx)        10        .055      .048      .046
                             15        .043      .037      .038
                             20        .036      .031      .029
Brennan & Kane's Φ(λ)ᵃ       10        .047      .044      .037
                             15        .031      .027      .026
                             20        .022      .019      .016
Brennan & Kane's Φᵃ          10        .057      .051      .042
                             15        .043      .035      .033
                             20        .031      .025      .022
Subkoviak's p̂o              10        .039      .034      .032
                             15        .038      .031      .029
                             20        .034      .030      .027
Huynh's p̂o                  10        .032      .028      .026
                             15        .028      .023      .022
                             20        .025      .021      .019
Subkoviak's κ̂               10        .100      .088      .080
                             15        .097      .081      .076
                             20        .091      .077      .068
Huynh's κ̂                   10        .081      .073      .067
                             15        .073      .061      .060
                             20        .067      .057      .052

ᵃThe mean standard deviations for these coefficients were based only on cells within the randomly parallel condition.

DISCUSSION

Although all the coefficients in this study, except for Φ(λ), were derived under the assumption of classic parallelism, they were in many cases robust to violation of this assumption, i.e., type of parallelism did not significantly alter the absolute mean bias.
Specifically, for k̂²(X,Tx) and the p̂o estimates, this variable had no significant effect when the distributions were either J-shaped or bimodal. However, type of parallelism, in general, affected the absolute mean bias when the distributions were either skewed or normal.

These findings can perhaps be explained by examining the item characteristics which must be present to form each distribution. If the domain score distribution is either J-shaped or bimodal, the domain must consist of items having fairly homogeneous p values and high item intercorrelations. On the other hand, items within a domain having either a skewed or a normal distribution are more heterogeneous and have lower item intercorrelations. When items are randomly chosen to construct alternate forms, some or all of the statistics computed from one test are more likely to adequately represent the characteristics of the other form when the item domain is more homogeneous. In other words, the relationship between randomly parallel tests derived from a homogeneous, in contrast to a heterogeneous, item domain more closely resembles that found between classically parallel tests. In addition, for k̂²(X,Tx), coefficient alpha is probably a better estimate of the alternate form reliability when the domain is homogeneous, leading to more accurate estimation of k̂²(X,Tx) for randomly parallel tests having a J or bimodal distribution. Given these facts, one would expect type of parallelism to have a greater effect when alternate forms are derived from a heterogeneous item domain, i.e., from an item domain having either a skewed or a normal distribution.

Type of parallelism did affect the absolute mean bias of the kappa estimates, regardless of distribution. However, the effects were somewhat less pervasive for the J and bimodal distributions.

When parallelism did significantly alter the absolute mean biases, the random parallelism condition did not always result in the most bias.
For example, except in one case, Subkoviak's coefficients displayed greater bias in the classic parallelism situation. This latter result can perhaps be understood by examining Subkoviak's formula more closely. Specifically, notice that using a regression estimate of true score causes regression toward the mean, which becomes more severe as KR-20 decreases. When the distribution was either normal or skewed in this study, KR-20 was likely to be fairly low. The resultant regression toward the mean may have caused the alternate form population value within the classically parallel condition to be severely underestimated, leading to a greater degree of bias for the classically parallel situation. As predicted, for k̂²(X,Tx), the random parallelism condition did produce the most bias. Finally, the type of parallelism associated with more bias varied for Huynh's estimates.

Cut-off score had a significant effect on all the coefficients. For k̂²(X,Tx) and Φ̂(λ), the effects were found predominantly within the random parallelism condition when the distributions were either skewed or normal. For the most part, these biases tended to increase as the cut-off approached the mean; the biases were positive when the cut-off was located near the population mean. Note that as the cut-off moves close to the mean, the difference between the mean and the cut-off has less of an impact on the value of these coefficients. When the cut-off almost equals the mean, the squared-error agreement coefficients are approximately equal to KR-20 or KR-21. Therefore, the present results indicate that as the difference between the mean and the cut-off becomes less influential and the norm-referenced reliability coefficient accounts more for the magnitude of k̂²(X,Tx) and Φ̂(λ), the bias becomes significantly more positive for randomly parallel tests having a skewed or a normal distribution.
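The way a squared-error coefficient depends on the distance between the mean and the cut-off can be illustrated with a small computation. The sketch below assumes the usual form of Livingston's coefficient, k²(X,Tx) = [ρσ² + (μ − C)²] / [σ² + (μ − C)²], with KR-20 substituted for the reliability ρ. The response matrix and the function names are hypothetical and are not taken from this study.

```python
from statistics import mean, pvariance

# Hypothetical 0/1 item responses: 6 examinees (rows) by 5 items (columns).
# These values are illustrative only; they are not data from the study.
RESPONSES = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]


def kr20(responses):
    """Kuder-Richardson formula 20 for dichotomously scored items."""
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    var_total = pvariance(totals)
    # Sum of item variances p(1 - p), where p is the item difficulty.
    sum_pq = 0.0
    for j in range(n_items):
        p = mean([row[j] for row in responses])
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)


def livingston_k2(responses, cutoff):
    """Livingston's k^2 with KR-20 used as the reliability estimate (an assumption)."""
    totals = [sum(row) for row in responses]
    var_total = pvariance(totals)
    dist2 = (mean(totals) - cutoff) ** 2  # squared distance of mean from cut-off
    return (kr20(responses) * var_total + dist2) / (var_total + dist2)


if __name__ == "__main__":
    test_mean = mean([sum(row) for row in RESPONSES])
    print("KR-20:", kr20(RESPONSES))
    for c in (test_mean, 3, 4):
        print("cut-off", c, "k2:", livingston_k2(RESPONSES, c))
```

With the cut-off placed at the mean, the coefficient in this sketch reduces to KR-20; as the cut-off moves away from the mean, the squared difference term dominates and the coefficient rises toward 1.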
For the J and bimodal distributions, the bias of k̂²(X,Tx) and Φ̂(λ) did not significantly change as a function of cut-off score. Because of these distributions' homogeneous item domains, KR-20 is probably a fairly accurate estimate of the population coefficient used in these formulas.

No general statements can be made about the effect of cut-off score on the threshold agreement coefficients since the results varied widely. The most consistent finding occurred for Huynh's coefficients. Specifically, for distributions having their population mode (mean) near 90%, the bias associated with this extreme cut-off score was significantly positive. Significant overestimation may occur in this case because, according to the binomial error model, the standard deviation around an extreme true score is smaller than that around non-extreme scores. However, the data in this study do not conform exactly to the binomial error model and, therefore, scores may be more variable than what is predicted by this model. Since scores within these distributions cluster about the 90% cut-off, this increased variability will have more of an effect in decreasing the population values, leading to overestimation by Huynh's coefficients.

The hypothesis that the p̂o and kappa estimates would be less biased for the beta-binomial distributions than for the bimodal and normal distributions was, generally, not supported. A possible explanation for this finding is that the J and skewed distributions in this study did not conform closely enough to members of the beta-binomial family. In addition, the normal and skewed distributions may not have been different enough since the normal distribution was actually somewhat skewed.

SUMMARY AND CONCLUSIONS

Due to the variable results found in this study, no general rules can be offered for choosing between coefficients falling within each category (e.g., uncorrected threshold agreement coefficients) of Figure 2.
For each parallelism/distribution/cut-off score combination, Table 16 indicates the direction of bias produced by each coefficient. If a coefficient had a mean bias less than .025 in either direction, the coefficient was considered to be unbiased. Recommendations about which coefficient to use in each of these cells can be made and are presented in Table 17. Two criteria were used in choosing a coefficient in each case: (1) when the biases were in the same direction, the coefficient with the least bias was selected; (2) when the biases were opposite in direction, the negatively biased coefficient was chosen, unless the positively biased coefficient was fairly accurate (i.e., had a bias near zero) or much more accurate than its competitor. The latter situation occurred only once, where Huynh's κ̂ had a positive, but nonsignificant, bias, while Subkoviak's κ̂ had a significant negative bias.

Table 16.--Direction of Bias of Each Coefficient for Each Parallelism/Distribution/Cut-off Score Combination

70% Cut-off

                           Type of
Coefficient                Parallelism   Skewed      J           Bimodal     Normal

Livingston's k̂²(X,Tx)      Random        No bias     No bias     No bias     Overest.
                           Classic       No bias     No bias     Underest.   No bias

Brennan & Kane's Φ̂(λ)      Random        Overest.    No bias     No bias     Overest.

Brennan & Kane's Φ̂         Random        Overest.    No bias     No bias     Overest.

Subkoviak's p̂o             Random        No bias     No bias     Underest.   Overest.
                           Classic       No bias     No bias     Underest.   Underest.

Huynh's p̂o                 Random        No bias     No bias     Underest.   Overest.
                           Classic       Underest.   No bias     Underest.   No bias

Subkoviak's κ̂              Random        No bias     Underest.   Underest.   Underest.
                           Classic       Underest.   Underest.   Underest.   Underest.

Huynh's κ̂                  Random        No bias     No bias     Underest.   Overest.
                           Classic       Underest.   Underest.   Underest.   Underest.

[The remainder of Table 16 and all of Table 17 are illegible in the source scan.]

Several other points about Table 17 should be mentioned. First, sampling variability was not taken into account in making these recommendations. There were two reasons for taking this course of action: (1) the bias of an estimator is more important than its sampling variability in determining the estimator's adequacy; and (2) Tables 13, 14, and 15 indicate that the differences in stability of each coefficient within a particular category are not very large. Second, there were two instances where Subkoviak's and Huynh's coefficients were equally biased, and, consequently, both were listed. However, since Huynh's coefficients appeared to be slightly more stable, one might want to select his estimates.
Third, even though Φ̂ significantly overestimated its parametric value when the distribution was either skewed or normal, this coefficient is the only available corrected squared-error formula and was, therefore, recommended in every situation. Finally, although either k̂²(X,Tx) or Φ̂(λ) was recommended in each case, one must remember that these two coefficients are not really comparable because they do not estimate the same population value. Specifically, k̂²(X,Tx) measures the reliability associated with a particular test, while Φ̂(λ) indicates the reliability of any set of items randomly selected from a domain. Because of the latter fact, all tests which can possibly be constructed from an item domain must be classically parallel for the classic parallelism assumption to be valid. This situation is unlikely to occur, unless all items within the domain have equal p values. When this situation does occur, k²(X,Tx) and Φ(λ) are equal, and one can directly compare the accuracy of k̂²(X,Tx) and Φ̂(λ). However, since this study did not contain a domain of items having equal p values, Φ̂(λ) could not be computed in the classic parallelism condition. Therefore, k̂²(X,Tx) was consistently recommended as the appropriate uncorrected squared-error coefficient within the classic parallelism condition.

Finally, mention should also be made of two other methods of estimating reliability. Subkoviak and Wilcox (1978) and Livingston and Wingersky (1979) introduced mastery coefficients which measure the extent of agreement between the observed score and the estimated true score. The former coefficient uses a threshold loss function. Livingston and Wingersky's index reflects the size of the misclassification error but does not use a squared-error loss function. Since reliability is really concerned with accurately estimating an examinee's true score or true classification, these indices deserve considerable attention.

APPENDICES
[The appendix tables are illegible in the source scan; only their titles are recoverable. Each table reports the mean bias and standard deviation of one coefficient across samples, by test length, cut-off score, and distribution (skewed, J-shaped, bimodal, normal).]

APPENDIX A1-A3. Mean Bias and Standard Deviation of Livingston's k̂²(X,Tx) Across Samples of 25, 35, and 50 Examinees (Randomly Parallel and Classically Parallel Alternate Forms)

APPENDIX A4-A6. Mean Bias and Standard Deviation of Brennan and Kane's Φ̂(λ) Across Samples of 25, 35, and 50 Examinees (Randomly Parallel Alternate Forms)

APPENDIX A7-A9. Mean Bias and Standard Deviation of Brennan and Kane's Φ̂ Across Samples of 25, 35, and 50 Examinees (Randomly Parallel Alternate Forms)

APPENDIX A10-A12. [Titles only partly legible.] Across Samples of 25, 35, and 50 Examinees (Randomly Parallel and Classically Parallel Alternate Forms)

APPENDIX A13-A15. Mean Bias and Standard Deviation of Subkoviak's p̂o Across Samples of 25, 35, and 50 Examinees (Randomly Parallel and Classically Parallel Alternate Forms)

APPENDIX A16-A18. Mean Bias and Standard Deviation of Huynh's p̂o Across Samples of 25, 35, and 50 Examinees (Randomly Parallel and Classically Parallel Alternate Forms)

APPENDIX A19-A20. Mean Bias and Standard Deviation of Subkoviak's κ̂ Across Samples of 25 and 35 Examinees (Randomly Parallel and Classically Parallel Alternate Forms)
F\ : m00.I m00.I Fm0.I 0:N.I om0.I m>0.I >N0.I N00. 5% 0% ma\. mY- mmo\. fix «ms WSx m >mp.I >N0.I =0.I o>0.I mm0.I >m0.I FNO. N00. NE §\. m\ g\ \3.\. E s\.. E.» m e 0N I mmP.I moo I ON I «:0 I F00 I zmo I >0 us moo\. WWo.\ ”WV Waxy WV. $\. as 0. :0~.I 00P.I mm0.I 0N.I P00.I =~.I Nm0.I :N0.I >00 I Nam I P00 I P>m I M00 I saw I Mac I :ON I Hmanoz Haves ummmnm cmzmxm Hmsaoz Haves commnm cozmxm wnoom numcmq lam I0 IHm I0 nucluso umwh menom muwcnmuac Hoaamnmm haamoamnmflo wagon oumcnmua< Hwaamnmm zaaoccmm mmwcfiamxm om no mmaqemm mmono< m.m.xma>oxnsm no coHumH>wa vnmccmum cam mmfim cmmz FN< anzmmm< 166 167 mfi\ N8\ 9? :o\ mmp.u mmo.I :wo.u mmo.n s\o.\ We\ a? @v >m..n mwc.n Pco.u omF.I mfix E\ WV WY mP—.I 0:0.I mm0.I mmP.I Hasnoz Haves commsm umzmxm lam Ia meson oumcnmua< Hmaamnmm haawofimmmHo % PP .' 2% .m:o.- @V 8a.. Hasnoz mmxo\ as. ame PS: §\ Pm0.I autos IHm m00.I NNO\.\\ bN0.I g m~0.I 8mg.» -n wagon mumcnmua< Hmaamnmm maaoozmm 6030 X” mp @— :F mnoom uncluso momcaewxm cm 00 mmflasmm mmono< mmm.xmfi>oxnsm no cofiuma>un vgmccmum new mmwm :mmz A.o.ucoov FN< xHazmmm¢ om cameo; game g §\ §\ 0? o\ 2\ §\ \ a 53F.I POP. msp. NNP. 0m0. 25’. op. amp. % 2% 3% fix Ev ex.\ mm? mfi N, a Pm0.I Nm0.I 000. 000.I ~00. 0P0. Pmo. 0P0. e\\ 5\ @\ Q? .§. §\ @\ 5\ : mpo.l b0.I mN0.I 0P.I m0. m:0.I N00. 000. 3% Na? @ex 0% §\f 2.\ E E m. omo.l mwo. one. woo. mmo. omo. mmp. bar. Wax ma? E mY @.\ .m\o.\ .E..\ g m e Pop.I m0P.I 0—0.I p00.I 0:0. mN0.I mmo. 000. mg. % @Y ”VI- 6 3% E.\ .E N. mF0.I hmP.I =0.I Nmp.l who. z—F.I m0.I F0.I mmo. 0:N.I NF0.I mmP.I hop. m0.I =N0.I m00.I Hmanoz Hmuoa vwmmzn cwzmxm amaze: Haves vmmmnm omzmxm onoom sauce; Iwm I0 lam I0 MQOIUSU ummh mason mumcnwuad HmHHmme zaawofimmmao nsLom wumcnmua< Hwaamsmm hasoucmm mmmcaamxm mm no mwaasmm mmogo< mam.::>:m mo :oHpmH>ma cnmucmum cam mwfim 2mm: NN< xHazmmm< 168 . O C C P. O O C PN—.I 000. 00-. so. mm0.I mmp. map. mmF. wwxxxx. mwmxxx mmm\\\. mmm\\\ mwwx\\ mmm~;\ mwm\m\\ mmmxxx o_ om mN—.I Ppo.I poo. N00.I 0N0. 2N0. zmo :00. 
PF. . O C P. O o. . mso.l mmo.I mmo.I 00P.I po.I wmo.I moo.I bmo. Hasnoz Hmooa ummmnm Omzmxm Amanoz Hmcoa 00mmcm cozoxm mLoom numcmq lam In lam Id uncluzo name menom oumcgmua¢ menom mumcnmua< amaamnmm zaamonmmHo Hwaamgmm >Haoocmm mmmcfismxm mm ho moaasmm mmogo< m.m.::>=m ho coHumfi>mn onmucmum new mmHm cam: A.c.»=oov m~< anzmmmc 3% E\ 2\x 2.\ mfix E\. E\ fix. a mop.l NNF. mhp. P00. moo. mmp. Par. 33—. fix @xx Ex @\ xx? §\ fix Ex 2 m. OM0.I m0.I 000. N:~.I PP. P00. mmc. wNO. mmm\\x mmm\\\ WW©\\\ WWW\\\ WWW\\. mmm\\\ mmm\\: JWW\\\ FF QP0.I w®0.I NN0.I GMN.I omo. 0m0.l :00. :90. 2% a? fix Ex Ex f 3%. 8.x m 00.! mmo. NOD. 090. $30. ©MO. me. amp. fix I? fix Ex E\. 2% mm?» @x m e NNP.I :NP.I 3N0.I NPP.I 5:0. =30.I @mo. hop. @\- mmxx fix ox:\ ngxx fix a\ E\ I Nm0.I ©PN.I m30.I sz.I mmo. :MF.I 0N0.I NFC. m00.l PQN.I MN0.I ®2N.I :00. m=0.I Nm0.l «FP.I Hasnoz Haves ummmzm cmzmxm Hassoz Hmcoa vmmmsn cmzmxm msoom sumac; IHm I0 lam I0 thIQSD ummfi wagon oumcnmaa< HoHHmnmm zaamowmmmHo wagon mamcnmua< Hmaamnmm hasovcmm momcfiemxm mm no mmaqamm mmogo< m.m.::>=: mo cofiumw>ma onmucmum cam mmHm :moz mm< xHazmmm< 170 171 N=P.I m\o.x :P.I 25‘ 350.: Amanoz mm0.I :o\\ ooo.l Haves Iam mmuxxx wow. m.owx- moo. .o. m\\\\\ amo.- ummmnm Ia cs P90. pr\ bpp.l NP. \ >=P.I umzaxm wagon mumcnmua¢ Hmaamnmm haamofimmmHo mmocwemxm mm mo mmaasmm mmogo< «mm.s:>sm mo cofipmw>wo unaccwum new mmwm 0:0.I 3V pro. 23‘. :Po.u Hmsgoz Nmo.I Hmcoa IHm mmo. N00.I cmmmnm I0 vmzoxm megom mumcnmuac Hmaamgmm zfieoucmm A.u.ucoov mm< xHozmmmc 0P op :P mnoom quIuso cam: om sauce; ummh mmmvxx wmmvxx wwmvxx wmmvxx mmmvxx mmmhxx mmmvxx mmmvxx 2. mm..- mm_. For. moo. Foo. amp. MFP. mzp. 5% g mmo\. E m8\ 3% % exoxmvx NP 9 .mo.. zoo.- moo.u pm..- Poe. moo.- awe. pmo. mmhxxx mmmhxx mwmvxx wwwvxx mmmhxx Pmmwxx mmnnxx .mmvxx P. wPo.- :oF.- omo.- :mm.- mac. noo.x moo.- mpo. gs 3% 1% s\. 9% Ex g\. was I moo.u smo. No. mpo. .:o. omo. m~_. mo_. m8\. emo\. :% E .ao\. as Ex mex m 2 mm..- «a..- omo.- 5...- mmo. w:o.x mo. m... 
LIST OF REFERENCES

Algina, J., & Noe, M. J. A study of the accuracy of Subkoviak's single-administration estimate of the coefficient of agreement using two true-score estimates. Journal of Educational Measurement, 1978, 15, 101-110.

Anastasi, A. Psychological testing. New York: Macmillan, 1976.

Berk, R. A. Item analysis. In R. Berk (Ed.), Criterion-referenced testing: State of the art. Baltimore: The Johns Hopkins University Press, 1980.

Berk, R. A. A consumers' guide to criterion-referenced test reliability. Journal of Educational Measurement, 1980, 17, 323-349.

Block, J. H. Student learning and the setting of mastery performance standards. Educational Horizons, 1972, 50, 183-190.

Brennan, R. L. Psychometric methods for criterion-referenced tests. Unpublished manuscript, March 1979. (Available from Department of Education, SUNY at Stony Brook, Stony Brook, New York, 11790).

Brennan, R. L. KR-21 and lower limits of an index of dependability for mastery tests (ACT Technical Bulletin No.
27). Iowa City, Iowa: American College Testing Program, December 1977.

Brennan, R. L. Extensions of generalizability theory to domain-referenced testing (ACT Technical Bulletin No. 30). Iowa City, Iowa: American College Testing Program, June 1978.

Brennan, R. L. Some applications of generalizability theory to the dependability of domain-referenced tests (ACT Technical Bulletin No. 32). Iowa City, Iowa: American College Testing Program, April 1979.

Brennan, R. L., & Kane, M. T. An index of dependability for mastery tests. Journal of Educational Measurement, 1977, 14, 277-289. (a)

Brennan, R. L., & Kane, M. T. Signal/noise ratios for domain-referenced tests. Psychometrika, 1977, 42, 609-625. (b)

Brennan, R. L., & Lockwood, R. E. A comparison of two cutting score procedures using generalizability theory (ACT Technical Bulletin No. 33). Iowa City, Iowa: American College Testing Program, April 1979.

Buck, L. S. Use of criterion-referenced tests in personnel selection: A summary status report (Technical Memorandum 75-6). Washington, D.C.: United States Civil Service Commission, December 1975.

Carver, R. P. Special problems in measuring change with psychometric devices. In Evaluative Research: Strategies and Methods. Washington, D.C.: American Institutes for Research, 1970.

Cohen, J. A. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960, 20, 37-46.

Cronbach, L. J., Gleser, G. C., Nanda, R., & Rajaratnam, N. The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley, 1972.

Downing, S. M., & Mehrens, W. A. Six single-administration reliability coefficients for criterion-referenced tests: A comparative study. Paper presented at the annual meeting of the American Educational Research Association, Toronto, 1978.

Ebel, R. L. Criterion-referenced measurements: Limitations. School Review, 1971, 79, 282-288.

Eignor, D. R., & Hambleton, R. K.
Relationship of test length to criterion-referenced test reliability and validity (Report No. 86). Amherst: University of Massachusetts (School of Education), Laboratory of Psychometric and Evaluative Research, 1979.

Glaser, R. Instructional technology and the measurement of learning outcomes: Some questions. American Psychologist, 1963, 18, 519-521.

Glaser, R., & Klaus, D. J. Proficiency measurement: Assessing human performance. In R. Gagne (Ed.), Psychological principles in system development. New York: Holt, 1962.

Glaser, R., & Nitko, A. J. Measurement in learning and instruction. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971.

Glass, G. V. Standards and criteria. Journal of Educational Measurement, 1978, 15, 237-261.

Goldstein, I. L. Training program: Development and evaluation. Monterey, California: Brooks/Cole, 1974.

Goodman, L. A., & Kruskal, W. H. Measures of association for cross classifications. Journal of the American Statistical Association, 1954, 49, 732-764.

Gross, A. L., & Schulman, V. The applicability of the beta binomial model for criterion referenced testing. Journal of Educational Measurement, 1980, 17, 195-201.

Hambleton, R. K., & Eignor, D. R. Criterion-referenced test development and validation methods. Training program presented at the annual meeting of the American Educational Research Association, San Francisco, April 1979.

Hambleton, R. K., & Novick, M. R. Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 1973, 10, 159-170.

Harris, C. W. An interpretation of Livingston's reliability coefficient for criterion-referenced tests. Journal of Educational Measurement, 1972, 9, 27-29.

Harris, C. W. An index of efficiency for fixed-length mastery tests. Paper presented at the annual meeting of the American Educational Research Association, Chicago, April 1972.

Harris, C. W.
Note on the variances and covariances of three error types. Journal of Educational Measurement, 1973, 10, 49-50.

Harris, M. L., & Stewart, D. M. Application of classical strategies to criterion-referenced test construction. A paper presented at the annual meeting of the American Educational Research Association, New York, February 1971.

Huynh, H. On consistency of decisions in criterion-referenced testing. Journal of Educational Measurement, 1976, 13, 253-269.

Huynh, H. Reliability of multiple classifications. Psychometrika,

Huynh, H., & Saunders, J. C., III. Accuracy of two procedures for estimating reliability of mastery tests. Paper presented at the annual conference of the Eastern Educational Research Association, Kiawah Island, South Carolina, February 1979.

Ivens, S. H. An investigation of item analysis, reliability, and validity in relation to criterion-referenced tests. Unpublished doctoral dissertation, Florida State University, August 1970.

Kane, M. T., & Brennan, R. L. Agreement coefficients as indices of dependability for domain-referenced tests (ACT Technical Bulletin No. 28). Iowa City, Iowa: American College Testing Program, December 1977.

Keats, J. A., & Lord, F. M. A theoretical distribution for mental test scores. Psychometrika, 1962, 27, 59-72.

Klein, S. P., & Kosecoff, J. Issues and procedures in the development of criterion-referenced tests (ERIC/TM Report 26). Princeton: ERIC Clearinghouse on Tests, Measurement, and Evaluation, September 1973.

Koslowsky, M., & Bailit, H. A measure of reliability using qualitative data. Educational and Psychological Measurement, 1975, 35, 843-846.

Livingston, S. A. A reply to Harris' "An interpretation of Livingston's reliability coefficient for criterion-referenced tests". Journal of Educational Measurement, 1972, 9, 31. (a)

Livingston, S. A. Criterion-referenced applications of classical test theory. Journal of Educational Measurement, 1972, 9, 13-26. (b)

Livingston, S. A.
Reply to Shavelson, Block, and Ravitch's "Criterion-referenced testing: Comments on reliability". Journal of Educational Measurement, 1972, 9, 139-140. (c)

Livingston, S. A. A note on the interpretation of the criterion-referenced reliability coefficient. Journal of Educational Measurement, 1973, 10, 311.

Livingston, S. A., & Wingersky, M. S. Assessing the reliability of tests used to make pass/fail decisions. Journal of Educational Measurement, 1979, 16, 247-260.

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Massachusetts: Addison-Wesley, 1968.

Lovett, H. T. Criterion-referenced reliability estimated by ANOVA. Educational and Psychological Measurement, 1977, 37, 21-29.

Magnusson, D. Test theory. Reading, Massachusetts: Addison-Wesley, 1967.

Marshall, J. L. The mean split-half coefficient of agreement and its relation to other single administration test indices: A study based on simulated data (Technical Report No. 350). Madison: University of Wisconsin, Research and Development Center for Cognitive Learning, June 1976.

Marshall, J. L. Possible mathematical relationships of true and obtained scores and their implications for mastery testing. Paper presented at the annual meeting of the Midwest Educational Research Association, Bloomingdale, Illinois, 1978.

Marshall, J. L. Personal communication, 1980.

Marshall, J. L., & Haertel, E. H. A single-administration reliability index for criterion-referenced tests: The mean split-half coefficient of agreement. Paper presented at the annual meeting of the American Educational Research Association, Washington, D.C., March-April 1975.

Marshall, J. L., & Serlin, R. C. Characteristics of four mastery test reliability indices: Influence of distribution shape and cutting score. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, April 1979.

Mehrens, W. A., & Ebel, R. L.
Some comments on criterion-referenced and norm-referenced achievement tests. Measurement in Education, Winter 1979, 19, 1-8.

Michigan Department of Education. Technical Report: Michigan Educational Assessment Program. Lansing: Michigan Department of Education, Research, Evaluation and Assessment Services, June 1977.

Millman, J. Criterion-referenced measurement. In W. J. Popham (Ed.), Evaluation in education: Current applications. Berkeley, California: McCutchan, 1974.

Millman, J., & Popham, W. J. The issue of item and test variance for criterion-referenced tests: A clarification. Journal of Educational Measurement, 1974, 11, 137-138.

Novick, M. R., & Lewis, C. Prescribing test length for criterion-referenced measurement. In C. W. Harris, M. C. Alkin, & W. J. Popham (Eds.), Problems in criterion-referenced measurement. CSE monograph series in evaluation, No. 3. Los Angeles: Center for the Study of Education, University of California, 1974.

Peng, C.-Y. J. An investigation of Huynh's normal approximation procedure for estimating criterion-referenced reliability. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, April 1979.

Peng, C.-Y. J., & Subkoviak, M. J. A note on Huynh's normal approximation procedure for estimating criterion-referenced reliability. Journal of Educational Measurement, 1980, 17, 359-368.

Popham, W. J., & Husek, T. R. Implications of criterion-referenced measurement. Journal of Educational Measurement, 1969, 6, 1-9.

Schmitt, N., & Schmitt, K. Differences in reliability estimates for objective-referenced tests. Unpublished manuscript, 1977.

Shavelson, R. J., Block, J. R., & Ravitch, M. M. Criterion-referenced testing: Comments on reliability. Journal of Educational Measurement, 1972, 9, 133-137.

Subkoviak, M. J. Estimating reliability from a single administration of a criterion-referenced test. Journal of Educational

Subkoviak, M. J. Further comments on reliability for mastery tests.
Unpublished manuscript, University of Wisconsin, 1977.

Subkoviak, M. J. Empirical investigation of procedures for estimating reliability for mastery tests. Journal of Educational Measurement, 1978, 15, 111-116. (a)

Subkoviak, M. J. The reliability of mastery classification decisions. Paper presented at the first annual Johns Hopkins University National Symposium on Educational Research, Washington, D.C., 1978. (b)

Swaminathan, H., Hambleton, R. K., & Algina, J. Reliability of criterion-referenced tests: A decision theoretic formulation. Journal of Educational Measurement, 1974, 11,

Wardrop, J. L., Anderson, T. R., Hively, W., Hastings, C. M., Anderson, R. I., & Muller, K. E. A framework for analyzing the inference structure of educational achievement tests. Journal of Educational Measurement, 1982, 19, 1-18.

Woodson, M. I. C. E. The issue of item and test variance for criterion-referenced tests. Journal of Educational Measurement, 1974, 11, 63-64. (a)

Woodson, M. I. C. E. The issue of item and test variance for criterion-referenced tests: A reply. Journal of Educational Measurement, 1974, 11, 139-140. (b)