AN EVALUATION OF THE SEQUENTIAL METHOD OF PSYCHOLOGICAL TESTING

Thesis for the Degree of Ed. D.
MICHIGAN STATE UNIVERSITY
John James Paterson
1962

This is to certify that the thesis entitled AN EVALUATION OF THE SEQUENTIAL METHOD OF PSYCHOLOGICAL TESTING presented by John James Paterson has been accepted towards fulfillment of the requirements for [illegible].

Major professor
Date: June 15, 1962

ABSTRACT

AN EVALUATION OF THE SEQUENTIAL METHOD OF PSYCHOLOGICAL TESTING

by John J. Paterson

In the sequential method of psychological testing the examinees are directed to subsequent items on the basis of their responses to prior items. No examinee responds to all the items of a sequential test, and any given examinee might complete the test by responding to any of several combinations of items. Scores on the sequential test reflect the difficulty of items correctly answered, not the number correct.

The evaluation did not involve an actual population of individuals, but used probability models and hypothetical populations. The probability of passing a given item in a test was calculated from the ability level of the individual, the difficulty of the item, and the precision of the item. (Precision may be computed from the item-total biserial correlation.) The probability of passing a sequence of items was determined for each of fifteen ability categories by multiplying together the probabilities of passing or failing a sequence of six items. Sixty-four different sequences were calculated for each ability category.
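The calculation just described may be sketched in a few lines of Python. The sketch is illustrative only and is not the procedure actually used in the study. It assumes a normal-ogive item characteristic curve and the usual normal-ogive conversion from a biserial correlation r to a precision sigma = sqrt(1 - r^2) / r. All names (`p_pass`, `pattern_probability`, `difficulty_of`, and so on) are hypothetical.

```python
from itertools import product
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution: mean 0, SD 1


def precision_from_biserial(r):
    """Assumed normal-ogive relation between a biserial r and precision sigma."""
    return sqrt(1.0 - r * r) / r


def p_pass(ability, difficulty, sigma):
    """Assumed normal-ogive probability of passing a single item."""
    return STD_NORMAL.cdf((ability - difficulty) / sigma)


def pattern_probability(ability, pattern, difficulty_of, sigma):
    """Multiply the pass/fail probabilities along one sequence of items.

    `pattern` is a tuple of booleans (True = pass). `difficulty_of(prefix)` is a
    hypothetical routing function returning the difficulty of the item that is
    met after the responses in `prefix`.
    """
    prob = 1.0
    for stage, passed in enumerate(pattern):
        p = p_pass(ability, difficulty_of(pattern[:stage]), sigma)
        prob *= p if passed else 1.0 - p
    return prob


def all_pattern_probabilities(ability, difficulty_of, sigma, n_items=6):
    """Probabilities of all 2**n_items pass/fail sequences (64 for six items)."""
    return {
        pattern: pattern_probability(ability, pattern, difficulty_of, sigma)
        for pattern in product((True, False), repeat=n_items)
    }


if __name__ == "__main__":
    sigma = precision_from_biserial(0.75)
    # Hypothetical routing: every item at the 50 per cent level (difficulty 0.00).
    probs = all_pattern_probabilities(0.5, lambda prefix: 0.0, sigma)
    print(len(probs), sum(probs.values()))
```

Because every examinee must follow exactly one of the sixty-four sequences, the probabilities for any one ability level sum to one, which gives a simple check on such a calculation.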
The problem involved was the comparison of the sequential model with the traditional cumulative model (in which all items were at the 50 per cent level of difficulty) to determine how well individuals at different ability levels were classified by the tests. The parameters of the sequential test (difficulty and precision) and the effects of errors in estimating these parameters were examined in relation to the resulting classification of individuals.

One sequential test model was constructed with an item-total biserial correlation of .75 and item difficulties such that the sum of the squared deviations of the individual's ability level from the mean ability level of the group into which the individual was classified would be a minimum. Even though individuals in each ability category were kept separate from individuals in other categories, individuals in different categories took the same difficulty item if the calculated difficulties were less than .20 standard deviation units apart. A rectangular distribution of ability was assumed in these calculations.

Both normal and U-shaped distributions of ability were used as input for the above sequential and cumulative test models to determine how well the results classified individuals of different ability levels. It was concluded that, regardless of the distribution of ability used as input, the individuals in the extreme ability categories had significantly less variance of scores in the sequential test. At middle ability levels the sequential test did have slightly lower variance of test scores than the cumulative. For the top scores the sequential test had less variance of ability level than the cumulative.

The second and fifth items in the sequential test were each separately changed in difficulty and precision. The resulting number of people at each score, the mean ability level of individuals at each score, the variance of scores for top and middle ability level individuals, and the variance of ability level scores for the top and middle scoring individuals were all insignificantly changed. The sequential test was not sensitive to errors in estimating the precision and difficulty of the items.

When precision of items in the sequential tests was varied, tests consisting of higher precision items (with difficulties appropriate for that precision level) had less variance of scores for ability level categories and less variance of ability level categories for top and middle scoring individuals.

It was concluded that more difficult items are needed to distinguish among more able students; less difficult items among the less able. If extreme scores having low variance of ability level are desired, the item difficulties should be regressed toward the mean from those difficulties which give the best discrimination between individuals of similar ability level.

AN EVALUATION OF THE SEQUENTIAL METHOD OF PSYCHOLOGICAL TESTING

A THESIS
Submitted to Michigan State University in partial fulfillment of the requirements for the degree [remainder illegible]

College of Education

ACKNOWLEDGMENTS

The writer wishes to express his appreciation for the guidance given by Dr. David R. Krathwohl in the preparation of this thesis and to the Bureau of Educational Research for arranging the time necessary for completion of the research.

TABLE OF CONTENTS

CHAPTER
I. DESCRIPTION OF THE PROBLEM
    Description of the Sequential Test Model
    Starting Point
    Stopping Point
    Scoring
    Pattern of Items
    Directions to Testee
    A Diagram of a Sequential Test Used in This Study
    Need for Test Improvement
    Maximally Efficient Use of the Items Selected
    Control of the Score Distribution
    Meaning of a Score
    Rationale for the Sequential Item Model
    Maximally Efficient Use of Items
    Control of the Score Distribution
    Meaning of a Score
    Selection of the Sequential Procedure
    Hypotheses
    Effect of the Type of Ability Distribution
    Effect of Item Precision and Difficulty for the Sequential Test
    Effect of Errors in Estimating Parameters
    Limitations of the Study
    Best Cumulative Test
    Distribution of Scores
    Ability Distributions
    Test Parameters
    Test Construction Procedures
    Test Presentation Procedures and Effects
    Overview of the Remainder of the Dissertation
II. REVIEW OF LITERATURE
    Maximally Efficient Use of Items Selected
    Control of the Score Distribution
    Meaning and Use of Score Produced
    Sequential Testing Procedures
III. PROCEDURES
    Test Model Construction
    Effect of Shape of Distribution of Ability
    Effect of Normal Distribution
    Effect of U-Shaped Distribution
    Effect of Ability Distributions for Additional Sequential Tests
    Item Precision and Difficulty for the Sequential Test
    Errors in Sequential Test Parameter Estimates
    General Comparisons
    Summary of Procedures and Hypotheses
IV. ANALYSIS AND RESULTS
    Sequential Test Construction
    First Item Decision
    Second Item Decision
    Third Item Decision
    Fourth Item Decision
    Fifth Item Decision
    Sixth Item Decision
    Input Distribution Effects
    Results from the Normal Distribution
    Results from the U-Shaped Distribution
    Item Precision and Difficulty for the Sequential Test
    Variance of Scores
    Variance of Ability Levels
    Errors in the Sequential Test Parameter Estimates
    Errors in Estimating Difficulty
    Errors in Estimating Precision
    General Comparisons
V. CONCLUSIONS
    Sequential Testing and Testing Problems
    Efficiency of Items
    Control of the Score Distribution
    Meaning of a Score
    Sequential Testing Hypotheses
    Effect of Ability Distribution
    Effect of Precision and Difficulty
    Effect of Error in Parameters
VI. SUMMARY AND RECOMMENDATIONS
    Summary
    Recommendations
BIBLIOGRAPHY
APPENDIX A
APPENDIX B

LIST OF TABLES

TABLE
1. Analysis of Means and Variances of Normalized Scores for Category 8 Individuals When Normal Distribution of Ability is Input into Sequential and Cumulative Test Models
2. Analysis of Means and Variances of Normalized Scores for Category 14 and 15 Individuals When Normal Distribution of Ability is Input into Sequential and Cumulative Test Models
3. Analysis of Means and Variances of Ability Level Scores for the Top 8.4 Per Cent of the Score Distribution When Normal Distribution of Ability is Input into Sequential and Cumulative Tests
4. Differences Between Normalized "T" Scores for Adjacent Top Ability Levels for Normal and U-Shaped Input
5. Differences Between Ability Level Scores for Adjacent Top Scores for Cumulative Test Model for Normal and U-Shaped Input
6. Differences Between Ability Level Scores for Adjacent Top Scores for Sequential Test Model for Normal and U-Shaped Input
7. Analysis of Means and Variances of Normalized Scores for Category 13 Individuals When a U-Shaped Distribution of Ability is Input into Sequential and Cumulative Test Models
8. Analysis of Means and Variances of Normalized Scores for Category 15 Individuals When a U-Shaped Distribution of Ability is Input into Sequential and Cumulative Test Models
9. Analysis of Means and Variances of Ability Level Scores for the Top 13.5 Per Cent of the Score Distribution When a U-Shaped Distribution of Ability is Input into Sequential and Cumulative Tests
10. Analysis of the Variance of Scores for Individuals at Specified Ability Levels for Five Tests of Different Precision
11. Analysis of the Variance of Ability Level Scores for Individuals at Specified Score Levels for Five Tests of Different Precision
12. The Means and Variances of Rank Scores Assigned to Each Ability Level by Five Tests of Different Precision
13. The Discrimination Indices Between Adjacent Ability Levels for the Input of a Normal Distribution of Ability into Tests of Different Precision
14. Distribution of Individuals by Two Tests--One Test With Second Item Difficulties Farther from 50 Per Cent Level Than the "Error Free" Test
15. Distribution of Individuals by Two Tests--One Test With Second Item Difficulties Nearer to 50 Per Cent Level Than the "Error Free" Test
16. Analysis of the Variance of Ability Level Scores for Individuals at Specified Score Levels for One "Error Free" Test and Two "Error in Difficulties of Fifth Items" Tests
17. Analysis of the Variance of Rank Scores for Individuals at Specified Ability Levels for One "Error Free" Test and Two "Error in Difficulties of Fifth Items" Tests
18. Distribution and Mean Ability Level Scores for Top Score Values for Three Tests With Different Difficulties and Normal Distribution of Ability Input
19. Distribution and Mean Ability Level Scores for Top Score Values for Three Tests With Different Difficulties and Rectangular Distribution of Ability Input
20. Distribution and Mean Ability Level Scores for Top Score Values for Three Tests With Different Difficulties and U-Shaped Distribution of Ability Input
21. Distribution and Mean Ability Level Scores for Top Scores of Tests With Different Levels of Precision and With an Input of Normal Distribution of Ability
22. Distribution and Mean Ability Level Scores for Top Scores of Tests With Different Patterns of Items Encountered and With an Input of a Normal Distribution of Ability
23. Per Cent Passing Items of the Different Sequential Tests Constructed
24. Mean Normalized "T" Scores for Each Ability Level for Cumulative and "Least Squares" Sequential Tests
25. Distribution and Mean Ability Level Scores for Cumulative Test With the Input of Different Distributions of Ability
26. Distribution and Mean Ability Levels for Top Scores on "Least Squares" Sequential Test With the Input of Different Distributions of Ability
27. Distribution and Mean Ability Levels for Top Scores With Difficulties of Certain Items Changed in a Sequential Test With an Input of Normal Distribution of Ability
28. Distribution and Mean Ability Levels for Top Scores With Precision of Items Changed in a Sequential Test With an Input of Normal Distribution of Ability

LIST OF FIGURES

FIGURE
1. Graphic Representation of the "Least Squares" Sequential Test Model
2. Three Distributions of Ability
3. Mean Ability Level of Groups Separated Out by Sequential Test and Difficulties of Items Used

CHAPTER I

DESCRIPTION OF THE PROBLEM

[...] In tests based upon the sequential model, examinees are directed to subsequent items on the basis of their responses to prior ones. While no examinee responds to all of the items of a sequential test, any given examinee might complete the test by responding to any of a number of combinations of items. Scores on sequential tests are based upon the nature of items to which correct responses are given and not merely on the number of correct responses.

The basic problem in this study is the comparison of a sequential model with the cumulative test model. In the sequential model the individual who passes an item is directed to a more difficult item; if he fails an item he is directed to an easier item. If the item is very precise, the individual who passes it is given a much more difficult item; and if the item is not very precise, the individual is given an item closer to the difficulty level of the item just answered. The opposite is true for failing an item. The score is directly related to the difficulty of the item to which the individual is directed at the final stage of testing.

In addition to the comparison of testing methods, the sequential test is examined for its strengths and weaknesses. Methods of improving the sequential model are suggested from the results so that even if the present procedure is not better than the cumulative test, future sequential procedures may be improved.

The present evaluation of the sequential method of psychological testing consists of (1) a description of the features of the sequential method as compared with the usual cumulative test; (2) a description of some of the problems encountered in the use of the cumulative test and how these problems are handled by the sequential model; (3) a rationale for the sequential solution; and (4) the formulation of hypotheses as to the behavior of the cumulative and sequential test models in regard to specific problems. Following the hypotheses are (5) the limitations of the study and (6) an overview of the remainder of the dissertation. To aid the reader a few of the more frequently used terms in this dissertation are explained in Appendix B.

I.
DESCRIPTION OF THE SEQUENTIAL TEST MODEL

In any testing situation certain decisions must be made: (1) the individual must be told where to start, (2) the decision must be made when to stop testing, (3) the final score must be determined, (4) the characteristics of each succeeding item must be stipulated, and (5) the testee must be informed as to where and how he should proceed. In the cumulative test the character of these decisions is obvious. Because they are unusual in the sequential item test, these decision points will be described in some detail.

Starting Point

Depending on the purpose of the test and what one therefore wishes to emphasize, the starting point may be at any level of difficulty. For instance, one may start with an easy item that most individuals will be able to pass and with which the individual would feel comfortable, or one may start with an item at the middle of the score distribution with no consideration as to the individuals who may be taking the test. The sequential test model developed in this paper has the individual take as his first item one that would be considered at the fifty per cent level of difficulty for the group of which he is a part. The reason for this choice is explained in Chapter III, Section 1. The present discussion must, of necessity, ignore the psychological effects which need to be empirically determined.

Stopping Point

Criteria for deciding when to stop are also determined by the purpose of the test.
If doing the best job possible in the time allowed is paramount, then everyone is given the same number of items, knowing that the extremes will be better classified than the middle ability levels. (Note that the criterion measure need not be a measure of ability but could be an attitude or interest. However, in this dissertation the criterion will be referred to as an "ability.") If time is flexible and there is a prescribed degree of accuracy for each score, then fewer items are used for the extremes and more items for the middle ability levels. If the rapid classification of extreme ability level individuals is desired, then one may stop testing when it can be determined that the individual is probably not at some middle ability level. In the sequential model in this paper all people will take six items.

Scoring

Reasons for choosing one system of scoring over another depend upon whether the score is to discriminate one ability group from another, to discriminate among the individuals in a group, or to describe the response pattern of the individual. If one wishes to discriminate one ability group from another, one would probably assign a score reflecting the difficulty of the final item. If one wishes to discriminate among individuals, then the score may represent the number of people in, for example, one hundred that the individual would rank above in the population. If the score is to represent a response pattern, it may be an estimate of the number of items the testee could have answered correctly if he does answer an item of a given difficulty, or it may identify the precise pattern of correctly and incorrectly answered items. The sequential test model in this paper assigns the individual a score which is the difficulty of the item to which he is directed at the final stage of testing.
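The starting, stopping, and scoring decisions adopted for this model may be illustrated by a minimal Python sketch. The step sizes are hypothetical and chosen only for illustration; they are not the item difficulties used in this study. The names `STEPS`, `route`, and `distinct_scores` are likewise assumptions.

```python
from itertools import product

# Hypothetical changes in difficulty (standard score units) applied after each
# of the six items. These are illustrative values only, not those of the study.
STEPS = [0.80, 0.40, 0.20, 0.10, 0.05, 0.025]


def route(responses, start=0.0, steps=STEPS):
    """Follow one examinee through the test.

    `responses` is a sequence of booleans (True = pass). The first item is at
    `start` (0.00, the fifty per cent level). A pass leads to a more difficult
    item and a failure to an easier one. The score is the difficulty of the
    item to which the examinee would be directed after the last response.
    Returns (difficulties of the items taken, score).
    """
    difficulty = start
    taken = []
    for passed, step in zip(responses, steps):
        taken.append(difficulty)
        difficulty += step if passed else -step
    return taken, difficulty


def distinct_scores(n_items=6):
    """Set of final scores reachable over all pass/fail patterns."""
    return {
        round(route(pattern)[1], 6)
        for pattern in product((True, False), repeat=n_items)
    }


if __name__ == "__main__":
    items, score = route((True, True, False, True, False, False))
    print(items, score)
    print(len(distinct_scores()))
```

With these hypothetical halving steps every response pattern ends at a different score. A model in which several routes converge on the same item would produce fewer distinct scores.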
Pattern of Items

The problem in the sequential test is to select that sequence of items which will yield the information needed to assign the individual a score. At any step in the test the decision as to the succeeding item to be taken may depend upon (1) the number of preceding items one has answered correctly, (2) the pattern of preceding items, or (3) the difficulty and precision of the immediately preceding item. This sequential model uses the difficulty and precision of all preceding items to determine the next item.

Difficulty of the item for this model is measured in terms of standard score units for a theoretically normal group. An item that fifty per cent of the theoretical group would pass is designated as "0.00." The precision of the item is essentially a measure of the validity of the item. The measure of precision, $\sigma$, may be defined as the standard deviation of the item characteristic curve. (It is also related to the measure of precision "h" used in psychophysics: $h = 1/(\sqrt{2}\,\sigma)$; and, as Lord indicates, $\sigma$ is identical with his "$b_i$".)1

1 Frederic M. Lord, A Theory of Test Scores, Psychometric Monograph No. 7 (Chicago: University of Chicago Press, 1952), p. 7.

Directions to Testee

The testee may be told how well he performed on any given item, may be told what is right or wrong with his performance, or may be simply directed to another item. Any combination of the above may be used at different stages in the test. Individuals may be directed to items which are taken by those who perform differently, or they may be directed to an item unique to their pattern of response. Pattern of response may be determined from correctness or incorrectness only, or each alternative to any item may designate a different sequence. In this sequential test, pattern was determined from only correctness or incorrectness of items, and more than one possible sequence of responses could lead to the same item.
Many methods of giving the necessary information to the testee are available. In the empirical tests that have been built by Krathwohl and Paterson, the succeeding item that the individual should attempt is disclosed to the individual when he erases the opaque covering under the letter that has been selected as the answer to the question at hand. The final erasure discloses a letter used to indicate a score rather than the number of the next item.2 The testee must answer each item as he comes to it, as he receives no directions if he does not answer. Other response techniques which could be used are tabs, envelopes within envelopes, sliding masks, and scrambled books.

2 Unpublished material developed in the Bureau of Educational Research, Michigan State University, East Lansing, Michigan, 1956-1959.

A Diagram of a Sequential Test Used in This Study

Figure 1 is a diagram of one of the sequential tests used in this study. It is the one constructed by the "least squares" method which is described later. The pattern shown is only one of many possible sequential patterns.

[Figure 1. Graphic representation of the "least squares" sequential test model. Ordinate: difficulty of items (in standard scores). Abscissa: stages of the test, from the 1st item through the 6th item, followed by the score.]

Difficulty of items.--Items are represented by circles, the ordinate position of which represents the difficulty of the item. The closer the item is to the top of the page, the more difficult it is. Difficulty is expressed in standard score units, i.e., an item that fifty per cent of the normative group would answer correctly is labelled "0.00". An item that 84 per cent of the normative group would answer correctly is labelled "-1.00".
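The correspondence between per cent passing and difficulty in standard score units can be checked numerically. The short Python sketch below is illustrative, assumes a normally distributed normative group as stated above, and uses hypothetical function names.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # normative group assumed normal: mean 0, SD 1


def difficulty_from_percent_passing(percent):
    """Difficulty, in standard score units, of an item passed by `percent` of the group."""
    return -STD_NORMAL.inv_cdf(percent / 100.0)


def percent_passing_from_difficulty(difficulty):
    """Per cent of the normative group expected to pass an item of this difficulty."""
    return 100.0 * STD_NORMAL.cdf(-difficulty)


if __name__ == "__main__":
    for percent in (16, 50, 84):
        print(percent, round(difficulty_from_percent_passing(percent), 2))
```

An item passed by fifty per cent of the group falls at 0.00, and an item passed by 84 per cent falls very close to -1.00, so the easier the item, the more negative its difficulty value.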
Sequence of items.--The sequence is represented by the abscissa value for the item. The first item of the test is at the left-hand side; the sixth item at the right of the diagram. The individual confronts one item at each "stage" of the test.

Size of step.--The size of the step, or the increase or decrease in difficulty from the item at one stage to the item at the next stage, is represented by the difference in ordinate positions of the items, as can be seen in Figure 1. There would be a large increase in the difficulty of the second item if one were to correctly answer the first item. There would be less difference between the easiest item at stage four and the easiest item at stage five.

Route taken.--Lines slanting upward designate that those who are considered to have passed an item at the preceding stage should proceed to a more difficult item for the next stage. Lines slanting downward designate that the individuals are considered to have failed the item at the previous stage and should proceed to a less difficult item at the next stage. It may occur that passing a less difficult item will lead the individual to a more difficult item for the next stage than he would have encountered by failing a more difficult item. In this case the lines between items will cross. (This case is not illustrated in Figure 1.) The other alternative not yet mentioned is that individuals passing a less difficult item or failing a more difficult item may be led to the same difficulty of item at the succeeding stage.

II. NEED FOR TEST IMPROVEMENT

In order to lay the background as to why the sequential test is worth considering, one should examine what problems have been encountered in the use of the cumulative test.
Present test procedures seem to have encountered three important problems related to: (1) utilization of items to operate most efficiently with the group taking the test, (2) controlling the score distribution to arrive at a useful scale, and (3) production of a score with a precise meaning.

Maximally Efficient Use of the Items Selected

Once one has decided upon a purpose, then one can solve the problem of the most efficient selection of items either completely empirically, or theoretically in terms of the effect of varying certain item characteristics. The approach in this paper is the theoretical one. If one uses this theoretical approach, one of the problems is that of utilizing the most precise items available in a pool. The cumulative test cannot always use all of the more precise items.

In the cumulative test, if the score is the number of correct responses and if all of the items are of equal difficulty, then a test with less precise items would give a better measure of the scale of ability than a test with more precise items.3 The above phenomenon has been called the "attenuation paradox." Violation of any one or a combination of the following assumptions has been given as an explanation for the attenuation paradox: (1) scores are normally distributed, (2) ability is normally distributed, (3) the regression of scores on ability is linear, (4) measurement produces an interval scale of ability, and (5) response distribution is homoscedastic. There is evidence to support the contention that violation of any one of these could be the reason for the lack of a monotonic relationship between item reliability (precision) and the validity of scores in the usual testing situation with the cumulative test.
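The attenuation paradox can be illustrated numerically. The Python sketch below is an illustration under stated assumptions and is not a calculation taken from this study. It assumes a standard normal distribution of ability, normal-ogive items of equal difficulty and equal precision, independence of responses for a given ability level, and a score equal to the number of correct responses. The function name and the grid settings are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def score_ability_correlation(sigma, n_items=6, difficulty=0.0, step=0.01, limit=6.0):
    """Correlation between number-correct score and ability for equivalent items.

    `sigma` is the item precision (a smaller sigma means a more precise item).
    Ability is integrated numerically over a grid from -limit to +limit.
    """
    n_points = int(round(2 * limit / step)) + 1
    weight_sum = mean_score = mean_square = covariance = within = 0.0
    for k in range(n_points):
        theta = -limit + k * step
        w = STD_NORMAL.pdf(theta)
        p = STD_NORMAL.cdf((theta - difficulty) / sigma)
        expected = n_items * p                  # expected score at this ability
        weight_sum += w
        mean_score += w * expected
        mean_square += w * expected * expected
        covariance += w * theta * expected      # mean ability is zero
        within += w * n_items * p * (1.0 - p)   # binomial variance at this ability
    mean_score /= weight_sum
    between = mean_square / weight_sum - mean_score ** 2
    score_variance = between + within / weight_sum
    # The variance of ability is taken as 1, its value for a standard normal.
    return (covariance / weight_sum) / sqrt(score_variance)


if __name__ == "__main__":
    for sigma in (0.05, 0.3, 0.9):
        print(sigma, round(score_ability_correlation(sigma), 3))
```

Under these assumptions the six very precise items (sigma of 0.05) yield a lower correlation between score and ability than the six less precise items (sigma of 0.9), because the very precise items all divide the group at the same point and the score approaches a simple dichotomy.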
One method of using the most precise items and increasing test validity is to use a spread of item difficulties as suggested by Brogden.4 However, this does not seem to be a completely satisfactory solution because (1) there is no scheme to determine the appropriate spread and (2) the most extreme difficulties cannot be efficiently used any time the majority of the individuals taking the item guess at the answer.5 There should be some procedure which would allow use of precise items no matter what their difficulty level.

If items are to be efficiently used in the discrimination of a group into two parts, the items should be at the 50 per cent level of difficulty for the hypothetical group the median ability level of which is at the point where the discrimination is desired.6 This means that if discriminations are desired among a few high ability individuals then difficult items should be used. The usual cumulative test cannot efficiently use such items.

3 Ledyard R. Tucker, "Maximum Validity of a Test with Equivalent Items," Psychometrika, 11:1-14, March, 1946.

4 Hubert E. Brogden, "Variation in Test Validity with Variation in the Distribution of Item Difficulties, Number of Items, and Degree of their Intercorrelation," Psychometrika, 11:197-214, December, 1946.

5 Paul E. Meehl and Albert Rosen, "Antecedent Probability and the Efficiency of Psychometric Signs, Patterns, or Cutting Scores," Psychological Bulletin, 52:194-216, May, 1955.

6 Brogden, op. cit.; Lee J. Cronbach and Willard G. Warrington, "Efficiency of Multiple-Choice Tests as a Function of Spread of Item Difficulties," Psychometrika, 17:127-147, June, 1952; Frederick B. Davis, "The Selection of Test Items According to Difficulty Level," American Psychologist, 4:243, July, 1949; Harold Gulliksen, "The Relation of Item Difficulty and Inter-item Correlation to Test Variance and Reliability," Psychometrika, 10:79-91, June, 1945; Lloyd G. Humphreys, "The Normal Curve and the Attenuation Paradox in Test Theory," Psychological Bulletin, 53:472-476, November, 1956; D. N. Lawley, "On Problems Connected with Item Selection and Test Construction," Proceedings of the Royal Society of Edinburgh, 61 (Section A, Part III):273-287, 1942-1943; Jane Loevinger, "The Attenuation Paradox in Test Theory," Psychological Bulletin, 51:493-504, September, 1954; Frederic M. Lord, "Some Perspectives on 'The Attenuation Paradox in Test Theory'," Psychological Bulletin, 52:505-510, November, 1955; Frederic M. Lord, A Theory of Test Scores; M. W. Richardson, "The Relation Between the Difficulty and the Differential Validity of a Test," Psychometrika, 1:33-49, June, 1936; Thelma G. Thurstone, "The Difficulty of a Test and Its Diagnostic Value," Journal of Educational Psychology, 23:335-343, May, 1932; Ledyard R. Tucker, op. cit.; and David A. Walker, "Answer-Pattern and Score-Scatter in Tests and Examinations," British Journal of Psychology, [volume and pages illegible], January, 1940.

Control of the Score Distribution

The problem of score distribution is not only to assign a certain number of individuals to a given score, but to assign only like individuals to that score. The particular type of distribution which is desired depends upon the purpose for which the test is designed. A normal distribution is assumed in most statistical computations and interpretations. A rectangular distribution would give the best set of rankings in that people are spread evenly over all the scores. A bimodal distribution may be desired to classify individuals into accept or reject categories. Other than differences in the use of scores, factors which influence the score distribution are the distribution of ability levels of those taking the test, the item precision, and the difficulty of the items. A test able to produce any type of score distribution desired, irrespective of the distribution of ability level of those taking the test and irrespective of the precision or difficulty of items available, would have considerable utility.

Meaning of a Score

The problem in assigning a meaning to a score is that the conventional cumulative score is typically a conglomeration which may represent the ability level of the individual, the rank of the individual, the pattern of response, or any combination of these. It is not possible to clearly represent the ability level of the individual with the usual cumulative test. While it is possible to just rank individuals or to just indicate the pattern of response with the cumulative test, this is not usually done. (In indicating the pattern of response the score is assigned to the sequence of items passed, not to the number of items passed.) It may be useful to examine each of these possible elements in turn.

The ability level of the individual cannot be determined by knowing that he passed a difficult item in a cumulative test, because all people must take each item and difficult items are often passed by chance as the majority of the group must guess at these items. This clouds any interpretation of the number of correctly answered items as a measure of performance. To get a better measure of the ability level of the individual from the score, White and Saltz have argued that the items should be scaled as to difficulty so that one knows which set of items a person has answered correctly if he knows the total number answered correctly.7 The usual cumulative test score does not permit one to infer which items the individual has passed. The score in the type of test suggested by White and Saltz would probably be used to represent the level of subject matter learned rather than how the individual ranked with others.
In addition to the infrequent use of the above solution, the suggestion does not solve the problem of the majority of individuals guessing the answer to difficult items.

To rank individuals in a normal distribution of ability so they are spread evenly throughout the score range, the test must make finer discriminations of ability at the middle ability range than it does for the extremes. Thus the test designed to rank individuals does not have a score scale which has the same relationship to the ability scale at the middle as at the extremes. Rarely is this relationship of scores to ability level reported. The cumulative test often compromises between using scores which rank individuals best and scores which tend to be normally distributed (as assumed in many statistical computations). The cumulative test may do either of the above alternatives well, but the decision made should be explicit and communicated to the test user. The decision should be to use the test score which permits one to infer rank (if this is what is desired), not to contaminate the meaning of a score by forcing the scores into a distribution just to create a higher correlation coefficient with normally distributed measures.

⁷ Benjamin W. White and Eli Saltz, "Measurement of Reproducibility," Psychological Bulletin, 54:81-99, March, 1957.

Another use of a score is to indicate the pattern of response. Cronbach has concluded that one should be as concerned with heterogeneity in content as in difficulty. Since the "level of difficulty" meaning for a score has been discussed above, the "heterogeneity with respect to content" meaning is considered here. For example, one bit of information is given when an individual is placed above the mean in pitch discrimination. With another set of items, the individual might be placed relative to the mean in visual acuity. The two items (with heterogeneity with respect to content) together place him in one of four categories. (If the second item had been a further measure of pitch, then he would have been placed in one of three categories with respect to pitch.) The use of items with heterogeneity in respect to content thus seems useful, but one must remember that to recover all four categories the test cannot be scored by the number correct.

Too often the items in cumulative tests are heterogeneous with respect to content and the number correct is used for the score. This cumulative scoring procedure permits the precise meaning of a score from a test with perfectly precise items to be inferred only when the individual possesses all of the characteristics above the specified levels or possesses none of the characteristics at or above the specified level.
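The four-versus-three category argument above can be made concrete with a small enumeration. The sketch below is an illustration only and is not part of the original study: it assumes two perfectly precise dichotomous items (labelled here, hypothetically, "pitch" and "acuity") and compares scoring by response pattern with scoring by number correct.

```python
from itertools import product

# Hypothetical labels for the two content areas used in the text's example.
ITEMS = ("pitch", "acuity")


def response_patterns(n_items):
    """Return every pass (1) / fail (0) pattern for n dichotomous items."""
    return list(product((0, 1), repeat=n_items))


def group_by_number_correct(patterns):
    """Collapse patterns into number-correct scores, as cumulative scoring does."""
    groups = {}
    for pattern in patterns:
        groups.setdefault(sum(pattern), []).append(pattern)
    return groups


patterns = response_patterns(len(ITEMS))
by_score = group_by_number_correct(patterns)

# Show which response patterns share each number-correct score.
for score, members in sorted(by_score.items()):
    print(score, members)
```

Under these assumptions a number-correct score of 1 covers two different patterns, so only the scores 0 and 2 identify the pattern of characteristics uniquely.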
These cumulative scores are even more difficult to interpret when the items are not perfectly precise. Rarely is any method of scoring other than the number correct used, and, if the level of ability in any characteristic is desired in conjunction with the pattern of characteristics, the problems discussed above for reflection of ability are added to lack of knowledge about which characteristics the individual possesses.

III. RATIONALE FOR THE SEQUENTIAL ITEM MODEL

The sequential item model is now examined to show why this model is expected to (1) give maximally efficient use of items, (2) control the score distribution, and (3) yield a score with a precise meaning. In addition, the rationale for using one of the several sequential procedures is presented.

Maximally Efficient Use of Items

The sequential test is expected to make optimal use of all items, irrespective of difficulty, because this test model provides that each item be at the fifty per cent level of difficulty for the group taking the item. At each succeeding stage in testing the original group is divided into progressively more homogeneous ability groups and the difficulties of items are matched to the average abilities of each group taking the item. Thus the easiest items are taken by the lowest ability groups and the hardest items by the highest ability ones.
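The staged division of the group described above can be sketched as a simple branching rule. This is a deliberately idealized illustration, not the thesis's procedure: it assumes error-free items, a rectangular ability distribution on an arbitrary scale from -3 to +3, and a difficulty placed at the midpoint of the ability range still consistent with the examinee's earlier responses. The function name `route` and all numeric values are hypothetical.

```python
def route(ability, lo=-3.0, hi=3.0, n_items=6):
    """Follow one examinee through an idealized six-stage sequential test.

    Each item sits at the midpoint (the 50 per cent point under a
    rectangular distribution) of the ability range of the group reaching it.
    Items are assumed to be answered without error.
    """
    path = []
    for _ in range(n_items):
        difficulty = (lo + hi) / 2.0
        passed = ability >= difficulty
        path.append((difficulty, passed))
        if passed:
            lo = difficulty   # continue with the more able half
        else:
            hi = difficulty   # continue with the less able half
    return path, (lo, hi)


path, interval = route(2.9)
for difficulty, passed in path:
    print(f"difficulty {difficulty:+.5f}  {'pass' if passed else 'fail'}")
print("final ability interval:", interval)
```

A high ability examinee meets only progressively harder items, and six dichotomous decisions narrow the range to one of 64 terminal intervals.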
This procedure accords with the works of Brogden, Cronbach, Davis, Gulliksen, Humphreys, Lawley, Loevinger, Lord, Richardson, Thurstone, Tucker, and Walker which indicate that if one wishes maximum discrimination of a group into two groups, then all items should be at the 50 per cent level of difficulty for a hypothetical group the median of which is at the point where the discrimination is desired.⁸ This means that one needs difficult items to best discriminate within high ability groups and easy items to discriminate within low ability groups. The sequential procedure allows the difficulty of the item to be suited to the ability level of the group answering the item.

The second reason for assuming that the sequential test will operate better than a cumulative test is that since individuals of different ability levels do not take the same items, the number of low ability people passing a difficult item by chance will not exceed the number of high ability people passing the item due to their ability. As has been pointed out by Meehl, in the cumulative test an item with poor discriminating power is better than one with greater discriminating power if fifty per cent of the people are expected to pass the first, and only 10 per cent to pass the second.⁹

⁸ See footnote 6.

Control of the Score Distribution

The problem of control of score distribution is to assign like people the same score, and to yield a score distribution which will best serve the purpose of the test.
Since the distribution of scores depends upon the distribution of ability of those taking the test and upon the difficulty and precision of the items, Lord and Brogden have each stated that for a normal distribution of ability and with items of equal difficulty and usual precision, the cumulative test cannot produce normally distributed scores.¹⁰ Humphreys has suggested that the answer is to spread the item difficulties.¹¹ He gives no method to show how such a spread of difficulties is determined. Another answer is the sequential process developed in this paper.

It is assumed that the sequential procedure will more adequately control the score distribution because the items must operate well for only a small group of people, not for all of the individuals taking the examination. After precise items are used to validly split a given group, the resulting groups may be further divided into whatever size is desired by using additional items of appropriate difficulty. Any number of subgroups may be combined if desired to produce appropriate distributions or to combine like individuals. These methods of control should allow maximum control of the score distribution.

⁹ Meehl and Rosen, op. cit.

¹⁰ Lord, A Theory of Test Scores, op. cit., p. 11; and Brogden, op. cit., p. 207.

¹¹ Humphreys, op. cit.

Meaning of a Score

A sequential test score may represent the ability level of the individual, the rank of the individual, or the pattern of response, but it does not represent more than one of these at the same time. The ability level of the individual is represented by the score when the score is the difficulty of the final item. The rank of the individual is represented by the rank of difficulty of the final item. (The rank scale is an equal interval scale on ability when equal discriminations are made at all ability levels--in this case rank of difficulty and difficulty represent the same factor--the ability level of the individual.
If unequal discriminations at different ability levels are made the scales represent different information.) The pattern of response of the individual would be represented by a score assigned to the sequence of items taken in the sequential test. Even though every individual may pass the same number of items, the sequence of items taken by an individual may be specified and assigned a score different from that of an individual who passed the same number of items but via a different route. Different routes (sequences) will represent different items being passed even though the number of items passed is identical.

Since the sequential test has several scoring procedures each yielding a different but precise score meaning, the sequential score is more interpretable than the cumulative test score, which is typically a conglomerate of all of these scoring procedures. In addition to the precision of meaning, the different scoring procedures allow great versatility in the use of the test.

Selection of the Sequential Procedure

The type of sequential procedure used depends upon the purpose of the test: (1) rapid classification of extreme ability individuals, (2) reaching a prescribed degree of accuracy for each score, or (3) doing the best job possible in the time allowed. In the present case the decision was made to do the best possible job with six items. The reasons for accepting this decision and the reasons for rejecting the other decisions are outlined briefly.

The rapid classification of individuals may be thought of as either classification into such categories as accept, reject, and continue testing--or classification into score categories which would more closely represent the results of the more traditional scoring procedures.
The classification into the three categories closely resembles the procedure developed by Wald for industry where the concern was to predict the number of faulty objects in the population. A random sample of the population was used at each stage. In the Wald procedure two sets of values are computed: the one set is such that after each sample if results are lower (e.g., in number of correct items) than a specified number, then one may classify the population (or individual) as rejected with a specified probability; and the other set of values is such that after each sample if results are higher than a specified number, then one may classify the population (or individual) as accepted with a specified probability.¹²

Fiske and Jones have advocated that the sequential procedure as outlined by Wald be used only when the problem involves the choice between two possible parameter values which can be specified on a priori, but not arbitrary, grounds.¹³ To classify people into additional categories, Cowden modified the Wald procedure. He assumed that the fewer items one needed to meet the criteria for classification into either the accept or reject categories, the farther the individual was from the specified level. He thus created five categories with the extreme categories being classified very rapidly with few items.

¹² Abraham Wald, Sequential Analysis (New York: John Wiley and Sons, 1947).

¹³ Donald W. Fiske and Lyle V. Jones, "Sequential Analysis in Psychological Research," Psychological Bulletin, 51:264-275, May, 1954.

The second sequential procedure suggested above--that is, classifying until a specific degree of accuracy has been reached--has not yet been investigated. Exploration of this procedure was rejected because it was felt that this procedure might be more fruitfully explored after there was more adequate understanding of the interrelationships of the variables involved in the sequential procedure developed in this paper.

Whereas in the industrial system of sequential testing the model assumes a random sample of ability at each level, this is not the best procedure for obtaining information about the ability level of an individual. Except in selection situations, the purpose is to determine the level of ability the individual possesses rather than whether the individual is above or below a given ability level. In the sequential procedure developed in this paper, a random sample of the individual's behavior is not used; there is rather an attempt to classify individuals into as many ability categories as can be adequately differentiated. There has been no mathematical model developed for the above procedure and the apparent alternative of developing one did not seem fruitful at this time. An empirical study of the problem did not seem fruitful because neither the ability level of individuals, the precision of the items, nor the difficulty of the items can be determined exactly. The best alternative seemed to be that of creating exact data and then creating a model which would use this data in a manner resembling the actual situation.
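The idea of "creating exact data" can be illustrated with a short sketch of the kind of calculation the abstract describes, in which the probability of a six-item response sequence is the product of the pass or fail probabilities of its items. The sketch makes assumptions that are not taken from the thesis: a normal-ogive item function, a `spread` parameter standing in for item precision (smaller values mean more precise items), and a hypothetical bisecting rule for the sequential difficulties. All function names are illustrative.

```python
import math
from itertools import product


def p_pass(ability, difficulty, spread):
    """Normal-ogive probability of passing an item (assumed functional form)."""
    z = (ability - difficulty) / spread
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sequence_probabilities(ability, difficulty_for, spread, n_items=6):
    """Probability of each of the 2**n_items pass/fail sequences.

    difficulty_for(prefix) returns the difficulty of the next item given the
    tuple of outcomes (1 = pass, 0 = fail) observed so far.
    """
    probs = {}
    for seq in product((0, 1), repeat=n_items):
        p = 1.0
        for i, outcome in enumerate(seq):
            q = p_pass(ability, difficulty_for(seq[:i]), spread)
            p *= q if outcome else 1.0 - q
        probs[seq] = p
    return probs


def cumulative_difficulty(prefix):
    """Cumulative model: every item at the 50 per cent level (difficulty 0)."""
    return 0.0


def bisecting_difficulty(prefix, lo=-3.0, hi=3.0):
    """Hypothetical sequential rule: next item at the midpoint of the range left."""
    for outcome in prefix:
        mid = (lo + hi) / 2.0
        if outcome:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


cumulative = sequence_probabilities(0.0, cumulative_difficulty, 1.0)
sequential = sequence_probabilities(1.2, bisecting_difficulty, 0.8)
print(len(cumulative), round(sum(sequential.values()), 6))
```

Because each stage is a dichotomous decision, the 64 sequence probabilities for any one ability level sum to one, which gives a quick check on a calculation of this kind.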
Preliminary work with the sequential procedure had used a probability model that had been empirically checked with actual data and which had been programmed for the electronic computer.¹⁴ It was thus decided to take advantage of the computer program for this study. The program used six items and permitted calculation for any sequence possible where items were used to make dichotomous decisions.

¹⁴ Unpublished material developed in the Bureau of Educational Research, Michigan State University, East Lansing, Michigan, 1956-1959.

IV. HYPOTHESES

The problems of testing are best described according to the type of decisions that need to be made; however, the investigation of these problems is best classified according to the variables that are changed. Changes in any variable, such as the type of ability distribution of those taking the examination, may affect one or more of these problems. From the rationale developed in the previous section, one can deduce the effects these variables should have on efficiency, control of score distribution, and type of score produced.

The rationale will explain the effect of the variables when used with the six-item cumulative model with all items at the fifty per cent level of difficulty as well as when used with the sequential model. The one exception to this statement is that Lawley's work would indicate that precise scores (scores which have small variance of ability level for individuals assigned the score) are created for only a single group by using items quite removed from the ability level of those individuals whom one wishes to precisely classify. For example, if we wished to have the extreme scores precisely defined then we would use items at the fifty per cent level of difficulty. The hypotheses on precision of score are derived from the above conclusion of Lawley.
The score distribution examined in this study is the one actually produced, although it is clear that scores could be combined to yield shapes of distributions different from the one initially produced. The score meaning that is examined here is that of reflection of the criterion ability scale.

The general hypotheses arising out of the rationale will be described here. The operational hypotheses that are tested are stated in Chapter III. There are (1) a set of hypotheses concerned with the effect of the type of ability distribution on both the six-item cumulative model and the six-item sequential test model; (2) a set of hypotheses concerned with the effect of precision and difficulty on the output distribution of the sequential test model; and (3) a set of hypotheses concerned with the effect of the errors in estimating the parameter values on the output.

Effect of the Type of Ability Distribution

The effect of type of ability distribution on maximally efficient use of items may be examined by determining the variance of scores which are assigned to a given ability level, or by examining the variance of ability levels assigned to a score.
"Discrimination among ability levels" shall be used to designate whether different ability levels are assigned different scores, and "precision of scores" shall be used to indicate whether all individuals at that score are of approximately the same ability level. Another method of determining the effect of type of ability distribution is to determine discrimination among people. (This procedure involves decisions as to both control of score distribution and meaning of the score produced.) Discrimination among people is a measure of the ability of the test to rank individuals according to ability. This type of discrimination is not considered in the following hypotheses.

As the sequential test being considered here is one designed to discriminate among ability levels, it should work quite efficiently for all distributions with respect to the separation of the ability levels and the reflection of the actual ability distribution in the score distribution. As will be shown in Chapter II in the review of Lawley's work, the cumulative test should have a greater precision of scores for extreme scores, but should be equal to the sequential in its ability to accurately discriminate among the ability levels of individuals only at the middle ability levels. These expectations are examined under conditions where two different distributions are input--normal and U-shaped.

Normal distribution.--(1) The cumulative and sequential test models should have equal ability to classify individuals of mean ability level. This hypothesis follows from the fact that middle ability people will take 50 per cent level of difficulty items in the cumulative test, and should take items near the 50 per cent level of difficulty in the sequential test. If the sequential does not operate efficiently, the cumulative test will have the more discriminating scores.
(2) The sequential test model should more accurately classify the individuals at the extremes of the ability scale than should the cumulative test model. This is based upon the rationale that the sequential test can use difficult items because it discriminates among high ability individuals (as these items are at the 50 per cent level of difficulty for these high ability individuals). The test item does not have to discriminate between low and high ability individuals as only high ability individuals will take the item.

(3) The cumulative test model should have more precise scores at the extremes of ability than the sequential test model. This follows from the work of Lawley which showed that the variance of ability levels for individuals assigned to high scores would be low if the items were easy for these individuals.

(4) The scores for the cumulative test model should represent finer ability units in the middle than at the extremes, while the sequential test model scores should reflect the ability level scale. The best discriminations among ability levels should be made by using items at the 50 per cent level of difficulty for the hypothetical group the median ability of which is at the point where the discrimination is desired. For the cumulative test the best discriminations should be at the 50 per cent level of ability; whereas, in the sequential test items should discriminate quite equally over the entire range of ability.
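The reasoning behind hypotheses (2) and (4) rests on an item discriminating best among examinees whose ability lies near its difficulty. A minimal numerical sketch follows, assuming a normal-ogive item function (an assumption made here for illustration, with `spread` standing in for item precision): the rate at which the pass probability changes with ability is the normal density, which peaks where ability equals difficulty.

```python
import math


def slope(ability, difficulty, spread):
    """Rate of change of the pass probability with ability.

    For a normal-ogive item this is the normal density evaluated at
    (ability - difficulty) / spread, divided by spread.
    """
    z = (ability - difficulty) / spread
    return math.exp(-0.5 * z * z) / (spread * math.sqrt(2.0 * math.pi))


# Cumulative model: every item at difficulty 0 (the 50 per cent level).
mid_on_cumulative = slope(0.0, 0.0, 1.0)
high_on_cumulative = slope(2.5, 0.0, 1.0)

# Sequential model: a high ability examinee meets an item matched to that level.
high_on_matched = slope(2.5, 2.5, 1.0)

print(round(mid_on_cumulative, 4), round(high_on_cumulative, 4), round(high_on_matched, 4))
```

Under these assumptions an item at the 50 per cent level separates neighbouring ability levels far less sharply at the extreme than at the middle, while a matched item restores the full slope.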
U-shaped distribution.--(1) The sequential test model should more accurately classify the individuals of category 13 (see "Ideal T Score" in Table 24) than the cumulative test model. Category 13 individuals are the focus of consideration because in a U-shaped distribution few people are at the mean and the question becomes how well one can classify individuals who exist in larger number and are not at the extreme. Category 13 represents this mean value for those individuals in the upper half of the distribution of ability. The reason that the sequential should more accurately classify these people is that the items are more appropriate for their level of ability than the 50 per cent level of difficulty items used in the cumulative.

(2) The sequential test model should more accurately classify the individuals at the extremes of the ability distribution than the cumulative test model. The reason for these expected results is again that items are more appropriate for the individuals, and individuals taking the items have a smaller variance in ability than those taking the cumulative items.

(3) The cumulative test model should have more precise scores at the extremes than the sequential test model. Again this follows from Lawley's work.

(4) The sequential test model should have equal score discriminations for all groups including the mean group, whereas the cumulative test model should have finer score discriminations for middle ability levels than for the extreme ability group. This follows from the wide distribution of item difficulties used in the sequential as compared to the cumulative tests. Items discriminate best only at one ability level and should be used only with individuals close to that ability level.
Effect of Item Precision and Difficulty for the Sequential Test

The relationship of item precision and difficulty to output characteristics must be examined together, as a change in precision results in a change of the appropriate difficulty levels in the manner described in Chapter III. There are five levels of precision used: rbis = .79, .75, .71, .60, and .45. Since the ability distribution also affects score distribution, a normal distribution of ability is used as this is the type of distribution most likely to occur in the practical situation.

(1) The variance of scores for a given ability level should be less with the test using the most precise items. The value for the precision of an item indicates how effectively the item differentiates individuals of one ability from those in the next closest ability level. If the item is precise then each item can make a different distinction in ability rather than more accurately making the distinction that should have been made by a prior item.

(2) The test consisting of the most precise items should have more equal discrimination between adjacent ability levels than will the less precise test. If the ability of an item to discriminate among ability levels is dependent upon the difficulty level of the item, then the more precise test, which has a wider range of difficulties, should discriminate at all levels, while the less precise test, which has a smaller range of difficulties, should discriminate well among middle ability individuals where difficulties are appropriate. The less precise test should not discriminate as well among extreme ability individuals where difficulties are not as appropriate.

Effect of Errors in Estimating Parameters

The usefulness of the model for practical purposes depends upon the sensitivity of the test design to the use of an item which only approximates the precision and difficulty level which would be called for by the "ideal" model.
If the values need not be very accurately determined before use can be made of the sequential test model, one is more likely to use the model. Preliminary studies have indicated that the sequential test will probably be more sensitive to precision estimates than to difficulty estimates.

The effect of errors of parameter estimates is the same effect as is involved in the use of items which have parameter values other than those required by the test. As is noted in Chapter III, Section 1, each succeeding item in a sequential test is selected in such a way as to maximize discrimination based on data from the effects of previous items. The effect of using a more precise item than called for should be that the next item would not be difficult enough or easy enough for maximum discrimination. The effect of using an item too easy should be to increase the precision of score for the upper group, but to decrease the discrimination among ability levels.

Since the effect of errors made in early stages is either corrected or magnified by the effect of later items, and since the effect of errors made in later stages has no chance to be corrected or magnified, one would expect differences in the effect of errors at early and late stages. The hypotheses made as to effect of errors at these different stages are as follows:

(1) Errors in difficulty at an early stage should not have any serious effects as there would be a wide range of ability and the item would operate well for some of that range.

(2) Errors in difficulty at the final stages should increase the variance of ability levels assigned to one of the two subgroups into which the total group would be separated, but should not lower the variance of scores assigned to the ability levels.

(3) Errors in estimates of the precision of the item should be more serious in the initial stages where wide separations in difficulty level of the next item would be used.

(4) Errors in the estimates of the precision of the items should make little difference at the final stages as the next item would be appropriate.

If the sequential testing procedure is robust in that errors in estimating parameters do not seem to greatly affect type of output, then it would be possible to design the test with parameter values determined from one sample of a population and use this same test in different situations. (The value used for the precision of the item is dependent upon the spread of ability in the sample used to determine the precision value. If the spread of ability is great in contrast to item sensitivity, one has a precise item. If the spread of ability is narrowed, the same item would be considered a less precise item.)

V. LIMITATIONS OF THE STUDY

The three major contributions of this study are that it: (1) discusses the problems of the cumulative test and shows how the sequential model attempts a solution to each of these; (2) provides a model that may be used in construction of any sequential test; and (3) presents a rationale for the sequential test model which, when tested, should allow the construction of additional sequential tests.

There are, however, many problems that are not examined. Six of these are listed and discussed because the background material gives suggestions as to the probable answers to these problems also. These are: (1) the best possible cumulative test, (2) the score distributions desired for the cumulative and sequential models, (3) the types of ability distributions that may be present in the usual situation, (4) likely test parameters for usual test items, (5) commercial test construction procedures, and (6) test presentation procedures and the psychological effects of the sequential model.
Best Cumulative Test

The work of Brogden and Humphreys indicates that the best cumulative test with precise items is one with a spread of difficulties.¹⁵ The exact relationship between spread of difficulties and precision to yield maximum validity (measured by correlation with ability distribution) is not known, but Cronbach and Warrington indicate that for a cumulative test of a given length, $\sigma_d^2 + \sigma_a^2$ will have a preferred value.¹⁶ (The term $\sigma_d$ is the standard deviation of the spread of item difficulties and $\sigma_a$ is the measure of precision, which is the same as the one used in this paper.) The sequential test models are not compared to the best possible cumulative model, but the use of items all at the 50 per cent level of difficulty creates a test that is more than sufficient for most uses for most levels of precision.¹⁷ The purpose of the cumulative test model in this dissertation is to put the sequential test model material into perspective.

¹⁵ Brogden, op. cit.; and Humphreys, op. cit.

¹⁶ Cronbach and Warrington, op. cit.

¹⁷ Ibid.

Distribution of Scores

If the purpose of testing is selection, then a test need only produce two scores, one for the individual who is selected and the other for the one rejected. In this situation the sequential model developed here would require modification both in method of scoring and in number of items taken by individuals. The previously discussed sequential model developed by Wald, involving a variable number of items taken by individuals, is probably the optimal solution. The problem of test construction thus is no longer that of determining the difficulty of the item, but rather the number of items needed to make the most rapid classification. There is no score distribution as such, only accept, reject, and continue testing categories of individuals.
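The accept, reject, and continue testing categories of the Wald procedure can be sketched with his sequential probability ratio test. The sketch assumes, for illustration only, that item responses behave as independent trials with a fixed pass probability of `p0` for an examinee who should be rejected and `p1` for one who should be accepted; the error rates `alpha` and `beta` and all numeric values are hypothetical and are not taken from the thesis.

```python
import math


def sprt_decision(responses, p0, p1, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test for pass (1) / fail (0) items.

    Returns ("accept" | "reject" | "continue", number of items used).
    The boundaries use Wald's approximations A = (1 - beta) / alpha and
    B = beta / (1 - alpha), applied to the log likelihood ratio.
    """
    upper = math.log((1.0 - beta) / alpha)
    lower = math.log(beta / (1.0 - alpha))
    llr = 0.0
    for n, correct in enumerate(responses, start=1):
        if correct:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1.0 - p1) / (1.0 - p0))
        if llr >= upper:
            return "accept", n
        if llr <= lower:
            return "reject", n
    return "continue", len(responses)


print(sprt_decision([1, 1, 1, 1, 1, 1], p0=0.3, p1=0.7))
print(sprt_decision([0, 0, 0, 0, 0, 0], p0=0.3, p1=0.7))
print(sprt_decision([1, 0, 1, 0], p0=0.3, p1=0.7))
```

Consistent responses end testing after a few items, while mixed responses leave the examinee in the continue testing category, which is why the number of items taken varies from one individual to another.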
The cumulative test used to differentiate two groups would be one with all the items at the level of difficulty appropriate for the ability level at which one wishes to make the decision. A test of this nature would have a score distribution which would be platykurtic, rectangular, or bimodal depending upon the precision of the items in the test. The test with the most precise items would have a bimodal score distribution.

If one desired to rank individuals by the scores from the test, one would make fine discriminations in ability for those ability levels where there were many people. In this way the individuals would be assigned scores which would be rectangularly distributed. This can be accomplished by use of a cumulative test which has either fairly precise items at the 50 per cent level of difficulty or a spread of item difficulties for less precise items. For the sequential test, there would be more items included at the difficulty level appropriate for the discriminations that are desired.

The construction of either a sequential or a cumulative test which has the score distribution discussed above is outside the scope of this dissertation. Further research is needed to determine the items for a sequential test which would have a rectangular distribution with the input of a normal ability distribution.

Ability Distributions

Lord has stated that perhaps test constructors should not consider ability as normally distributed.19 It is possible that a bimodal distribution of ability is common in that there are many individuals who perform adequately and many individuals who perform inadequately with a large gap between these two performance groups. If this is true, the sequential test model should operate well for these distributions, as it should operate well with any type of distribution. Aberrations in its operation would show up most clearly when the test model is tested against a U-shaped distribution of ability.
In Chapter IV the results are reported for testing the model against the U-shaped and normal distributions. These results indicate how the sequential test scores may be interpreted when used with different ability distributions. However, no rationale is developed to indicate what the results should be and, therefore, the interpretation of scores across ability levels depends upon a rationale developed post facto, not upon the rationale tested in the study.

19 Lord, A Theory of Test Scores, op. cit.

Test Parameters

The effect of the number of items has not been examined. The six-item test was used because the probability model for the test had been programmed for the electronic computer and six items were the maximum for this program. Further research is needed to determine how rapidly the output characteristic changes (if at all) when the test consists of more items.

Test Construction Procedures

The computational model described in Chapter III for the construction of a sequential test has a method of selecting items with the best possible parameter values. This method could be used in the construction of a sequential test with the data in terms of difficulty and precision taken from actual items. The criterion may be a measure of the number of individuals desired to pass the item or a measure of the variance of ability levels of individuals assigned to the pass and fail categories. It would seem reasonable that one should use the most precise items to differentiate the individuals as to ability level and then the difficulty of a less precise item could be used to control the number of individuals assigned to any one score category. The second differentiation would not be as valid as the one made with the more precise item, but the shape of the distribution could be well controlled.
In addition to lack of a complete evaluation of the score distribution control procedure, there has been no attempt to follow the standard criteria such as that published by the Committee on Test Standards of the American Educational Research Association.20 These criteria include content validity, concurrent validity, predictive validity, construct validity, error of measurement at different score levels, equivalence of forms reliability, internal consistency reliability, stability reliability, and information on norms and scales.

20 American Educational Research Association, Committee on Test Standards, and National Council on Measurements Used in Education, Committee on Test Standards, Technical Recommendations for Achievement Tests, 1955.

Since this dissertation uses hypothetical data, content validity is not considered. It is assumed that the test items are homogeneous and thus measure only one content or ability which may or may not be a composite of several abilities.

The six-item sequential is compared with the six-item cumulative but no correlation is computed between the two sets of scores, as is common in concurrent validity studies. In this type of a model one can probably obtain more information from the correlation with a known criterion score than from correlation between sequential and cumulative test scores.

The predictive validity of the test is not determined as it made no sense to use hypothetical data to predict hypothetical performance. Predictive validity needs to be studied through the construction of a sequential test with actual items, testing of a group, and then the prediction of future performance. This would be a logical next step if the model data studied here show that the sequential item test is a better test than a six-item cumulative under the conditions of this study.
If the sequential test does not have results which may be considered better than the results from the cumulative test, then there is no need to study the sequential under less favorable conditions.

In construct validity it is assumed that the characteristics measured and related are not affected by the type of items used in the test. Results from this study may be used to indicate that these assumptions are not met in most situations. A study of the attenuation paradox literature should make one aware of the problems involved in the measurement of characteristics and their relationships. There is no attempt to evaluate the construct validity of the sequential test. Neither is there any attempt made to correlate test scores with other abilities that should be related to the particular hypothetical ability being measured. That which is measured is any homogeneous ability measured by the items with the given level of precision--all of the items in the sequential model have the same precision.

Error of measurement at different score levels is examined in detail as suggested by the criteria for evaluation of a test. The discriminating power of the test at a given level of test score is to be distinguished from the discriminating power at a given level of ability. Both the variance of the test scores of each ability level, and the variance of the ability levels at each score are examined.

The equivalence of forms reliability is not determined as there is only one form. It would be quite simple to build two tests in a computer and determine how well the scores on the one test could be predicted from the scores on the other. It is possible that quite equivalent tests could be built from quite different items. This possibility is not examined in this dissertation.

Due to the hypothetical nature of the data the internal consistency reliability is not examined.
Stability reliability is not determined as it would be necessary to administer a test twice to a group to determine this, and no test is actually used in this paper. This is another area that needs to be examined.

There is a fairly complete discussion of the score distribution of the sequential item test. It is hoped that the rationale which predicted the type of score distribution would be proved correct and thus a tested rationale would be presented rather than a rationale derived from the results.

Norms (like many of the criteria used to evaluate a test rather than a test procedure) are irrelevant to the test procedure.

Another limitation to the study is that no attempt is made to examine the effects of errors of estimating the parameter values when the level of precision is low. However, one would suppose that the effect of errors will be less at lower precision levels. If the effects produced at high levels of item precision are within the error range for practical significance, then there is little need to examine the effects at low levels of item precision. If the effects at high levels of item precision are beyond the error allowed for practical significance, then one must determine the effects of lower item precisions or develop methods of obtaining better estimates. This decision can be made later.

Test Presentation Procedures and Effects

In the area of sequential test presentation to the testee little is known as to how to proceed in actual practice. For example, it may be psychologically advantageous to give the easiest items first, allowing some individuals to subsequently try more difficult items, rather than to have everyone start at an item of 50 per cent difficulty.
Since the test is not given to an actual group this procedure cannot be examined in this dissertation.

The greater the number of scores ... [text and equation illegible in the source]

When the mean difficulty of items is at the 50 per cent level of difficulty for the individual then the error variance of the score is defined as below:

$$\sigma^2_{E(X)} = n\left[\frac{1}{4} - \frac{1}{2\pi}\sin^{-1}\rho_1\right]$$

The terms are defined as follows:

$\sigma^2_{E(X)}$ = error variance of score
$n$ = number of items
$X$ = score value
$t_0, t_1, t_2$, etc. = values from Table 29 of Pearson's Tables for Statisticians and Biometricians (ordinarily used to calculate $r_{tet}$)
$\rho_1 = \sigma_1^2 / (\sigma_0^2 + \sigma_1^2)$
$\sigma_1^2$ = variance of item difficulties (standard score form)
$1/\sigma_0$ = precision of item

23 D. N. Lawley, "On Problems Connected with Item Selection and Test Construction," Proceedings of the Royal Society of Edinburgh, 61 (Section A, Part III):273-287, 1942-1943, p. 273.

From these equations, and the assumptions mentioned above, one can determine that large $\rho_1$ would reduce the error term whether the ability level is equal to the mean difficulty of the items or not. The size of $\rho_1$ can be increased by decreasing $\sigma_0^2$ (using more precise items), by decreasing $\sigma_1^2$ in the denominator (or using all items at one difficulty), or by increasing $\sigma_1^2$ in the numerator (using items at more than one difficulty level). This immediately suggests that the best procedure is to use more precise items if one wishes to reduce error variance in the score, as $\sigma_1^2$ appears in both the numerator and denominator. This is in contrast with the most valid test results reported by Tucker, who empirically found that the most valid test was the test with imperfect items.25

Another way of reducing error variance would be to use small $t_0$ values. (The value $t_0$ is necessary to enter Pearson's tables.) Lawley gives the following formula for $t_0$:26

25 Tucker, op. cit.
[formula not preserved in the source]

$x$ = ability level (standard score form)
$\bar{x}$ = mean difficulty level of cumulative test
$\sigma^2 = \sigma_0^2 + \sigma_1^2$ (as defined above)

To aid in the understanding of the interpretation of the formula given above the following summary data is reported for a test with the mean difficulty level of items nearly equal to mean ability level ($\bar{x}$ = .045) and with a $\sigma$ (a combination of the spread of item difficulty and precision of items) of 1.30 for a 100-item test.27 The values of $\sigma^2_{E(X)}$ for given values of $(x - \bar{x})/\sigma$ are as follows:

    (x - x̄)/σ     σ²E(X)
       0.0         20.8
       0.1         20.7
       0.2         20.4
       0.3         19.8
       0.4         19.0
       0.5         18.0
       0.6         16.9
       0.7         15.6
       0.8         14.3
       0.9         13.0
       1.0         11.6

26 Lawley, op. cit., p. 279.
27 Ibid.

As can be seen from the preceding data, for given $\bar{x}$ and $\sigma$ values, the higher the ability level ($x$), the lower the error variance for the score ($\sigma^2_{E(X)}$) for a cumulative test. If the items had a large value (fixed) for the mean difficulty level (i.e., the value of $\bar{x}$ increased) then the value of $(x - \bar{x})/\sigma$ would be smaller and thus the error variance ($\sigma^2_{E(X)}$) would be larger.

Lawley also pointed out that the effective discriminating power of a test may be computed as follows:28

[formula illegible in the source; it relates $E(X) - E(X')$ to the error variances at the two score values]

If $x = \bar{x}$ then the above formula becomes:

[formula illegible in the source]

$x$ and $x'$ are two different ability levels
$X$ and $X'$ are two different score values
Other terms are defined as before.

As Lawley pointed out, in order to increase the effective discriminating power the numerator must be increased, which means obtaining large values for [symbol illegible in the source], or the denominator may be decreased, and, assuming $\sigma_0^2$ is constant (as one cannot change precision) then one must change $\sigma_1$, which is the spread of difficulty.29 The smaller the spread of difficulties the lower the value.

28 Ibid., p. 280.
29 Ibid., p. 281.
The effective discriminating power for a test would thus be greatest when the mean difficulty of items was equal to the ability level for the extremes in ability, and, when there was no spread of item difficulties. This type of test would be used to create scores which would be assigned only to individuals that are the same. It is not used to differentiate between the ability levels of individuals.

The same logic which states that middle scores will be more precise (i.e., representing only one type of ability level individual) when difficulties are extreme would indicate that extreme scores will be more precise when 0.00 level of difficulty items are used in the test. (Remember the formula uses $\bar{x}^2$ so it would operate for either extreme of difficulty.)

Support for this position is given by Lord who stated that the standard error of measurement would be practically zero for extreme positive or negative values of ability.30 He argued that there would exist individuals whose ability would be so low that the test would not be discriminating for them, and other individuals whose ability would be too high to be discriminated. The standard error of measurement is low for these zero or perfect scores and is necessarily smallest for those examinees for whom the test is least discriminating.

30 Lord, A Theory of Test Scores, op. cit., p. 14.

The above solutions to changing the criteria of test validity still do not exhaust the solutions to the attenuation paradox. Brogden offers yet another solution. He found that the correlation continued upward when a spread of item difficulties instead of one level of difficulty was used.31 He concluded that the problem was that of determining the distribution of item difficulties to yield a more valid score.
Brogden showed that by using items with $r_{tet}$ = .60 or higher, a distribution of item difficulties will produce (for an 18-item test) a higher validity than will be obtained with all items at the .50 difficulty level.32 The spread of difficulty seemed to be important when items were of this reliability.

Brogden's solution of determining the spread of items for a test such that the results would correlate highest with a criterion seems to be inadequate since there remain the problems of measuring the relationship and the meaning of the coefficients that are computed. It is impossible to solve all of these problems at this time, but assuming that the difficulty of the item is an adequate score, and assuming that discrimination among ability levels (with an examination of the effective discriminating power) is the important question, a rationale can be built for the sequential test developed in this dissertation.

31 Brogden, op. cit., p. 2A0.
32 Ibid.

Two areas of literature will now be examined to build the rationale for effective use of items in the sequential test. They are (1) literature on Bayes' Theorem, and (2) literature on the use of items at the 50 per cent level of difficulty for the hypothetical group with a median ability level equal to the value at which the discrimination is desired.

Meehl and Rosen, through the use of Bayes' Theorem, point out that the practical value of a psychometric sign, pattern, or cutting score depends jointly upon its intrinsic validity (in the usual sense of its discriminating power) and the distribution of the criterion variable (base rates) in the clinical population.33 They note that if the base rates of the criterion classification deviate greatly from a 50-50 split, the use of a test sign having only moderate validity will result in an increase of erroneous clinical decisions.
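The dependence on base rates can be made concrete with a small calculation. This is a minimal sketch, assuming a sign of moderate validity that appears in 70 per cent of criterion-positive cases and 30 per cent of criterion-negative cases; these rates and the function names are illustrative and are not taken from Meehl and Rosen.

```python
def posterior_positive(base_rate, p_sign_given_pos, p_sign_given_neg):
    """Bayes' Theorem: P(criterion positive | sign present)."""
    numerator = base_rate * p_sign_given_pos
    denominator = numerator + (1 - base_rate) * p_sign_given_neg
    return numerator / denominator


def proportion_correct(base_rate, p_sign_given_pos, p_sign_given_neg):
    """Correct decisions when every case showing the sign is called positive."""
    hits = base_rate * p_sign_given_pos
    correct_rejections = (1 - base_rate) * (1 - p_sign_given_neg)
    return hits + correct_rejections


for base_rate in (0.50, 0.10):
    using_sign = proportion_correct(base_rate, 0.70, 0.30)
    # Without the sign, the best one can do is always call the larger class.
    base_rate_only = max(base_rate, 1 - base_rate)
    posterior = posterior_positive(base_rate, 0.70, 0.30)
    print(f"base rate {base_rate:.2f}: P(positive | sign) = {posterior:.3f}, "
          f"correct with sign = {using_sign:.2f}, "
          f"correct from base rate alone = {base_rate_only:.2f}")
```

With these assumed rates the sign improves on the base rate when the split is 50-50, but at a 10-90 split calling everyone negative yields more correct decisions than using the sign.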
33 Meehl, Paul E. and Rosen, Albert. "Antecedent Probability and the Efficiency of Psychometric Signs, Patterns or Cutting Scores," Psychological Bulletin, 52:194-216, No. 3, 1955.

One reason that the sequential test is assumed to have maximally efficient use of items is that the base rate does not have to deviate from the 50-50 split. The other reason is that the sequential test uses items at the 50 per cent level of difficulty for the group taking the item. These items have been found to be efficient with various criteria for efficiency.

Lord concluded from maximizing the ratio of difference in means to standard error of difference, that if one desires to construct a test that will have the greatest possible discriminating power for examinees of a given level of ability, $c = c_0$, then all items should be of equal difficulty (no spread) and of such difficulty that half of those examinees whose ability score is $c_0$ would answer each item correctly and half would answer it incorrectly.34 This measure of discriminating power is completely independent of the distribution of ability in the group tested.

However, when item precision is such that item-total biserial correlations are .447, Lord empirically showed that a test composed solely of items at the 50 per cent difficulty is more discriminating (as measured above) than any other test for examinees at any level of ability between -2.5 and +2.5.35 Lord does not show results of more highly correlated items which will be investigated in the present study.

Lord's empirical study above is supported by Cronbach and Warrington's theoretical study. They stated that for items of the type ordinarily used in psychological tests, the test with uniform item difficulty gives greater over-all validity and superior validity for most cutting scores, as compared with a test with a range of item difficulties.36 It is the cutting score validity which is new here and of some relevance to the sequential test constructor.
For example, Cronbach and Warrington found that if $\sigma$ = .2 (i.e., $r_{tet}$ = .94 or $\phi$ = .80), if no guessing is possible (or $r_{tet}$ = .55 or $\phi$ = .37 if the probability of chance success by guessing is one-third), and if all items are at the 50 per cent level of difficulty, better results are obtained for separating out from 40 to 62 per cent below the cutting score than if there were a normal distribution of item difficulties.37

34 Lord, A Theory of Test Scores, op. cit., p. 26.
35 Ibid., p. 29.
36 Cronbach and Warrington, op. cit., p. 127.
37 Ibid., p. 135.

The empirical determination of the best difficulties for discrimination has not always been as nonsupportive of the present rationale as the work of Lord. Lord used discriminating power (as defined by him) as his criterion. Richardson's empirical study had more supportive results. He created five subtests of different difficulty levels: 78-95, 60-77, 41-54, 23-40, and 5-22.38 He then calculated the biserial correlations for 23 different divisions of the criterion starting at 4.17 per cent of the people in the lower category, and, by percentage units of 4.1667, continuing to 95.83 per cent in the lower category. He graphed these results and noted that the test consisting of items from 78-95 per cent passing produced the highest biserial correlation for those divisions where 4.17 to 25.00 per cent of the people were in the lower category. Likewise the 60-70 per cent pass test was best for the 25.00 to 35.00 divisions; the 41-49 per cent pass test for the 35.00 to 61.50 divisions; the 23-40 per cent pass test for the 61.50 to 82.00 divisions; and the 5-22 per cent pass test produced the highest biserials where 82.00 to 95.83 per cent of the people were in the lower category.

38 M. W. Richardson, "The Relation Between the Difficulty and the Differential Validity of a Test," Psychometrika, 1:33-49, No. 2, June, 1936.
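The 23 divisions described above are consistent with cutting the criterion group at every multiple of 100/24 per cent. A short check of that arithmetic, with illustrative variable names:

```python
# Cut points at multiples of 100/24 per cent, excluding 0 and 100.
step = 100 / 24
divisions = [round(step * k, 2) for k in range(1, 24)]

print(len(divisions))                # number of divisions
print(divisions[0], divisions[-1])   # first and last cut points
print(divisions[5])                  # the sixth division falls at one quarter
```

The first and last values reproduce the 4.17 and 95.83 per cent end points, and the sixth division falls at the 25.00 per cent boundary cited for the easiest subtest.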
Although these results are from 50-item tests, the results indicate that different difficulty tests for different discriminations should be useful.

Other results from studies which would support the position that items at the 50 per cent level of difficulty for the group are the best items, are those which indicate differentiation of a group by items of different difficulty. In these studies the ability levels of the individuals are not known and differentiation for each ability level is not reported separately. The reader must assume that the individuals were normally distributed around an ability level equal to the difficulty of the items. If this assumption is made then low differentiation by difficult items supports the conclusion that items appropriate for the ability level are the best items.

Such a study as described above is reported by Cleeton. Cleeton used four well selected ability groups--one superior group and three inferior groups.39 He then constructed two measures of the differential or predictive value of the test. One of these was (R1 - R4) in which R stands for the number of items answered correctly by group 1, 2, 3, or 4. The other measure was (R1 - R2) + (R1 - R3) + (R1 - R4) + (R2 - R3) + (R2 - R4) + (R3 - R4). (Terms have the same meaning as above.) These are criterion II and criterion I in the following results, respectively.

39 Glen U. Cleeton, "Optimum Difficulty of Group Test Items," Journal of Applied Psychology, 10:327-340, No. 3, September, 1926.

Cleeton examined difficulty by grouping 1/10 of the items in each interval and by grouping 1/10 of the range of difficulty in each interval. For present purposes it is most informative to look at the actual difficulty divided into 10 parts even though the number of items in each interval is different. The following data show the results of 240, 240, and 480 individuals each taking three tests of 400, 236, and 109 items.
(For the computation of criterion indices, Cleeton assumed that he had only 720 individuals.)

    Interval for      Rank          Rank          Value         Value
    % Passing Item    Criterion I   Criterion II  Criterion I   Criterion II
    91 - 100           8             8             44.4          14.7
    81 - 90            6             6            104.9          28.9
    71 - 80            5             5            125.9          40.8
    61 - 70            4             4            152.6          46.8
    51 - 60            3             3            158.9          47.6
    41 - 50            1             1            175.2          51.3
    31 - 40            2             2            163.9          51.1
    21 - 30            7             7             85.8          26.1
    11 - 20           10            10             35.9          11.1
     0 - 10            9             9             37.3          11.9

From the above data one may determine that the slightly more difficult items seem to have the greatest predictive value as measured by both these estimates of predictive value. This would support the decision to use items at the 50 per cent level of difficulty for the group which is to be discriminated among.

Logical analysis also supports the above decision. Flanagan pointed out the extremes of this difficulty and item validity argument. He stated that if one wanted the maximum amount of discrimination between the individuals in a particular group, a test should be composed of items all of which are at 50 per cent difficulty for that group--provided the intercorrelations of all the items are zero.40 If intercorrelations were other than zero, the decision would not be this clear.

Lord studied theoretical test models which had either high or low item reliabilities with easy, difficult, or easy and difficult test items. After examining the relationship of the true score distribution to the distribution of ability, he reached the following conclusion:41

40 John C. Flanagan, "General Considerations in the Selection of Test Items and a Short Method of Estimating the Product-Moment Coefficient from Data at the Tails of the Distribution," Journal of Educational Psychology, 30:674-680, No. 9, December, 1939.
41 Lord, "The Relation of Test Score to the Trait Underlying the Test," op. cit., p. 543.

A test composed of items of equal discriminating power but of varying difficulty will not be as discriminating in the neighborhood of any single
ability level as would a test composed of similar items all of appropriate difficulty for that level.

Thus, most of the literature supports (1) the use of items at the appropriate difficulty for each level and (2) the separation of individuals into groups that would have a base rate of 50 per cent. Because the base rate is near a 50 per cent split each time, the sequential model should permit the use of only moderately discriminating items. In the cumulative test, there will be only 5 or 10 per cent of the individuals who should pass a difficult item, as all people take the item. In the sequential method 50 per cent should pass this difficult item, as only those with high ability will take the item. According to Bayes' Theorem the probability of high ability people passing the item must be much higher than the probability of low ability people passing the item if 90 per cent of those taking the item have low ability. Once the group taking the item has a base rate of 50 per cent (as is the case in the sequential method), then the item should work better--i.e., increase the number of correct clinical decisions.

In the sequential test, those groups which are different in ability would use items at the 50 per cent level of difficulty for that group. This would allow the use of difficult items which are precise. Such items could not be efficiently used in a cumulative test.

II. CONTROL OF THE SCORE DISTRIBUTION

The problem of score distribution is not only to assign a specified number of individuals to each score value, but also to assign like individuals to each score value. The score distribution is not only related to the item parameters, but should also be related to the use. The score distribution problem may be studied through the use of a theoretical model or empirically.

Lord attempted to study the problem of control of score distribution through the use of a theoretical model.
He made the following assumptions: (1) the item characteristic curves have the general shape typical of cognitive items that are not answered correctly by guessing; (2) the items are homogeneous in a certain specified sense; (3) the items are scored 0 or 1; and (4) the raw test score is the number of items answered correctly.42 (A homogeneous test is, for Lord's purpose, defined as a test composed of items such that, within any group of examinees all of whom are at the same ability level, the response given to any item is statistically independent of the response given to the remaining items.)

The generalizations reached by Lord were as follows:43

42 Ibid., p. 546.
43 Ibid., pp. 541-542.

1. Since the test characteristic curve is in general nonlinear, the test score distribution will not in general have the same shape as the distribution of ability; in particular, if the ability distribution is normal, the score distribution in general will not be strictly normal.

2. U-shaped and roughly rectangular score distributions can be produced provided sufficiently discriminating test items can be found. (All appropriate individuals pass or all appropriate individuals fail an item if they are perfect items at the 50 per cent level of difficulty.)

3. Typically, if a test is at the appropriate difficulty level for the group tested, the more discriminating the test, the more platykurtic the score distribution.

4. The skewness of the test score distribution typically tends in a positive direction as the test difficulty is increased above the level appropriate for the group tested; in a negative direction as the test difficulty is decreased below that level.

These generalizations aid in interpreting the empirical results of a study made by Mollenkopf.44 He selected 1000 answer sheets chosen on the bases that: (a) every person must have attempted every item, and (b) a wide range of scores should exist in the sample chosen.
Items were then chosen to make up nine synthetic tests. These nine tests contained score distributions with three types of kurtosis and three types of skewness. A study of the literature revealed that the total test score distributions were believed to be controlled for skewness by item difficulty. However, since easy items tended to have higher correlations with the total score than did difficult items, control on mean difficulty alone was found not to be sufficient. When building a test with a symmetrical score distribution, Mollenkopf found that a set of items of the same type (all of difficulty close to .50) yielded scores with a definitely flat distribution. (From Lord's work, it looks as though the item precision must have been very good.) To secure a leptokurtic score distribution Mollenkopf tried sets of items with .40 and .60 difficulties, but found that homogeneous sets of items of .20 and .80 difficulties were needed.

44 William G. Mollenkopf, "Variation of the Standard Error of Measurement," Psychometrika, 14:189-229, No. 3, September, 1949.

If one uses Lord's work to translate back from score distribution (by assumed highly precise items) to ability level, one can determine that the distribution of ability must have been near normal. Also of interest in the Mollenkopf article is the fact that the standard error of measurement for a nonskewed platykurtic distribution of scores is greatest in the middle sections and lowest at the extremes. This may be accounted for by what Mollenkopf has labelled the "end effect."45 This effect means that at the ends large differences in parallel forms cannot occur. A perfect score is perfect in each half. Small empirically observed errors of measurement are inevitable in the tail where the pile-up occurs on skewed distributions but not for normal distributions.
This explanation would suggest that the variance of ability levels for a given test score may be small, but it does not indicate, as Mollenkopf also pointed out, that there is a small variance of scores for a given ability level. Both points are of interest if reflection of the ability distribution is desired in the score distribution.

The cumulative test can be used to yield the type of score distribution that one wishes. The important parameters are item difficulty and item precision, but only general statements are available as to the relationship between these parameters and the score distribution. Empirical studies are used to determine exact parameter values for given score distributions.

Humphreys stated that the variance of item difficulties forces scores toward the center of the distribution and thus counters the effect of high item intercorrelations.46 It is thus necessary to have a spread of difficulties only if the items are very precise. Whereas very highly intercorrelated items of one difficulty level would produce two scores, if one were to use a spread, one could force people into a distribution that would be expected to have some validity. Humphreys advocated that the shape of the score distribution be controlled by the difficulty level of the test items.47 The type of distribution favored by Humphreys was a rectangular distribution--a distribution that would allow individuals to be ranked.

46 Humphreys, op. cit., p. 474.
47 Ibid., p. 475.

If the items were perfect, the procedure to produce the rectangular distribution desired by Humphreys would be as reported by Davis. Davis reported that if the tetrachoric item intercorrelations are all unity, a rectangular distribution of raw scores is most likely to be obtained by selecting items with difficulty levels of 1/(n + 1), 2/(n + 1), 3/(n + 1), . . . n/(n + 1).
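The limiting case described by Davis can be checked directly. The following is a minimal sketch, assuming perfectly intercorrelated items so that anyone who passes a harder item also passes every easier one; the function name is illustrative, and "difficulty level" is read here as the proportion of the group passing the item.

```python
from fractions import Fraction


def score_distribution(pass_rates):
    """Exact raw-score distribution for perfectly intercorrelated items.

    With perfect items the proportion scoring at least s equals the pass
    rate of the s-th easiest item, so P(score = s) is the difference
    between successive pass rates.
    """
    at_least = [Fraction(1)] + sorted(pass_rates, reverse=True) + [Fraction(0)]
    return [at_least[s] - at_least[s + 1] for s in range(len(pass_rates) + 1)]


n = 6
spread = [Fraction(k, n + 1) for k in range(1, n + 1)]   # 1/7, 2/7, ..., 6/7
one_level = [Fraction(1, 2)] * n                         # all items at 50 per cent

print(score_distribution(spread))      # equal proportion at each score 0..6
print(score_distribution(one_level))   # only the scores 0 and 6 occur
```

Under this assumption the spread of difficulties gives each of the seven possible scores to one-seventh of the group, while six perfect items at one difficulty level produce only two scores, which agrees with the statement attributed to Humphreys above.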
However, if the tetrachoric intercorrelations are all .50, a rectangular distribution of raw scores is most likely to be obtained by selecting all items at the 50 per cent level of difficulty.48 He argued that for any level of tetrachoric item intercorrelations from zero to .50, the maximum number of discriminations that could be made by the total score would be insured by selecting all items at the 50 per cent level of difficulty. Davis went on to say that the simple mathematical procedure employed to specify the exact difficulty levels of items for two- and three-item tests cannot be applied to specifying the exact difficulty levels of items for tests containing larger numbers of items except in the limiting case when the item intercorrelations are all unity. The reason one cannot generalize is that when intercorrelations are not unity, errors in classification will be made, and the spread of ability represented by those who pass or fail will be greater but undetermined. Thus, the appropriate difficulty for the resulting group cannot be easily determined. The effect of errors is difficult to determine, but as pointed out by Davis, there is need for a general solution.

48 Davis, op. cit., p. 103.

Whereas the general rules about control of score distribution are known, there is no general solution in the sense that the actual score distributions are known. The actual score distributions must be empirically determined for each test. The literature indicates that if the sequential method of testing could more easily and predictably control the score distribution, a real contribution would be made to the solution of a difficult measurement problem.

III. MEANING AND USE OF SCORE PRODUCED

Both the score distribution and the meaning of a score are related to the use of the test.
Ferguson has pointed out that for discrimination between two groups one would need a bimodal distribution of scores; the discrimination between two groups and among the members of one group would require an asymmetrical distribution of scores; and, if one were establishing the order of ability of individuals, one would use a rectangular distribution. Ferguson concluded that the construction of tests to yield distributions approximating the normal form results in a loss of discriminatory capacity.49

49 George A. Ferguson, "On the Theory of Test Discrimination," Psychometrika, 14:61-68, No. 1, March, 1949, p. 68.

Not all scores have the same meaning. A score resulting from the discrimination between two groups is more a probability statement that the individual should be classified into a given category than it is a statement that the individual's ability is at a certain level. The score from a test designed to rank individuals compares any individual in relation to others.

In addition to the meanings necessary for the above uses, Gulliksen (as stated in the first section) would have the score be the best estimate of the difficulty level reached.50 This type of score represents the "true ability" level of the individual. This type of score is also advocated by those who argue for reproducibility as a measure of the best test. However, it should be noted that it has been the practice to determine how well a pattern of responses from an instrument will reproduce original results, not hypothesized "true" results. As reported by White and Saltz, these indices will reflect without equivocation the amount of information thrown away by representing the subject's performance on the test by a total score based on the number of items passed. "They indicate, in other words, how adequately a unidimensional model fits the obtained data."51

50 Gulliksen, op. cit.
51 Benjamin W. White and Eli Saltz, "Measurement of Reproducibility," Psychological Bulletin, 54:81-99, No. 2, March, 1957, p. 95.

However, a reproducibility score from a unidimensional test does not insure either an interval scale or a known behavior domain being sampled. Individuals may be ranked by the test scores (compared to other individuals) or be assigned an ability level (compared to a standard). The behavior domain may be related to the test label or it may not--the only assurance one has is that the domain is unidimensional.

The question as to domain samples (which seems like a validity question) has actually been studied as a part of reliability. Tryon in theory related reliability to the behavior domain sampled.52 He reviewed the two theories of test reliability: (1) the Spearman-Yule theory that tests are unreliable because of an error factor and reliable because of a true factor which may be a composite of more than one common factor; and (2) the Brown-Kelley theory that reliability may be explained by equivalent test-samples in which all items in the total score have equal standard deviations and equal intercorrelations. (To obtain equivalent test-samples the content and difficulty of items must be considered, but all items do not have to be equally difficult.)

52 Robert C. Tryon, "Reliability and Behavior Domain Validity: Reformulation and Historical Critique," Psychological Bulletin, 54:229-249, No. 3, May, 1957.

Tryon defined reliability as the value of "correlation, r_tt, between the observed X_t scores and a second set of composite scores, X_t', earned on a 'comparable form' of the X_t composite."53 (A comparable X_t composite is one in which the n test-samples vary on the average as much in standard deviations and intercorrelations as do the n test-samples in the observed X_t composite.) If this definition of reliability is used, a reliable test is one that indicates how well the individual knew the domain or how he ranked with others in his knowledge of the domain.
At least the domain sampled by the score is known and can be made part of the meaning of the score.

53 Ibid., p. 230.

The literature reviewed to this point would indicate that the score (1) may be a function of difficulty which probably reflects the ability level of the individual, (2) may represent a pattern as to content, or (3) may indicate how well the individual did on the samples of the domain that the test is hypothesized to sample. Reliability measures may be a factor in determining what meaning can be assigned to the score, but there are still contributions coming from content and from difficulty.

Swineford examined the importance of the difficulty of the item as a factor in the score assigned to the individual. Swineford has shown that only if the items are quite precise and intercorrelated is the difficulty of the item an important factor in the score of an individual. Swineford used present-day tests and attempted to measure the impact of variability of item difficulty and item-item correlation.54 The variability of item difficulty was designated σ_Δ, Δ being the normal-curve deviate (for a distribution with mean of 13 and standard deviation of 4) above which lies the area under the curve equal to the proportion of successful examinees. For a measure of inter-item correlation Swineford used the reciprocal of the square of the mean of the item-total correlation.

The results of Swineford's study showed that when the score was the number correct, the best formula for predicting this score was as follows:

\[ z_1 = .1530\, z_3 + .8649\, z_4 \]

where z_1 is the predicted standard score on the test, z_3 is the measure of the spread of item difficulties in standard score form, and z_4 is the inter-item correlation measure in standard score form. R_1.34 = .9648 for this formula.

When the score was the number right minus k times the number wrong the results were as follows:

\[ z_1 = .2117\, z_3 + .9222\, z_4 \]

and R_1.34 was .9642. The symbols are the same as above.
As can be seen from these formulas, the contribution of spread of item difficulties in the usual cumulative test is not great.

54 Frances Swineford, "Some Relations Between Test Scores and Item Statistics," Journal of Educational Psychology, 50:26-30, No. 1, February, 1959.

Another way of looking at the contribution of item difficulty spread is to specify the spread and inter-item correlation, and then examine the standard deviation of test scores. Swineford used (n - chance)/σ_t, included in the one axis of her chart, where values range from 5.8 to 3.0 for the highest (.50) r_bis, and from 14.8 to 11.9 for the lowest (.20) r_bis. The mean r_bis is .36, the highest r_bis (.50) is .70 sigma units away from the mean, and the lowest r_bis (.20) is 3.15 sigma units away from the mean. Thus, while the values of σ_Δ may be considered to be close to normally distributed and likely to be encountered in the usual cumulative test, the values for r_bis are not normally distributed. We might conclude that if r_bis were normally distributed, then higher values of r_bis might appropriately be investigated. A standard deviation unit on σ_Δ would indicate that today most tests do use items centered around the mean difficulty level, but that the reliability of items has a larger range. If one examines ± .70 sigma units of r_bis, one has about a three point change in (n - chance)/σ_t values, which is about the same change encountered from ± 3.0 sigma units of σ_Δ. This supports the conclusion that conventional cumulative tests do not use difficulty as a major factor in the score; the score is a conglomerate of difficulties and other factors.

The literature indicates that the cumulative test may be constructed to measure a single factor but that the attention of the test constructors has not been directed toward reporting the decisions made as to the meaning of the score.
If one remains concerned with traditional operational definitions of reliability and validity, one may forget the construct operationalized and not change the construct when it needs to be changed.

The sequential test procedure developed in this dissertation will use reflection of true ability as the meaning of a score. The literature indicates that this is only one of the many meanings that could be assigned to a score.

IV. SEQUENTIAL TESTING PROCEDURES

The literature indicates that there are many choices as to the use of the sequential testing procedure. The sequential process may be used (1) to quickly determine the score to be assigned to good and poor students; (2) to determine to which of two categories the individual should probably be assigned, if assigned at all; or (3) to classify each individual as well as possible in the time allowed. The sequential analysis developed by Wald would be most applicable to the second purpose, but this method has been modified by Cowden to serve the first purpose.

Cowden has indicated that when an examination is given to a student it sometimes happens that not enough questions are asked to permit a fair evaluation of his knowledge and ability.56 On the other hand the examination is sometimes drawn out longer than is necessary. If a student is very good or very poor, only a few questions may be needed to establish this fact beyond reasonable doubt; but borderline students need to be examined at considerable length before deciding whether they should be passed or failed. If sequential testing is used, the fate of good students and of poor students tends to be quickly determined, but mediocre students must continue with the examination until the results give adequate grounds for a decision. By use of the sequential method the number of questions answered by a student is reduced to a minimum, and at the same time the probability of passing a poor student or failing a good student is controlled.
Cowden graded his students in a small class in elementary statistics at the University of North Carolina. Using D1 (decision number 1) to indicate the number of questions that could be missed and still permit a student to pass, D2 (decision number 2) to indicate the number of questions that must be answered incorrectly before a student is failed, and N to indicate the cumulative number of questions answered, the two linear equations used to make the decision follow:57

\[ D_1 = a_1 + bN \]
\[ D_2 = a_2 + bN \]

56 Dudley J. Cowden, "An Application of Sequential Sampling to Testing Students," Journal of the American Statistical Association, 41:547-556, No. 236, December, 1946, p. 548.
57 Ibid., pp. 548-549.

As can be seen, the straight lines representing these two equations are parallel and differ only as to the constants a1 and a2. These constants a1 and a2 are shown to depend on the values of p1, p2, α, and β when: "p1" is defined as the maximum proportion of errors in all possible questions of a given type made by a student who is definitely good; "p2" is defined as the minimum proportion of errors in all possible questions of a given type made by a student who is definitely poor; "α" is defined as the probability of failing a good student; and "β" is defined as the probability of passing a poor student. The more widely p1 and p2 differ the closer together the lines will be, and, therefore, the more quickly will a decision be reached. The larger the values of α and β the smaller will be the value of a2 and the larger (algebraically) will be the value of a1. Therefore to bring the two lines closer together one must increase α and/or β. The value of a1 is always negative, since answering all questions correctly does not strongly indicate knowledge of the subject until a reasonable number of questions is answered (what is a reasonable number depends on the value adopted for β, becoming larger as β is made smaller).
On the other hand, a2 is always positive, but a decision to fail cannot be reached until D2 = N, since a student cannot miss more questions than he answers. When α = β, a2 = -a1. The slope b is independent of α and β, but depends exclusively on p1 and p2. Cowden gives the following formulas:58

\[ g_1 = \log\frac{p_2}{p_1}, \qquad g_2 = \log\frac{1 - p_1}{1 - p_2} \]
\[ h_1 = \frac{\log\dfrac{1 - \alpha}{\beta}}{g_1 + g_2}, \qquad h_2 = \frac{\log\dfrac{1 - \beta}{\alpha}}{g_1 + g_2} \]
\[ a_1 = -h_1, \qquad a_2 = h_2, \qquad b = \frac{g_2}{g_1 + g_2} \]

58 Ibid., p. 551.

Cowden thus develops two lines for pass, fail, and indeterminate, but has grades for six categories based on the following decisions:59

After 20 questions if a student made errors in less than 10 percent of the questions, the grade of "A" was assigned; if 55 per cent or more of the questions were answered incorrectly, the grade of "F" was assigned; if the percent of incorrect questions was between these percentage values then testing was continued. After 40 questions if a student (not classified before) made errors in less than 22.5 percent of the questions the grade of "B" was assigned; or if more than 45 percent of the 40 questions were incorrect, the grade of "F" was assigned. Similar decisions were made after 60, 80, 100, 200, and 1,000 questions. After 1,000 questions those students not already classified were assigned "D" or "E" grades. Those individuals having errors in less than 34.89 percent of the questions were assigned "D" and those students having errors in more than 35.3 percent of the questions were assigned a grade of "E."

59 Ibid., p. 552.

Sequential testing is thus changed to allow using more than three categories by changing the number of items that are used to make the decision. Estimates of the size of the number of items can be obtained by the following formulas:60

\[ \bar{N} = \frac{h_2}{1 - b} \]
\[ \bar{N}_{p_1} = \frac{(1 - \alpha)\,h_1 - \alpha\,h_2}{b - p_1}, \qquad \bar{N}_{p_2} = \frac{(1 - \beta)\,h_2 - \beta\,h_1}{p_2 - b} \]

Cowden found that it took 13.5 items before it was possible to decide that the student should pass.
This is due to a random sample of items assumed in the sequential process. It therefore seems worthwhile to investigate a purposeful sample of items instead of a random sample even though the mathematics has not been worked out for this type of test.

60 Ibid., p. 553.

To use the model developed by Wald, one must first state the probability of type I and type II errors that one will accept (as to a given alternative) and then continue until one satisfies the conditions of the mathematical model with probabilities.61 The procedure may be used to decide upon pass or fail categories as was done by Moonan; or modified by making assumptions about the number of items needed to make the decision as done by Cowden; or an individual may wait for the mathematics of the multiple decision (or other modification) to be completed and reported as Wald indicates might be done in his book on sequential analysis.62

61 Wald, Sequential Analysis, op. cit.
62 Ibid., pp. 138-150.

The sequential procedure developed by Wald for a "most powerful" test is built upon the assumption that one may continue to sample the same universe. The procedure determines what decision is best after every sample and states whether one has attained the desired degree of probability (of being correct).

It is not necessary to follow the lead of Cowden and Moonan and, therefore, use a random sample of items. It is known that certain items of different difficulties will give more information about an individual than other items, and this information should be used: this means that one does not wish to sample from the same universe of items each time. While the aptitude or ability being tested must remain unidimensional, there may be great advantage in allowing the difficulty of items to change. The sequential model herein described thus departs from the Wald sequential model in that it uses different difficulty levels so that fewer items are needed for the decision.
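The two parallel decision lines described above for Cowden's procedure can be sketched as follows. This is a minimal illustration, not Cowden's own computation: it assumes the constants follow the standard binomial form of Wald's sequential test, and the values of p1, p2, α, and β in the usage example are hypothetical, not the ones Cowden used.

```python
from math import log


def decision_lines(p1, p2, alpha, beta):
    """Return (a1, a2, b) for the lines D1 = a1 + b*N and D2 = a2 + b*N.

    Assumes the standard binomial sequential-test constants:
    p1, p2 are the error proportions of a definitely good and a definitely
    poor student; alpha is the risk of failing a good student and beta the
    risk of passing a poor one.
    """
    g1 = log(p2 / p1)
    g2 = log((1 - p1) / (1 - p2))
    h1 = log((1 - alpha) / beta) / (g1 + g2)
    h2 = log((1 - beta) / alpha) / (g1 + g2)
    b = g2 / (g1 + g2)
    return -h1, h2, b


def decide(errors, answered, a1, a2, b):
    """Classify a student after `answered` questions with `errors` misses."""
    if errors <= a1 + b * answered:
        return "pass"
    if errors >= a2 + b * answered:
        return "fail"
    return "continue"


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for illustration.
    a1, a2, b = decision_lines(p1=0.10, p2=0.30, alpha=0.05, beta=0.05)
    for errors, answered in [(0, 1), (0, 20), (5, 5)]:
        print(errors, answered, decide(errors, answered, a1, a2, b))
```

With equal risks the intercepts are equal in size and opposite in sign, and the common slope lies between p1 and p2, which agrees with the properties of a1, a2, and b stated in the text.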
Fiske and Jones, in an article intended to introduce sequential analysis to psychologists, stated that the uncritical use of sequential analysis obviously is not recommended.63 It is a design which can have advantages when one or more of the following conditions actually holds: (a) the problem involves the choice between two possible parameter values which can be specified on a priori but not arbitrary grounds--the null hypothesis will usually be one of the two; (b) the data are such that the cost per datum is high and economy is desired; and (c) the total amount of data is not fixed.

63 Donald W. Fiske and Lyle V. Jones, "Sequential Analysis in Psychological Research," Psychological Bulletin, 51:264-275, No. 3, May, 1954, pp. 273-274.

Such criteria would lead one to believe that the sequential model developed by Wald may not be the appropriate model for the test situation, as the total amount of data is fixed and one cannot afford to have 1,000 items as indicated by Cowden. It may be no more expensive to acquire the data from all candidates than from a few, unless one wishes to select only rather than classify. The decision to accept or not accept--the selection question--seems to be the most appropriate decision which can be answered by the sequential method as described by Wald.

The literature also indicates methods of presenting the material to the testee. Some of these are noted here. Glaser, Damrin, and Gardner constructed a tab item test to aid in training of electronics specialists.64 In this test, the performance on one test yields information which supplies a cue for the selection of the next test and subsequent procedures. One "tab item" test, for example, had the trainee read a description of the malfunction of a television set and then, rather than actually performing various checking procedures, the trainee pulled the tabs of those checks he would make if he were actually trouble shooting a real television set. Whenever he pulled a tab he uncovered the information he would have obtained if he actually had performed that check on a real set.

64 Robert Glaser, Dora E. Damrin, and Floyd M. Gardner, "The Tab Item: A Technique for the Measurement of Proficiency in Diagnostic Problem Solving Tasks," Educational and Psychological Measurement, 14:283-93, No. 2, Summer, 1954.

Another method of presentation was used by Krathwohl and Paterson in preliminary studies of the sequential test model. They had directions printed on the page, covered these with a transparent hard finish ink so that directions could not be erased, then covered this in turn with strips of opaque ink. The testee erased the strip of opaque ink under the letter he considered to be related to the correct answer. (This is similar to an IBM answer sheet, but instead of marking a spot, the testee erases a spot.) The appropriate directions were thus made available to the student.

Teaching machine presentations are also obvious methods to present material to the testee. The material is similar to that presented by teaching machines, but in the sequential model being developed in this paper, the individual does not obtain information about the correctness or the reason for the correctness or incorrectness of the response. However, the individual is told to take a more difficult item if he correctly answered the preceding item, or a less difficult item if he incorrectly answered the preceding item.

The literature suggests that if the decision is to best classify the individual by a sequential procedure, the present sequential model may be better than past models which have been developed from different assumptions and for different problems. The literature also suggests that traditional scores represent more than one meaning.
The present sequential model has used reflection of input in the output as the proper meaning for a score; the cumulative test should not perform this function as well as the sequential test. The decision as to how to measure the efficiency of these tests (and indirectly the items) was then related to the reflection of input in the output. The two factors considered in the output were (1) the means and variances of ability levels assigned to a score (precision of score) and (2) the means and variances of scores assigned to an ability level category (discrimination of test).

It should be noted that the decisions as to the type of score distribution desired and the meaning that should be assigned to a score had to be made before one could determine the efficiency of the test (or items). The decisions made in the present study were those decisions which it was hoped would favor the sequential test procedure.

There should be maximally efficient use of items in the sequential method as (1) there is a separation of individuals into groups which have a base rate of 50 per cent for the items used, and (2) the use of items at the 50 per cent level of difficulty for the subgroups permits the use of more difficult items and makes better separation of these individuals (as the item is at the 50 per cent level of difficulty for the subgroup).

CHAPTER III

PROCEDURES

There are six sections to this chapter. First, the actual construction of the six-item cumulative and the six-item sequential test model is considered. The second section outlines the method of evaluating the hypotheses stated in Chapter I which relate to the effect of input distributions. The third and fourth sections show the methods for testing the hypotheses about item precision and difficulty, and effect of errors of estimating a parameter, respectively--both for the sequential model. Fifth, some general comparisons between test score distribution and ability level distribution are examined.
And finally, a summary of procedures and hypotheses is presented.

I. TEST MODEL CONSTRUCTION

This section deals with the construction of six-item sequential and cumulative test models. Later these test models are used with different inputs of ability and the type of score output is examined.

The test model for the sequential and cumulative tests assumed that the probability of passing an item was dependent upon three factors: (1) the ability level of the individual, (2) the precision of the item, and (3) the difficulty level of the item. The assumption was made that no one passed by randomly guessing the correct answer to the item.

The ability level of the individual was specified in terms of standard score units for a normalized distribution of ability. The precision of the item was specified in terms of either r_bis or σ_d. These two terms are related by the following formulas:1

\[ \sigma_d = \frac{\sqrt{1 - r_{bis}^2}}{r_{bis}} \qquad (1) \]

or by algebraic manipulation:

\[ r_{bis} = \frac{1}{\sqrt{1 + \sigma_d^2}} \qquad (2) \]

As can be seen from the second formula, r_bis is equal to one if σ_d is equal to zero. The smaller the σ_d value the more precise the item, and if σ_d were equal to zero, the individuals who had ability levels above the difficulty level of the item would pass the item, and vice versa. The difficulty of the item was expressed in terms of standard score units for a normal population. It need be remembered that 80 or 90 per cent of a select group could pass (or fail) a 50 per cent difficulty item.

1 Frederic M. Lord, "Some Perspectives on 'The Attenuation Paradox in Test Theory'," Psychological Bulletin, 52:505-10, No. 6, November, 1955, p. 506.
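Formulas (1) and (2) can be checked numerically. The short sketch below converts between the two precision measures; the function names are hypothetical and are introduced only for illustration.

```python
from math import sqrt


def sigma_d_from_rbis(r_bis):
    """Formula (1): sigma_d = sqrt(1 - r_bis**2) / r_bis, for 0 < r_bis <= 1."""
    return sqrt(1.0 - r_bis * r_bis) / r_bis


def rbis_from_sigma_d(sigma_d):
    """Formula (2): r_bis = 1 / sqrt(1 + sigma_d**2)."""
    return 1.0 / sqrt(1.0 + sigma_d * sigma_d)


if __name__ == "__main__":
    for r in (0.20, 0.36, 0.50, 0.75, 1.00):
        s = sigma_d_from_rbis(r)
        print(f"r_bis = {r:.2f}  sigma_d = {s:.3f}  back = {rbis_from_sigma_d(s):.2f}")
```

An item-total biserial correlation of .75 corresponds to σ_d of about .882, the precision value used for the sequential model in this chapter, and a perfectly precise item (r_bis = 1) gives σ_d = 0.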
The probability of passing a single item for a given small segment of ability was computed by determining the area under the normal curve from -∞ to the value (a - d)/σ_d, where "a" is equal to the ability level of the individual in standard score or sigma units, "d" is equal to difficulty level in standard score or sigma units, and "σ_d" is the measure of precision described above.

The probability of passing a sequence of items for both the sequential and the cumulative was determined by multiplying the probabilities of passing each item in that sequence. This assumed that for that small segment of ability (for which the probability of passing an item was determined), performance on any one item was experimentally independent of performance on any other item. Since the concern was with classifying people by ability, it was assumed that each of these items measured only one factor other than the error factor, i.e., the test was unidimensional. The error factor on any one item was assumed to be independent of the error on any other item.

Using the above scheme, one six-item sequential test model was constructed for a hypothetical population of 1500 individuals with 100 people at each of 15 ability levels as shown in Table 24. The item precision for all items in this model was arbitrarily set at σ_d = .882. The appropriate difficulties were determined by the following procedure. First, the number of people at each of the 15 ability levels who would pass or fail an item was computed. The value of the sum of the deviations from the mean squared for each of the ability scores was computed for the pass and fail groups. This value was computed and graphed for different trial values of difficulty until the difficulty level was found for which the sum of all sets of deviations of ability level about the mean ability level for the entire group was a minimum.
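The probability model and the trial-and-error search for a difficulty level described above can be sketched in a few functions. This is only an illustration under stated assumptions: the 15 ability levels are taken as equally spaced from -1.75 to 1.75 (the actual levels are those of Table 24), the routing rule that halves the step in difficulty at every stage is hypothetical, and the search is shown for the first item only.

```python
from itertools import product
from statistics import NormalDist

SIGMA_D = 0.882  # item precision used in the text (r_bis = .75)
_NORMAL = NormalDist()


def p_pass(ability, difficulty, sigma_d=SIGMA_D):
    """Area under the normal curve from -inf to (a - d) / sigma_d."""
    return _NORMAL.cdf((ability - difficulty) / sigma_d)


def sequence_probabilities(ability, n_items=6, first_step=1.0):
    """Probability of each of the 2**n pass/fail sequences at one ability level.

    Hypothetical routing: start at difficulty 0, move up after a pass and
    down after a fail, halving the step at every stage.
    """
    result = {}
    for pattern in product((True, False), repeat=n_items):
        difficulty, step, prob = 0.0, first_step, 1.0
        for passed in pattern:
            p = p_pass(ability, difficulty)
            prob *= p if passed else 1.0 - p
            difficulty += step if passed else -step
            step /= 2.0
        result[pattern] = prob
    return result


def within_group_ss(difficulty, abilities, counts, sigma_d=SIGMA_D):
    """Squared ability deviations about the pass-group and fail-group means."""
    total = 0.0
    for group in ("pass", "fail"):
        n = sx = sxx = 0.0
        for a, c in zip(abilities, counts):
            p = p_pass(a, difficulty, sigma_d)
            w = c * (p if group == "pass" else 1.0 - p)  # expected head count
            n += w
            sx += w * a
            sxx += w * a * a
        if n > 0:
            total += sxx - sx * sx / n
    return total


def best_difficulty(abilities, counts, trial_values):
    """Trial difficulty that minimizes the within-group sum of squares."""
    return min(trial_values, key=lambda d: within_group_ss(d, abilities, counts))


# Assumed stand-in for the rectangular population of Table 24.
ABILITIES = [-1.75 + 0.25 * i for i in range(15)]
COUNTS = [100] * 15
TRIALS = [round(-2.0 + 0.01 * i, 2) for i in range(401)]

if __name__ == "__main__":
    print(len(sequence_probabilities(0.5)))
    print(best_difficulty(ABILITIES, COUNTS, TRIALS))
```

The 64 sequence probabilities for any one ability level sum to one, because each stage splits the remaining probability between a pass and a fail. The same search would be repeated for each pass or fail subgroup to obtain the later items.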
Since ΣX² was a constant, the value for difficulty level was calculated by maximizing Σ(ΣX)²/N.

The difficulty level of the item taken by each group was not the same. For example, in Figure 1, both the group who passed the first item and failed the second item and the group who failed the first item and passed the second item take the same item at stage 3--a 0.00 item. If this had not been done, the six-item sequential test would require 63 different items.

It was decided to use the same item for those groups for which Σ(ΣX)²/N maximized at a difficulty level no more than .20 standard score units away in difficulty from each other. This allowed the test to be built with fewer items and thus any test built to correspond to the model could use only the most precise items in a pool of items. Also, this [...] have answered correctly. Both of these raw score distributions were converted to normalized "T" scores so that the two score distributions might be compared on an equivalent interval scale basis.

II. EFFECT OF SHAPE OF DISTRIBUTION OF ABILITY

It was hypothesized that the sequential test model constructed as described above should work well for any type of input distribution and thus be better than the six-item cumulative test model. The six-item cumulative test constructed with all items at 50 per cent level of difficulty was not expected to be effective for those distributions which had many high ability individuals. It was hypothesized from the literature that these individuals would need more difficult items to discriminate among them. To test this hypothesis different ability distributions were used as input.

The difficulty levels of the items used in the sequential test model were determined according to the method described in the last section, and were for a precision level of an r_bis item-total correlation of .75 (or σ_d = .882).
A precision of .75 was used because differences between the six-item cumulative and sequential models should be greatest at high levels of precision--.75 would be considered very high by the standards in use. Few tests have an average item-total correlation of .75. A rectangular input distribution [...] models--the sequential and the cumulative--were each used with a normal and a U-shaped distribution of ability to make a total of four tests. These four tests were constructed with an electronic computer. (For both the normal and the U-shaped distributions the individuals were assumed to be distributed over 15 input categories. Since the values used in the computer program were proportions at each category, any number of individuals may be assumed. The most common assumption made in interpreting this data is that there were 1000 individuals distributed over these 15 input categories.) The item difficulties for the sequential models were the ones computed above. The item difficulties for the two cumulative models were all at the 50 per cent level. It was thus possible to compare not only the sequential with the cumulative models, but also the effect of an input of normal and U-shaped distributions.

Effect of Normal Distribution

The effect of an input of a normal distribution of ability on the output distribution was examined in several ways, but before the examination of hypotheses related to the [...] in the end categories. This was done because spreading these individuals over the middle of the distribution would have underrepresented the number of people likely to be at extreme values in ability. Using the above procedure the end categories extended from ±1.612 to ±1.736 sigma units.
Since there were so few to consider at the levels beyond ±1.736 sigma units, these individuals were all considered to be at the mean ability level for all people beyond ±1.612: that is, at 1.942 sigma units (see Figure 2).

To test the hypothesis that the cumulative and sequential test models have equal ability to classify individuals of mean ability level, the means and variances of comparable normalized scores from the six-item cumulative and "least squares" sequential test models for those 100 individuals assumed to be in category eight of ability (the middle category) were tested for significance of difference. The means were tested by use of a "t" test and the variances by use of an F ratio.

To test the hypothesis that the "least squares" sequential test model should more accurately classify the few individuals at the extremes of the ability scale than the six-item cumulative model, the means and variances of comparable normalized test scores for the 84 individuals in ability categories 14 and 15 were tested. (Testing of the [...]

[Figure 2]

[...] the extremes than the sequential test model, the means and variances of ability level scores for the individuals ranked in the top 8.4 per cent of the score distribution for each test model were tested. When it was necessary to take only a proportion of a score group to complete the top 8.4 per cent of scores, then the ability levels were proportionately sampled. The value of 8.4 per cent was selected because there were 84 individuals in the top two input ability levels of the hypothetical population of 1000 individuals.

It was hypothesized that the six-item cumulative test model would produce scores representing finer ability units in the middle than at the extreme score values, while the sequential test would more nearly reflect the ability scale.
[...] hypothesized to be smaller in the middle and greater as extremes were approached. These differences in mean ability values for the adjacent scores in one-half of the symmetrical score distribution are shown in Tables 5 and 6. In addition to this, the differences between mean normalized "T" scores for each adjacent ability level for both the sequential and cumulative tests are shown in Table 4.

Effect of U-shaped Distribution

The effect of the U-shaped distribution of ability was studied by the same procedures used with the normal distribution of ability. The distribution used in these tests is the one shown in Figure 2. To determine if the "least squares" sequential test would more accurately classify individuals at the mean of the absolute ability levels than would the six-item cumulative test, the means and variances of normalized scores assigned to category thirteen were tested for significance of difference between scores assigned by the six-item sequential and the six-item cumulative models. Category 13 was selected as it included the mean value of ability for those individuals in the top half of the ability [...]

[...] extreme values of the ability distribution than would the six-item cumulative test, the means and variances of normalized scores assigned to category 15 individuals were compared for the sequential and cumulative test models.

To test the hypothesis that the cumulative test model would [...] distribution were examined for differences in means and variances of ability level. These top-scoring individuals were proportionately selected as stated for the normal distribution. The top 13.5 per cent of the score distribution was used as there were 13.5 per cent of the individuals in the top ability category.
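The proportionate selection of top-scoring individuals described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the data layout (a list of score groups, each with counts by ability category), and the example counts are hypothetical and are not taken from the thesis.

```python
def top_fraction_by_ability(score_groups, fraction):
    """Return ability-category counts for the top `fraction` of scorers.

    score_groups: list of (score, {ability_category: count}) pairs
    (hypothetical layout). Whole score groups are taken from the highest
    score downward; the last group needed to fill the quota is sampled
    proportionately across its ability categories.
    """
    total = sum(sum(counts.values()) for _, counts in score_groups)
    quota = fraction * total
    selected = {}
    for _, counts in sorted(score_groups, key=lambda g: g[0], reverse=True):
        if quota <= 0:
            break
        group_n = sum(counts.values())
        if group_n == 0:
            continue
        # share is 1.0 when the whole group fits, otherwise the fraction needed
        share = min(1.0, quota / group_n)
        for category, n in counts.items():
            selected[category] = selected.get(category, 0.0) + share * n
        quota -= share * group_n
    return selected


# Hypothetical population of 1000: the top 8.4 per cent is 84 individuals.
example = [
    (6, {15: 40, 14: 20}),
    (5, {15: 10, 14: 30, 13: 20}),
    (4, {13: 880}),
]
top = top_fraction_by_ability(example, 0.084)
```

In this made-up example the highest score group (60 individuals) is taken whole, and the remaining 24 places are filled by taking the same fraction of every ability category in the next score group.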
To determine if the middle ability level was more finely classified by the six-item cumulative test, the mean normalized "T" score for each ability level was determined and shown in Table 24. The same was done for the sequential test model. The hypothesis was that the sequential model should have approximately equal distances between test score means for each of the ability categories, while the six-item cumulative model would have larger differences in mean test scores for the middle ability levels than for extreme values. The differences in mean score values for adjacent ability levels are shown in Table 4. The mean ability levels for each score are likewise shown in Tables 25 and 26.

It was hypothesized from Lawley's work that the extreme scores of the cumulative test should have lower variance of ability level than the extreme scores for the sequential test.² Since less variance of ability level means fewer lower ability individuals, it was assumed the extreme cumulative test scores would have higher mean values.

Effect of Ability Distributions for Additional Sequential Tests

In addition to the four tests described above, three other sequential tests were built with an electronic computer. However, in these tests the difficulties of the items were not determined by a "least squares" procedure, but used difficulties determined by an adaptation of Lord's work.³ The item difficulties used in these three tests were so selected that, it was hypothesized, depending on the particular selection, a normal, rectangular, and a U-shaped distribution of scores would be obtained. The number of individuals assigned to each score and mean ability level of these individuals are reported in Tables 18, 19, and 20.

² D. N. Lawley, "On Problems Connected with Item Selection and Test Construction," Proceedings of the Royal Society of Edinburgh, 61 (Section A, Part III): 273-287, 1942-1943.
It was assumed that a score from a test designed to output a rectangular score distribution should correlate highest with a rectangular input of ability. Scores with normal distribution should likewise correlate highest with the normal input of ability, and scores with U-shaped distribution should correlate highest with U-shaped input of ability. However, information was obtained as to the effect on both output distribution and the correlation values of changing the input distribution.

The rule stated by Lord was that if one wished to divide the group at a given point, then the item difficulty (expressed in standard score units) is represented by the item-total $r_{bis}$ times the standard score unit which represents the proportion below the point where the split is desired. The procedure followed in constructing these three tests was that if there were four different difficulties used at a given stage, then the abscissa should be divided into five equal ability segments. The difficulties necessary to produce these proportions were then computed from Lord's formula. One time the distribution of scores to be produced was considered normal; one time, rectangular; and one time, U-shaped. Since different proportions were to be selected for each distribution shape, different difficulties were needed for each.

³ Lord, "Some Perspectives on 'The Attenuation Paradox in Test Theory'," op. cit.

The rule used to determine the number of different difficulties at each stage was to add one more difficulty at each stage. It turned out that this rule gave results approximating the results from the determination of difficulties by the rules developed in the past section on "Test Model Construction."

Lord has shown how to select item difficulties to yield a desired split of individuals by a cumulative test.
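Lord's rule as stated above (item difficulty equals the item-total biserial correlation times the standard score below which the desired proportion falls) can be illustrated with a short sketch. The function name and the example proportions are hypothetical; the proportions actually used for each distribution shape are not given in this section.

```python
from statistics import NormalDist


def lord_difficulty(r_bis, proportion_below):
    """Item difficulty in standard score units under the rule stated above.

    r_bis: item-total biserial correlation (item precision).
    proportion_below: proportion of the group below the desired split point.
    """
    z = NormalDist().inv_cdf(proportion_below)  # standard score at the split
    return r_bis * z


# Hypothetical split points for a stage with four difficulties.
for p in (0.2, 0.4, 0.6, 0.8):
    print(p, round(lord_difficulty(0.75, p), 2))
```

Because the standard score is multiplied by a correlation below 1.00, the difficulty is regressed toward the mean, and the regression is greater for less precise items.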
These Lord difficulties assume an input of a normal distribution of ability; therefore, in the sequential test one should compute difficulties with a normal distribution of ability for each item of the test. This was not possible in the present sequential model. The differences in the difficulty levels of the items selected by Lord's technique and the above technique when an $r_{bis}$ = .75 is used are noted, but no study of the effect at other values of $r_{bis}$ was made.

III. ITEM PRECISION AND DIFFICULTY FOR THE SEQUENTIAL TEST

To determine the interrelationships among item precision, difficulty level, and output characteristics, five tests containing items of varying precision and difficulty were compared. The five tests were built in the electronic computer and varied in precision and difficulty of items used. The tests were built using Lord's rule in the selection of difficulties so that a normal distribution of scores should be obtained when the distributions of ability were normal. The five precision levels were for $r_{bis}$ equal to .79, .75, .71, .60, and .45. (The .75 precision test was the same as the one constructed above.) For an assumed N of 1000, the .79 and .71 values are one standard error of [...] above and below .75. The .60 value was selected as it is a value common in the literature; the .45 to show the effect of meeting low precision standards. The .79 precision level is not considered unrealistic if the spread of ability level is great.

Precision was hypothesized to be one of the most important parameters in the behavior of the sequential test model. To examine the hypothesis that the more precise items would produce a better separation of people, the variances of scores for category eight ability level (the middle ability level) individuals were compared for each of the five tests by use of Bartlett's test for homogeneity of variance.
This test was repeated for the combination of categories 14 and 15 (the most extreme categories) for the five test models. It was hypothesized that there would be a difference in the variance of scores, with the more precise items producing the scores with the smaller variances. Since a lower precision of items means that the effective difficulty level regresses toward the mean and, therefore, is closer to the 50 per cent level, the middle difficulty items should increase the precision of scores at the extremes, although not the ability to classify individuals. Thus, the extreme scores would have small variance of ability levels for both precise and less precise items and it was hypothesized that the variances of ability level scores would be most different at the middle score values.

The second hypothesis stated that a test consisting of more precise items would have the ability to discriminate evenly over the entire range of ability rather than making finer discriminations at the middle of the ability range. This hypothesis was tested by examining differences in the means of test scores for each category of ability. A table was made of the means and variances of test scores for each of the fifteen ability levels and for each of the five levels of precision. The discrimination index for adjacent ability levels was computed as suggested by Lord.⁴ The higher the index the better the discrimination; values may range from zero to infinity. Lord's discrimination index was computed as follows:

$$D = \frac{M_{s.c_0} - M_{s.c_1}}{\sigma^{*}}$$

where
$M_{s.c_0}$ = mean of score values for ability level $c_0$
$M_{s.c_1}$ = mean of score values for ability level $c_1$
$\sigma^{*}$ = some appropriate average of the standard deviations of the two score distributions

⁴ Lord, A Theory of Test Scores, op. cit., p. 24.

Lord stated that this discrimination index is completely independent of the distribution of ability in the group tested.⁵ This is an advantage when a general description of the test is desired without reference to any particular group of examinees; it is a disadvantage if the effective discrimination of the test for a specified group of examinees is desired.

⁵ Ibid.

IV. ERRORS IN SEQUENTIAL TEST PARAMETER ESTIMATES

The procedures used to determine the effects of errors in estimating the parameters of precision and difficulty for the sequential test items are related to the nature of the error involved. The difficulty of an item is usually specified in terms of the proportion of the group passing the item. This test model, however, uses difficulty specified in standard scores, so the standard error of a proportion must be translated into standard score terms. The standard error of a proportion ($\sqrt{PQ/N}$) is greatest when P = Q = .50. Thus the greatest error in estimating difficulty in terms of proportion passing an item would occur at the 50 per cent level of difficulty. The value of $\sqrt{PQ/N}$ is smallest at the extreme values of P or Q. The error in terms of proportion passing an item was thus investigated at .50 and .90. These errors were then translated into standard score units. The values of $\sqrt{PQ/N}$ (when N = 1000) were .016 and .010 for .50 and .90, respectively. When the values necessary to encompass two standard errors of the proportion were translated to standard score form the values were quite similar and equal to about ±.10. The error for estimating difficulty was thus assumed to be less than or equal to ±.10 no matter what the difficulty level of the item.
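The translation from the standard error of a proportion to standard score units can be checked with a short sketch. The function names are hypothetical, and the choice to report the larger of the upward and downward shifts is an assumption made here for illustration; the thesis does not say exactly how the translation was carried out.

```python
from math import sqrt
from statistics import NormalDist


def se_proportion(p, n):
    """Standard error of a proportion, sqrt(PQ/N)."""
    return sqrt(p * (1.0 - p) / n)


def z_shift_for_two_se(p, n):
    """Largest change in standard-score difficulty when the proportion
    passing moves two standard errors up or down (assumed reading)."""
    nd = NormalDist()
    se = se_proportion(p, n)
    z0 = nd.inv_cdf(p)
    up = abs(nd.inv_cdf(p + 2.0 * se) - z0)
    down = abs(nd.inv_cdf(p - 2.0 * se) - z0)
    return max(up, down)


for p in (0.50, 0.90):
    print(p, round(se_proportion(p, 1000), 4), round(z_shift_for_two_se(p, 1000), 2))
```

With N = 1000 the standard errors come out near .016 and .010, and the two-standard-error shifts in standard score units are of the order of .10, which is in line with the ±.10 bound adopted in the text.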
The error made in precision depends upon the estimate of $r_{bis}$, which has a sampling error as follows:⁶

$$\sigma_{r_{bis}} = \frac{\dfrac{\sqrt{PQ}}{z} - r_{bis}^{2}}{\sqrt{N}}$$

Terms as defined before. Thus for $r_{bis}$ equal to .75 (which was the only precision level for which the error was studied), and assuming P = Q = .50, and N = 1000; then $\sigma_{r_{bis}}$ = .02. Since the error in $r_{bis}$ is not likely to be greater than ±.04, then $r_{bis}$ of .75 is not likely to be outside of the interval of .71 to .79. The $\sigma_d$ value for .71 is .99 and the $\sigma_d$ value for .79 is .78. Thus the error in terms of $\sigma_d$ is not likely to be greater than ±.10. These estimates were the values used to determine the effect of parameter estimation on output.

⁶ Quinn McNemar, Psychological Statistics (second edition; New York: John Wiley and Sons, 1955), p. 194.

The testing of the first hypothesis as to the effect of errors of item difficulty was done with a normal distribution of ability; test items designed for $r_{bis}$ equal to .75; and by the least squares of deviations method described in "Test Model Construction." It was hypothesized that if one were to use at the second stage an item which was .40 more sigma units away from the mean than the items selected as above, then more people should be directed toward mean scores than if the ideal difficulty were used. This would imply fewer people at the extreme values than usual if the rest of the test did not correct this trend. It was hypothesized that the opposite should happen if the item were .40 sigma units toward the mean at the second stage. These changes were tested by use of the chi-square technique. If a difference of .40 did not make any difference it would seem obvious that errors of estimate (about .10) would not make any difference.

The errors of estimate in the fifth stage were determined when the item difficulties were shifted .40 sigma units away from the mean in one problem and .40 sigma units toward the mean on another problem.
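The precision-error figures quoted earlier in this section can be reproduced with a brief sketch. Two assumptions are made and should be read as such: the sampling error of the biserial correlation is taken in the usual textbook form, with the normal-curve ordinate at the point dividing P from Q in the denominator, and $\sigma_d$ is taken to be $\sqrt{1 - r_{bis}^2}/r_{bis}$, a form chosen here only because it matches the .99 and .78 values reported in the text. Function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def se_r_bis(r_bis, p, n):
    """Sampling error of a biserial r (assumed textbook form)."""
    nd = NormalDist()
    ordinate = nd.pdf(nd.inv_cdf(p))  # normal ordinate at the P/Q split
    return (sqrt(p * (1.0 - p)) / ordinate - r_bis ** 2) / sqrt(n)


def sigma_d(r_bis):
    """Item precision expressed as sigma_d (assumed relation to r_bis)."""
    return sqrt(1.0 - r_bis ** 2) / r_bis


print(round(se_r_bis(0.75, 0.5, 1000), 2))
print(round(sigma_d(0.71), 2), round(sigma_d(0.79), 2))
```

Under these assumptions the sampling error rounds to .02, and the two $\sigma_d$ values differ from the value at .75 by roughly .10 in each direction.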
As the hypotheses on the effects of error at the second and fifth stages derive from the same rationale, and as the effects of the fifth stage were expected to be in the same direction as the second stage effects only larger, the hypotheses on the second stage errors require only analysis of direction of change (i.e., chi-square) while the hypotheses on the fifth stage errors require more extensive variance analysis.

It was hypothesized that the variance of ability level for the top 84 individuals would be greater for tests with the shifted difficulties than for the test where the items were at the ideal difficulty level. The significance of differences in variances was tested by use of Bartlett's test for homogeneity of variance.

The discrimination of the tests for an ability level was determined by examining, for these same tests, the variance of test scores for the category fifteen ability level individuals. It was hypothesized that the variance of scores for category 15 individuals would be highest when difficulties were closest to the mean value. Variances for the three tests were compared by use of Bartlett's test for homogeneity of variance.

For the test with difficulties at the fifth stage displaced away from the mean by .40 sigma units, it was hypothesized from Lawley's work that the variance of ability level for the 100 middle-scoring individuals would be lower than the variance of these individuals on the other tests. Again Bartlett's test for homogeneity of variance was used as the test.

Ability level discrimination was similarly determined by examination of the variance of test scores for category eight of ability. It was hypothesized that the original test (with ideal difficulties) would have better discrimination than the modified tests. Again Bartlett's test was used to compare variances.
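Bartlett's test, used repeatedly above to compare variances across test models, can be sketched from its standard formula. The function name and the example variances are hypothetical; the statistic would be referred to a chi-square distribution with k - 1 degrees of freedom.

```python
from math import log


def bartlett_statistic(variances, sizes):
    """Bartlett's statistic for homogeneity of k variances.

    variances: unbiased sample variances, one per group.
    sizes: number of individuals in each group.
    Returns (statistic, degrees_of_freedom).
    """
    k = len(variances)
    dfs = [n - 1 for n in sizes]
    df_total = sum(dfs)
    pooled = sum(df * v for df, v in zip(dfs, variances)) / df_total
    numerator = df_total * log(pooled) - sum(
        df * log(v) for df, v in zip(dfs, variances)
    )
    correction = 1.0 + (sum(1.0 / df for df in dfs) - 1.0 / df_total) / (
        3.0 * (k - 1)
    )
    return numerator / correction, k - 1


# Hypothetical variances of test scores for one ability category on three tests.
stat, df = bartlett_statistic([1.2, 1.5, 2.4], [100, 100, 100])
```

When all the group variances are equal the statistic is zero; it grows as the variances diverge.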
The third hypothesis, that errors in estimating the precision of the items would be more serious in the initial stages than at later stages, was tested by placing items of $r_{bis}$ = .71 (instead of $r_{bis}$ = .75) at the second stage. Since subsequent items were designed with the assumption that the second item had $r_{bis}$ = .75, the spread of ability should be greater than ideal for discriminating among individuals arriving at subsequent items. These subsequent items are more difficult than ideal and this increased difficulty should thus force the individuals toward the center of the distribution. The greatest increase in variance of test scores should thus be noticed for high and low ability groups; middle ability groups should not change in variance of test scores produced. The variances of scores for extreme and middle ability levels were compared by use of the F ratio. Also, the variances of ability level scores for individuals ranked in the top 8.4 per cent of the score distribution were tested by the F ratio.

The fourth hypothesis, that errors in estimate of precision should make little difference at the fifth stage, was examined by placing items of $r_{bis}$ equal to .71 at the fifth stage. The difficulty of the items remained the same. The effect of this should be that again the item would be more difficult than the Lord formula would suggest as ideal, because difficulty should be regressed toward the mean depending upon the $r_{bis}$ value. The lower the $r_{bis}$ the more the ideal difficulties should be regressed toward the mean. The results should be that more individuals than ideal would take an easier sixth item which, according to Lawley, should increase the precision of high ability scores. It was also hypothesized that this change in fifth item precision would increase the variance of score levels for high ability individuals.
These results were hypothesized to be in the same direction as results from changes at the second stage, and the F ratio was likewise used to test these hypotheses.

V. GENERAL COMPARISONS

A general comparison of the relationship between input distribution and output distribution of scores was felt to be of value even though no specific hypotheses were advanced due to the number of variables involved. The difficulty of the items, the precision of the items, and the pattern of items taken by individuals of different ability levels all interact to affect the score distribution.

The effect of difficulty of items was noted for the nine tests described in "Effect of Ability Distribution for Additional Sequential Tests." As the difficulties of items in each test do not regress toward the mean at the same rate, no clear conclusion can be made as to the effect of difficulty on output characteristics. The effect of difficulty can thus be determined only for certain ability levels. (The data for the distributions of only one-half of the scores were presented as the other half was symmetrical.)

In addition to the distribution of scores, the correlation ratios were reported as these give information as to the general relationship between the input distribution of ability and the output distribution of scores. In former unpublished trials of the sequential test the value of the Pearson Product-Moment r was made to closely approach that for eta, by assigning the scores to the 64 different sequences of items from the rank of the mean ability level of the individuals at the score. (Another alternative would have been to assign scores according to rank of the sequence if ideal items had been used in the test model.)
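The correlation ratio (eta) mentioned above can be computed from a table of counts by score and ability level. The sketch below assumes that eta is taken for ability on score, that is, individuals are grouped by the score they received; the thesis does not state the direction explicitly in this section, and the function name and data layout are hypothetical.

```python
from math import sqrt


def correlation_ratio(table):
    """Eta of ability on score.

    table: {score: {ability_value: count}} (hypothetical layout).
    Eta squared is the between-score sum of squares of ability divided by
    the total sum of squares of ability.
    """
    n_total = 0.0
    ability_sum = 0.0
    for counts in table.values():
        for ability, n in counts.items():
            n_total += n
            ability_sum += n * ability
    grand_mean = ability_sum / n_total

    ss_between = 0.0
    ss_total = 0.0
    for counts in table.values():
        group_n = sum(counts.values())
        if group_n == 0:
            continue
        group_mean = sum(a * n for a, n in counts.items()) / group_n
        ss_between += group_n * (group_mean - grand_mean) ** 2
        ss_total += sum(n * (a - grand_mean) ** 2 for a, n in counts.items())
    return sqrt(ss_between / ss_total)
```

Because eta depends only on which individuals share a score and not on the numerical values attached to the scores, relabeling the 64 sequences changes the product-moment r but leaves eta unchanged, which is why r can be made to approach eta by a suitable assignment of scores.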
The best general comparison of output to input in regard to precision of item came from the five sequential tests described in "Item Precision and Difficulty for the Sequential Test," where item difficulties and type of distribution remained constant over all five sequential tests. The general comparisons were made in terms of correlation ratios; the data were reported for one-half of the output distribution of scores for the five tests.

A comparison of output to input in regard to the pattern of items taken by an individual came from using the Lord difficulties which yielded a rectangular output of scores when a rectangular distribution of ability was input. The rectangular distribution was used because this best approximated the "least squares" solution. Two new test models were constructed: each had exactly the same items with the same difficulties and same precision ($r_{bis}$ = .75); one test had items distributed as in Figure 1, and the other test had items distributed by one item at the first stage, two items at the second, three at the third, and continued until it had six items at stage six. Only the pattern of items taken by the individuals was different in the two tests. Again eta and the distribution of one-half of the output distribution of scores for each of the two test models were reported.

VI. SUMMARY OF PROCEDURES AND HYPOTHESES

One sequential test model was constructed by the "least squares" (of the deviations from the mean ability level) rule for a rectangular distribution of ability over 15 ability categories and $r_{bis}$ equal to .75 for item precision. (Ability level one represented lowest ability level and ability level 15 represented highest ability level.) The above test was then used with an input of normal and U-shaped distributions of ability.
A six-item cumulative test with all items at the 50 per cent level of difficulty and a precision level of the item-total $r_{bis}$ equal to .75 was likewise used with normal and U-shaped distributions of ability. The output distributions for comparable tests were then examined.

The null statistical hypotheses concerning the effect of the normal ability distribution on output of scores stated that the cumulative and sequential test models should have the following: (The alternative hypothesis expected from the rationale is given in parentheses.)

(1) equal means for the comparable normalized scores for category eight individuals (no alternate, hope to accept null);
(2) equal variances for the comparable normalized scores for category eight individuals (hope to accept null; cumulative may be smaller);
(3) equal means for the comparable normalized scores for combined category 14 and 15 individuals (cumulative lower);
(4) equal variances for the comparable normalized scores for combined category 14 and 15 individuals (sequential smaller);
(5) equal means for the ability level scores for the individuals ranked in the top 8.4 per cent of the score distribution (cumulative lower); and
(6) equal variances for the ability level scores for the individuals ranked in the top 8.4 per cent of the score distribution (sequential smaller).
The null statistical hypotheses concerning the effect of the U-shaped ability distribution on output stated that the cumulative and sequential test models should have the following:

(1) equal means for the comparable normalized scores for category 13 individuals (cumulative lower);
(2) equal variances for the comparable normalized scores for category 13 individuals (sequential smaller);
(3) equal means for the comparable normalized scores for category 15 individuals (cumulative lower);
(4) equal variances for the comparable normalized scores for category 15 individuals (sequential smaller);
(5) equal means for the ability level scores for the individuals ranked in the top 13.5 per cent of the score distribution (cumulative lower); and
(6) equal variances for the ability level scores for the individuals ranked in the top 13.5 per cent of the score distribution (sequential smaller).

In addition to the hypotheses listed above, mean score values for each ability level, and mean ability level for each score value were plotted for both the normal and U-shaped distributions of ability.

Additional information as to effect of distribution of input on output is presented as part of the general comparisons. Three tests were constructed by Lord's rules and each of these was used with normal, rectangular, and U-shaped distributions of ability, although each test was designed to reflect only one of the input distributions. Eta was used to compare the input distribution with output distribution for these nine tests. In addition, the actual output distribution of each of the nine tests was tabled. These tests were built for information, and no hypotheses were made as to results.

To determine the effect of item precision on the output of the sequential test, four test models were constructed with an input of a normal distribution of ability and item precision taking the values of $r_{bis}$ equal to .79, .71, .60, and .45.
Item difficulties were those determined by Lord's procedure to be most appropriate for a given precision level when assuming a normal distribution of scores desired. The variances of ability levels for extreme and middle scores, and the variances of scores for extreme and middle ability levels were examined by use of Bartlett's test.

The null statistical hypotheses (and expected alternatives) concerning the effect of item precision and difficulty stated that tests which use a normal distribution of ability for input and a nearly normal output of scores should yield the following: (The alternative hypothesis is given in parentheses.)

(1) equal variances of scores for category eight ability level individuals for all five tests of different precision levels (most precise test smallest);
(2) equal variances of scores for category 14 and 15 ability level individuals for all five tests of different precision levels (most precise test smallest);
(3) equal variances of ability level scores for the individuals ranked in the top 8.4 per cent by each of the five tests of different precision levels (most precise test smallest); and
(4) equal variances of ability level scores for the individuals ranked in the middle 10 per cent by each of the five tests of different precision levels (most precise test smallest).

In addition to these hypotheses, the means and variances of the test scores, and the discrimination indices between each of the adjacent ability levels were computed for each of the five different precision tests.

To determine the effect of errors of using other than the difficulty level computed by the "least squares" method for certain items, four sequential tests were constructed. One had the second item shifted away from the sample mean in difficulty; another had the second item toward the mean value.
The fifth item encountered by the individual was likewise displaced toward or away from the mean difficulty value in the third and fourth test models, respectively. Again the characteristics of the "error" and "error free" output distributions were examined.

The null statistical hypotheses that were tested concerning the effect of errors in estimating the difficulty of the item at the second stage are as follows: (These hypotheses were used to determine if differences were in the direction hypothesized.)

(1) the number of people in each of a set of score categories would be independent of whether the people were distributed by an "error free" difficulty test or one in which difficulties at the second stage were away from the mean (50 per cent) difficulty (more people at middle for "error" test);
(2) the number of people in each of a set of score categories would be independent of whether the people were distributed by an "error free" difficulty test or one in which difficulties at the second stage were toward the mean (50 per cent) level of difficulty (more people at extreme for "error" test).

The null statistical hypotheses that were tested concerning the effect of errors in estimating the difficulty of the item at the fifth stage predicted the following: (These hypotheses were deduced from the same rationale as the ones above, and data were examined more closely as it was hypothesized that those differences would be in the same direction as the differences above and of a larger magnitude.)
(1) equal variances for the ability level scores for the individuals ranked in the top 8.4 per cent of the score distribution ("error free" test smallest);
(2) equal variances of test scores for individuals in ability category 15 (test with items near 50 per cent largest);
(3) equal variances of ability level scores for the individuals ranked in the middle 10 per cent of the score distribution (test with items away from mean smallest); and
(4) equal variances of test scores for individuals in ability category 8 ("error free" test smallest).

The effect of error in estimating the precision of items was examined by constructing two additional "least squares" test models. One test had less precise items for the second item encountered; the other had less precise items substituted for the fifth item encountered. Again the distributions of scores for the "error" and "error free" tests were examined.

The null statistical hypotheses concerning the effect of error in estimating the precision of items at the second stage predicted the following:

(1) equal variances of test scores for individuals in ability category 15 ("error free" test smaller);
(2) equal variances of test scores for individuals in ability category 8 ("error free" test smaller); and
(3) equal variances of ability level scores for individuals ranked in the top 8.4 per cent of the score distribution ("error free" test smaller).

The null statistical hypotheses concerning the effect of error in estimating the precision of items at the fifth stage predicted the following:

(1) equal variances of test scores for individuals in ability category 15 ("error free" test smaller); and
(2) equal variances of ability level scores for individuals ranked in the top 8.4 per cent of the score distribution ("error free" test smaller).

The general comparison examined the effect of difficulty on score output, the effect of precision of items, and the effect of the pattern of items.
Difficulty effects were examined for normal, rectangular, and U-shaped inputs on tests with item precision of $r_{bis}$ equal to .75 and item difficulties as listed in Table 20 of the Appendix. (The rule for selection of difficulties of items is that one should use an item not at the difficulty level equal to the ability level where the split between groups is desired, but the difficulty level should be regressed toward the mean value of 50 per cent. The lower the $r_{bis}$ the greater should be the regression.) The distributions and mean ability level scores for each score were tabled. Distributions and mean scores were also tabled for five tests with different item precision and for two tests with different patterns of items. In addition to these tables, eta between input and output scores was reported for each of these tests.

CHAPTER IV

ANALYSES AND RESULTS

There are six sections to this chapter. Section one gives the results of building the six-item sequential test model. Section two reports results of the input distribution on the score distribution of both the sequential and the six-item cumulative test models. Section three presents the effects of item precision and difficulty on the score distribution of the six-item sequential test model. Section four gives the effects of errors of estimating precision and difficulty parameters on the score distribution of the sequential test. Section five gives some general results of changes in difficulty of items, precision of items, and pattern of items. Section six is a summary of the analyses and results. In all sections results are simply reported; interpretation is reserved for Chapter V.

I. SEQUENTIAL TEST CONSTRUCTION

As stated in Chapter III, the sequential test model was constructed so that the $\sum (\sum X)^2 / N$ was maximized; graphic methods were used to aid in determination of maximum values. ($\sum X$ refers to the sum of ability level scores for any one group.
$\sum (\sum X)^2 / N$ refers to squaring the sum of scores for the group, dividing by the number in the group, and then summing over the two or more groups that used the particular item.) The only restriction was that any item difficulty had to be more than .20 standard score units away from other difficulties to be considered different from them, and thus to be used. (The reader will be aided in following the item decisions given below by referring to Figure 3.)

First Item Decision

The values of $\sum (\sum X)^2 / N$ for +.01, .00, and -.01 difficulty items were as follows: 109073.85, 179931.86, and 109073.85. The maximum value was thus obtained from a .00 difficulty level item and this item fulfilled the criterion of selection. Thus out of the 1500 people taking the hypothetical test, 750 would pass and 750 would fail this item. The mean ability level of these groups was ±.73.

Second Item Decision

The second item produced four groups over which $\sum (\sum X)^2 / N$ was maximized. The three strategic values for this item were ±.23, ±.24, and ±.25, which had values for $\sum (\sum X)^2 / N$ of 113796.15, 113796.21, and 113796.00. (Strategic values were determined by estimating values and plotting these values of $\sum (\sum X)^2 / N$ until the maximum value was straddled by three points that could be read from the graph.) The ±.24 items were selected for the second stage. The resulting four groups had mean ability levels of +1.04, +.10, -.10, and -1.04. At this point 504 individuals had passed both the first and second items; 246 had passed the first and failed the second; and like numbers had failed both, and the first only.

[Fig. 3. Mean Ability Level of Groups Separated by Sequential Test and Difficulties of Items Used]
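The selection criterion used in these item decisions can be sketched for a single item. The following is a hedged illustration and not the thesis program: it assumes a normal-ogive item model in which the probability of passing is $\Phi(a(\theta - b))$ with $a = r_{bis}/\sqrt{1 - r_{bis}^2}$, it assumes 15 equally spaced ability categories of 100 individuals each (the spacing of .3 is invented), and all function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def criterion(abilities, counts, difficulty, slope):
    """Sum over the pass and fail groups of (sum of ability)^2 / group size."""
    nd = NormalDist()
    total = 0.0
    for passed in (True, False):
        group_n = 0.0
        group_sum = 0.0
        for theta, n in zip(abilities, counts):
            p_pass = nd.cdf(slope * (theta - difficulty))  # assumed item model
            share = n * (p_pass if passed else 1.0 - p_pass)
            group_n += share
            group_sum += share * theta
        if group_n > 0:
            total += group_sum ** 2 / group_n
    return total


def best_difficulty(abilities, counts, slope, candidates):
    """Candidate difficulty giving the largest criterion value."""
    return max(candidates, key=lambda b: criterion(abilities, counts, b, slope))


r_bis = 0.75
slope = r_bis / sqrt(1.0 - r_bis ** 2)          # assumed relation to precision
abilities = [0.3 * (i - 7) for i in range(15)]  # hypothetical category values
counts = [100] * 15                             # rectangular input, 1500 people
candidates = [i / 10 for i in range(-10, 11)]
first_item = best_difficulty(abilities, counts, slope, candidates)
```

For a symmetric rectangular input the search settles on a first item at the mean difficulty, which agrees with the .00 item chosen in the first item decision; the criterion values themselves will not match the thesis figures because the ability scale here is invented. Later stages would apply the same search to each group arriving at an item, subject to the .20 minimum separation between difficulties.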
Third Item Decision

The third stage items were reduced to three in number, as the two middle groups were both given the same difficulty. Both of these middle groups took the same difficulty because each had a sum of $(\sum X)^2/N$ of 32878.55 for ±.09 items and 32878.02 for ±.10 items. As $\sum(\sum X)^2/N$ maximized at less than .10, the ideal difficulty levels would be less than .20 sigma units apart. As this would violate a condition of the test construction, the two middle groups were given the same item, which yielded a $\sum(\sum X)^2/N$ of 32885.60. The two extreme ability groups produced $\sum(\sum X)^2/N$ equal to 83416.29, 83417.00, and 83416.84 for .48, .49, and .50 difficulty items, respectively. Thus the three difficulty levels used at the third stage are +.49, .00, and -.49. The mean ability levels of the eight resulting groups were, from highest to lowest, 1.21, .62, .48, .33, -.33, -.48, -.62, and -1.21.

Fourth Item Decision

At this stage there were eight groups taking four different difficulty items (±.73 and ±.40) and resulting in sixteen groups. Those individuals who had passed (or failed) the first three items had $\sum(\sum X)^2/N$ equal to 62860.59, 62860.69, and 62860.50 for ±.72, ±.73, and ±.74 items, respectively. The ±.73 item difficulty was selected. The second group (PPQ or QQP) had maximum values between ±.45 and ±.50, which were more than .20 standard score units away from ±.73. However, the third group (PQP or QPQ) had a $\sum(\sum X)^2/N$ that maximized above .32. The similarity of the groups is shown in that while a .32 maximum is 18242.13, the .41 maximum is 18243.05. Since such values would give items less than .20 standard deviation units away, the second and third groups were each given the same difficulty. The remaining group (QPP or PQQ) maximized between .29 and .35 for $\sum(\sum X)^2/N$ of 15387.12 and 15386.92, respectively.
Since the best difficulty level for the previous two groups would be less than .20 standard deviation units away, all three groups were given an item of the same difficulty level. The strategic values for difficulty of item assigned to the three groups were ±.39, ±.40, and ±.41, which had $\sum(\sum X)^2/N$ values of 54983.63, 54983.81, and 54983.57. Thus the ±.40 item difficulties were used. Of the eight groups at this stage, one group took +.73, three took +.40, three took -.40, and one took an item of -.73 difficulty level.

Fifth Item Decision

The fifth stage decisions resulted in sixteen groups taking six items of different difficulty, thus producing thirty-two new groups. The groups that took the different difficulty levels were as follows: The PPPP and QQQQ groups took items of +.87 and -.87 difficulty. The PPPQ, PPQP, PQPP, and QPPP groups took an item of +.66 level of difficulty. (The QQQP, QQPQ, etc. opposites of the above took an item at -.66.) The PPQQ, PQPQ, and QPPQ groups each took an item of +.15 difficulty level. (Opposite groups took the -.15 difficulty item.) In other words, for the eight groups above the mean, one group took an item of .87 difficulty, four groups took an item of .66, and three groups took an item of .15 level of difficulty.

The PPPP group (and its opposite) had $\sum(\sum X)^2/N$ values of 45791.18, 45791.20, and 45791.18 for .86, .87, and .88 levels of difficulty. The PPPQ group maximized the $\sum(\sum X)^2/N$ just above the .71 difficulty level; thus the decision had to be made to give this group either the same difficulty item as the PPPP group or the difficulty of the PPQP group. The PPQP group maximized between .67 and .71, with $\sum(\sum X)^2/N$ values of 13411.14 and 13411.11, respectively. These two groups were thus given the same difficulty level, as their curves remained fairly near maximum for the difficulty level common to both. The PQPP group maximized $\sum(\sum X)^2/N$ between .60 and .65 with values of 10395.92 and 10395.98.
The QPPP group maximized at about .60 with $\sum(\sum X)^2/N$ of 7826.26. Since none of these was .20 standard score units apart in difficulty, the one difficulty value that would maximize $\sum(\sum X)^2/N$ for all four groups was determined. The difficulties of .65, .66, and .67 had $\sum(\sum X)^2/N$ values for the eight groups of 49000.44, 49000.65, and 49000.60. The item of ±.66 difficulty level was thus used for these eight groups.

The PPQQ group maximized $\sum(\sum X)^2/N$ between .20 and .28, with values of 8208.84 and 8298.88, respectively. This was more than .20 standard deviation units from .66, so this group was not given the item of .66 difficulty level. The PQPQ group maximized $\sum(\sum X)^2/N$ at .15 with 8096.18. (Difficulty levels .10 and .14 had $\sum(\sum X)^2/N$ values of 8096.15 and 8096.17, respectively.) The QPPQ group maximized between .00 and .10 difficulty levels. This was not .20 standard score units of difficulty away, so the one difficulty level that would maximize the sum of $(\sum X)^2/N$ for these six groups was determined. The strategic difficulty levels of .14, .15, and .16 had $\sum(\sum X)^2/N$ for the six groups of 24092.29, 24092.36, and 24092.30.

Sixth Item Decision

The sixth stage had 32 groups taking items at five different difficulty levels (±.87, ±.49, and .00). The PPPPP group maximized $\sum(\sum X)^2/N$ between .90 and 1.00 difficulty; the respective $\sum(\sum X)^2/N$ values are 32852.58 and 32852.64. The group PPPPQ had $\sum(\sum X)^2/N$ values of 13051.00, 13051.01, and 13051.00 at .86, .87, and .88, respectively. Thus it was clear that these two would not use different difficulties of item, and neither would any group that maximized above .75.
The other groups which maximized above .75 were as follows: the PPPQP group, which for .85, .86, .87, and .88 had $\sum(\sum X)^2/N$ values of 11334.16, 11334.17, 11334.17, and 11334.16, respectively; the PPQPP group, which for .85, .86, .87, and .88 had 8373.86, 8373.86, 8373.86, and 8373.84, respectively; the PQPPP group, which maximized $\sum(\sum X)^2/N$ between .80 and .85 with values of 6059.83 and 6059.79, respectively; and the QPPPP group, which maximized between .74 and .80, both with a $\sum(\sum X)^2/N$ value of 4227.53.

The $\sum(\sum X)^2/N$ values for the 12 groups using the same difficulty level of item were 75898.65, 75898.66, and 75898.60 for the .86, .87, and .88 levels of difficulty, respectively. The decision was thus to use a .87 difficulty item for these groups.

The PPPQQ group (the next highest ability level group) maximized between .55 and .65 with $\sum(\sum X)^2/N$ values of 6150.61 and 6150.64, respectively. (The approximate value for the maximum was determined by plotting the curve from six points.) Since the group maximized more than .20 standard deviation units away from the .87 groups and also maximized within five points of the next lowest group, the decision was made to use a new difficulty for all remaining groups that maximized above .40.
The remaining groups which maximized at difficulty levels greater than .40 (but below .60) were as follows: the PPQPQ group, which for difficulty levels of .50, .55, and .65 had $\sum(\sum X)^2/N$ values of 5135.92, 5136.00, and 5135.88, respectively; the PQPPQ group, which for difficulty levels of .43, .48, .49, and .50 had $\sum(\sum X)^2/N$ values of 4422.34, 4422.41, 4422.41, and 4422.40, respectively; the PPQQP group, which for difficulty levels of .43, .48, .49, and .50 had $\sum(\sum X)^2/N$ values of 4680.27, 4680.33, 4680.33, and 4680.32, respectively; the PQPQP group, which for difficulty levels of .32, .43, and .48 had $\sum(\sum X)^2/N$ values of 4198.70, 4198.83, and 4198.77, respectively; and the QPPPQ group, which for .32, .43, and .48 had $\sum(\sum X)^2/N$ values of 3670.06, 3670.19, and 3670.14, respectively.

The QPPQP group maximized $\sum(\sum X)^2/N$ between .32 and .43. Difficulty levels of .29, .32, and .43 had values of 3628.46, 3628.48, and 3628.35, respectively. A decision thus had to be made whether to include this group with the higher or the lower groups. The PQQPP group (next in line) for difficulty levels of .00 and .09 had $\sum(\sum X)^2/N$ values of 4244.60 and 4243.82 and maximized below .09. For this reason the QPPQP group was included with the higher group instead of being given an item .20 units lower in difficulty, which would have yielded a lower $\sum(\sum X)^2/N$ value.

The sums of $(\sum X)^2/N$ for the 14 groups for difficulty levels of .48, .49, and .50 were 31886.02, 31886.04, and 31885.95. Thus a difficulty of ±.49 was used with each of these groups.

The remaining six groups all maximized between ±.09; thus the .00 item was used here. The QPPQQ group for difficulty levels of .00 and .09 had $\sum(\sum X)^2/N$ values of 4244.60 and 4243.82, respectively. The PQPQQ group had 3983.96 and 3983.54 for these same values, and the PPQQQ group for difficulty levels of .00 and .09 had $\sum(\sum X)^2/N$ values of 3615.52 and 3615.32, respectively.
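The six item decisions can be collected into one small table and checked for internal consistency. The encoding below (a dictionary from stage number to difficulty and group count) is only an illustrative layout with hypothetical variable names; the difficulties and the numbers of groups are those stated in the item decisions above.

```python
# stage -> {item difficulty: number of groups given that item}
pattern = {
    1: {0.00: 1},
    2: {+0.24: 1, -0.24: 1},
    3: {+0.49: 1, 0.00: 2, -0.49: 1},
    4: {+0.73: 1, +0.40: 3, -0.40: 3, -0.73: 1},
    5: {+0.87: 1, +0.66: 4, +0.15: 3, -0.15: 3, -0.66: 4, -0.87: 1},
    6: {+0.87: 6, +0.49: 7, 0.00: 6, -0.49: 7, -0.87: 6},
}

for stage, items in pattern.items():
    groups = sum(items.values())
    # Every group entering a stage came from a pass/fail split,
    # so stage k should be taken by 2**(k-1) groups.
    assert groups == 2 ** (stage - 1), (stage, groups)

# Number of different item difficulties needed at each stage.
distinct_items = {stage: len(items) for stage, items in pattern.items()}

# Each sixth-stage group splits once more, giving the final score groups.
final_scores = 2 * sum(pattern[6].values())
```

The counts of distinct difficulties per stage (1, 2, 3, 4, 6, 5) agree with the numbers quoted in the text, and the final split yields the 64 response sequences of the six-item test.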
Thus, of the 16 groups above the mean, six groups took the .87 difficulty item, seven groups took the .49 difficulty item, and three groups took the .00 difficulty item at the final stage.

The above sequential test was compared with the cumulative test to determine how well the score differentiated individuals of different ability levels and to determine the range of ability levels assigned to any one score.

The above sequential test was also used in the determination of the effects of errors in estimating the parameter values for the items in this test. Parameter values considered were difficulty and precision.

The pattern of items determined above was also used with different difficulties to determine how a test with an arbitrary pattern and easily computed difficulties compared with a test using the pattern of items determined above.

II. INPUT DISTRIBUTION EFFECTS

Normal and U-shaped distributions were each used with the cumulative and the "least squares" sequential test. The results from the two distributions are presented separately.

Results from the Normal Distribution

The first null hypothesis was that there should be equal means of the comparable normalized scores for the middle category (category eight) individuals taking the sequential and cumulative tests, both with a normal distribution of ability input. Results are shown in Table 1. As can be seen from the table, the null hypothesis, tested by a "t" test, must be accepted. This was expected, as both have a symmetrical distribution of scores. This hypothesis was included as a parallel hypothesis to hypothesis one on the U-shaped distribution (and as a check on the accuracy of computer computations). In this, and all other hypotheses, the reader should be aware of the fact that the number of individuals is dependent only upon the accuracy of the calculations. Since the figures were carried to between eight and twelve places, a larger N could well be assumed.
This would make the error terms smaller and the differences significant. The theoretical 1000 individuals were used to give the reader a point of reference. If the differences exist in the proper direction, the rationale may be said to be supported.

The second null hypothesis was that there would be equal variances of the comparable normalized scores for middle category (number 8) individuals taking the sequential test and the cumulative test, both with a normal distribution of ability input. Results are shown in Table 1. Again the null hypothesis, tested by an F ratio test, must be accepted. This was expected from the rationale.

The third null hypothesis was that there should be equal means of the comparable normalized scores for combined category 14 and 15 individuals taking the sequential and cumulative tests, both with a normal distribution of ability input. Results are shown in Table 2. The null hypothesis was based upon 1000 individuals and accepted. The scores were in the expected direction, with the sequential test assigning the more extreme value; therefore, the rationale tends to be supported.

The fourth null hypothesis was that there should be equal variances of the comparable normalized scores for combined category 14 and 15 individuals taking the sequential and cumulative tests, both with a normal distribution of ability input. Results are shown in Table 2. The null hypothesis was rejected at the .01 level of significance. The sequential test had lower variance for high ability individuals, as was predicted from the rationale.

The fifth null hypothesis was that there should be equal means of ability level scores for the individuals in the top 8.4 per cent of the score distributions taking the sequential and cumulative tests, both with a normal distribution of ability input. Results are shown in Table 3. The null hypothesis was rejected at the .01 level of significance.
The sequential test had a higher mean ability level for the top 8.4 per cent of the score distribution, as had been predicted.

TABLE 1

ANALYSIS OF MEANS AND VARIANCES OF NORMALIZED SCORES FOR CATEGORY 8 INDIVIDUALS WHEN NORMAL DISTRIBUTION OF ABILITY IS INPUT INTO SEQUENTIAL AND CUMULATIVE TEST MODELS

Parameter    Sequential Test    Cumulative Test    Significance Between Tests
Mean         50.00              50.00              n.s.
Variance     16.37              21.22              n.s.

TABLE 2

ANALYSIS OF MEANS AND VARIANCES OF NORMALIZED SCORES FOR CATEGORY 14 AND 15 INDIVIDUALS WHEN NORMAL DISTRIBUTION OF ABILITY IS INPUT INTO SEQUENTIAL AND CUMULATIVE TEST MODELS

Parameter    Sequential Test    Cumulative Test    Significance Between Tests
Mean         63.40              62.96              n.s.
Variance     3.87               6.77               p < .01

TABLE 3

ANALYSIS OF MEANS AND VARIANCES OF ABILITY LEVEL SCORES FOR THE TOP 8.4 PER CENT OF THE SCORE DISTRIBUTION WHEN NORMAL DISTRIBUTION OF ABILITY IS INPUT INTO SEQUENTIAL AND CUMULATIVE TESTS

Parameter    Sequential Test    Cumulative Test    Significance Between Tests
Mean         13.66              12.92              p < .01
Variance     2.35               3.47               p < .05

Null hypothesis six was that there should be equal variances of ability level scores for the individuals in the top 8.4 per cent of score distributions taking sequential and cumulative tests, both with a normal distribution of ability input. The results are shown in Table 3. The null hypothesis was rejected at the .05 level of significance. The sequential test had smaller variance of ability level scores for the top 8.4 per cent of the score distribution, as had been predicted.

To examine the hypothesis that the six-item cumulative test model would have smaller differences in mean ability levels between the middle and adjacent scores than between the extreme and adjacent scores, the differences in mean ability level for adjacent scores were computed. These differences are reported in Table 5, column 3. As was hypothesized, the smaller differences in ability level were between the middle score, 4, and the adjacent score, 5.
However, it should be noted that the differences between ability level scores for adjacent scores for the sequential test model (shown in Table 6) were not equal interval, and there is no pattern to the differences shown, although in both cases the differences were greatest for the extreme scores.

If one wishes to examine the mean ability level and number of individuals at each score, these values are shown in Tables 25 and 26 of the Appendix. The mean normalized "T" score for each ability level is reported in Table 24 of the Appendix.

TABLE 4

DIFFERENCES BETWEEN NORMALIZED "T" SCORES FOR ADJACENT TOP ABILITY LEVELS FOR NORMAL AND U-SHAPED INPUT

Between                        Normal Input               U-Shaped Input
Ability      Ideal
Levels       Difference    Cumulative  Sequential    Cumulative  Sequential
15-14        4.5           2.3         2.5           1.2         2.6
14-13        2.5           1.1         2.1           1.1         1.7
13-12        2.5           1.9         2.1           1.5         1.8
12-11        2.5           2.0         2.3           1.7         1.7
11-10        2.4           2.4         2.3           1.7         1.4
10-9         2.5           2.3         2.3           1.6         1.4
9-8          2.5           2.5         2.3           1.6         1.2

TABLE 5

DIFFERENCES BETWEEN ABILITY LEVEL SCORES FOR ADJACENT TOP SCORES FOR CUMULATIVE TEST MODEL FOR NORMAL AND U-SHAPED INPUT

                                        Input
Between Scores*    Ideal Difference     Normal    U-Shaped
7-6                2.33                 2.1       1.9
6-5                2.33                 1.5       2.0
5-4                2.33                 1.3       2.0

*Scores range from 1-7.

TABLE 6

DIFFERENCES BETWEEN ABILITY LEVEL SCORES FOR ADJACENT TOP SCORES FOR SEQUENTIAL TEST MODEL FOR NORMAL AND U-SHAPED INPUT

Between         Input                 Between         Input
Scores*     Normal   U-Shaped         Scores*     Normal   U-Shaped
64-63       1.4      .8               48-47       .3       .1
63-62       .0       .0               47-46       .3       .3
62-61       .2       .1               46-45       .2       .5
61-60       .1       .2               45-44       .3       .2
60-59       .3       .2               44-43       .2       .0
59-58       .1       .2               43-42       .0       .4
58-57       .6       .6               42-41       .0       .1
57-56       .3       .1               41-40       .0       -.1
56-55       .0       .2               40-39       .5       .5
55-54       .0       .1               39-38       -.4      -.1
54-53       .5       .2               38-37       .4       .1
53-52       .0       .0               37-36       .0       .0
52-51       .0       .3               36-35       .0       .2
51-50       .0       .1               35-34       .0       .2
50-49       .2       -.1              34-33       .3       .3
49-48       .0       .4

*Ideal difference if all had been equal intervals would be [illegible].
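The "ideal difference" of 2.33 in Table 5 is consistent with spreading the fifteen ability levels evenly over the seven cumulative scores. The short sketch below shows that arithmetic. Treating the ideal as (highest level minus lowest level) divided by (number of scores minus one) is an assumption made here for illustration, not a formula stated in the text, and the function name is hypothetical.

```python
def ideal_interval(n_ability_levels, n_scores):
    """Spacing between adjacent scores if they covered the ability range evenly.

    Assumed reading: (top level - bottom level) / (number of scores - 1).
    """
    return (n_ability_levels - 1) / (n_scores - 1)


cumulative = ideal_interval(15, 7)    # seven cumulative scores, 1 through 7
sequential = ideal_interval(15, 64)   # sixty-four sequential rank scores
print(f"cumulative: {cumulative:.2f}, sequential: {sequential:.2f}")
```

Under this assumption the cumulative figure reproduces the tabled 2.33, and the same reasoning would put the equal-interval spacing for the 64 sequential scores at roughly .22, which gives a rough yardstick for the differences listed in Table 6.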
Results from the U-Shaped Distribution

The first null hypothesis was that there should be equal means of the comparable normalized scores for category 13 individuals taking the sequential and cumulative tests, both with an input of a U-shaped distribution of ability. Results are shown in Table 7. As can be seen, the null hypothesis must be accepted. The sequential test did have the higher mean value as expected, but not significantly so if 1000 individuals are assumed to have taken the test. The rationale would tend to be supported, though the effect is small. (See comments on size of N under "Results from Normal Distribution.")

The second null hypothesis was that there should be equal variances of the comparable normalized scores for category 13 individuals taking the sequential and cumulative tests, each with an input of a U-shaped distribution of ability. From Table 7 one can determine that the null hypothesis must be accepted if only 1000 individuals are assumed to have taken the test. The variance of the sequential test was less, however, than that of the cumulative test, as anticipated, though the effect was small.

The third null hypothesis was that there should be equal means of the comparable normalized scores for category 15 individuals taking the sequential and cumulative tests, both with an input of a U-shaped distribution of ability. The results are shown in Table 8. The null hypothesis must be accepted, although the results were in the direction indicated by the research hypothesis. The cumulative had a lower value for the mean. Again, significance depends upon the number of individuals assumed to have taken the test.

TABLE 7

ANALYSIS OF MEANS AND VARIANCES OF NORMALIZED SCORES FOR CATEGORY 13 INDIVIDUALS WHEN A U-SHAPED DISTRIBUTION OF ABILITY IS INPUT INTO SEQUENTIAL AND CUMULATIVE TEST MODELS

Parameter    Sequential Test    Cumulative Test    Significance Between Tests
Mean         58.44              58.03              n.s.
Variance     13.99              14.00              n.s.

TABLE 8

ANALYSIS OF MEANS AND VARIANCES OF NORMALIZED SCORES FOR CATEGORY 15 INDIVIDUALS WHEN A U-SHAPED DISTRIBUTION OF ABILITY IS INPUT INTO SEQUENTIAL AND CUMULATIVE TEST MODELS

Parameter    Sequential Test    Cumulative Test    Significance Between Tests
Mean         60.73              60.44              n.s.
Variance     1.96               3.62               p < .01

TABLE 9

ANALYSIS OF MEANS AND VARIANCES OF ABILITY LEVEL SCORES FOR THE TOP 13.5 PER CENT OF THE SCORE DISTRIBUTION WHEN A U-SHAPED DISTRIBUTION OF ABILITY IS INPUT INTO SEQUENTIAL AND CUMULATIVE TESTS

Parameter    Sequential Test    Cumulative Test    Significance Between Tests
Mean         14.43              13.87              p < .01
Variance     .77                1.86               p < .01

The fourth null hypothesis was that there should be equal variances of the comparable normalized scores for category 15 individuals taking the sequential and cumulative tests, both with an input of a U-shaped distribution of ability. As shown in Table 8, the null hypothesis was rejected at the .01 level of significance. The sequential test had less variance of scores for the highest ability level individuals than did the cumulative test.

The fifth null hypothesis was that there should be equal means of ability level scores for the individuals in the top 13.5 per cent of the score distribution taking the sequential and cumulative tests, both with an input of a U-shaped distribution of ability. The results are shown in Table 9. The sequential test had a significantly higher mean ability level for the top 13.5 per cent of the score distribution than did the cumulative. This was in the direction hypothesized.

The sixth hypothesis was that there should be equal variances of ability level scores for the individuals in the top 13.5 per cent of the score distribution taking the sequential and cumulative tests, both with an input of a U-shaped distribution of ability.
The results in Table 9 indicate that the sequential test had, at the .01 level of significance, a smaller variance of ability level scores for the top 13.5 per cent of the score distribution than did the cumulative test. This was in the direction hypothesized.

The differences in mean ability level between adjacent top scores for the cumulative and sequential test models are shown in Tables 5 and 6, respectively. The scores on the sequential test did not yield equal intervals on the ability level scale, as had been hypothesized. The cumulative scores are a good approximation of equal intervals on the ability level scale.

To examine the hypothesis that the sequential test model should have approximately equal distance between test score means for each of the ability categories, while the six-item cumulative would have larger differences in mean test scores for the middle ability levels than for extreme ability levels, the differences between adjacent scores were computed. These differences are reported in Table 4. The cumulative test did have smaller score differences between the extreme ability levels than at any other point in the ability distribution. However, the sequential test did not have an equal interval scale, but in general decreased in size of difference between mean scores of adjacent ability levels from the extreme ability category to the middle ability category. It should be noted that neither test represented the ability levels with any real accuracy. The top ability level shown had an ideal "T" score of 69 instead of the 61.8 assigned by the sequential or the 60.4 assigned by the cumulative test. (See Table 24.)

III. ITEM PRECISION AND DIFFICULTY FOR THE SEQUENTIAL TEST

Five levels of precision and the appropriate levels of difficulty for each were used in the construction of five sequential test models.
For these tests the variances of scores for the extreme and middle ability levels and the variances of ability level for extreme and middle scores were examined.

Variance of Scores

The first null hypothesis was that there would be equal variances of scores for category 8 ability level individuals for all five tests of different precision level. Data and results are shown in Table 10. The null hypothesis was rejected at the .001 level of significance. As was hypothesized, the more precise tests had smaller variances.

The second null hypothesis was that there would be equal variances of scores for a combination of ability level categories 14 and 15 for all five tests of different precision level. From data in Table 10 it can be seen that the null hypothesis was rejected at the .001 level of significance; the more precise tests had smaller variances of scores. As was hypothesized, the precision of the item was an important variable in precision of scores.

TABLE 10

ANALYSIS OF THE VARIANCE OF SCORES FOR INDIVIDUALS AT SPECIFIED ABILITY LEVELS FOR FIVE TESTS OF DIFFERENT PRECISION

Ability                  Precision of Test                      Significance
Category     .45       .60       .71       .75       .79       of Difference
8            260.03    198.70    147.52    127.33    111.65    p < .001
14 and 15    94.74     40.90     19.69     15.50     11.86     p < .001

TABLE 11

ANALYSIS OF THE VARIANCE OF ABILITY LEVEL SCORES FOR INDIVIDUALS AT SPECIFIED SCORE LEVELS FOR FIVE TESTS OF DIFFERENT PRECISION

Score Level              Precision of Test                      Significance
(Per Cent)   .45       .60       .71       .75       .79       of Difference
Top 8.4      5.88      3.71      2.45      2.10      1.77      p < .001
Middle 10    8.22      5.37      3.62      3.06      2.56      p < .001

Variance of Ability Levels

The third null hypothesis was that there would be equal variances of ability level scores for the individuals ranked in the top 8.4 per cent of the score distribution by each of the five tests of different precision level. Data and results are shown in Table 11. The null hypothesis was rejected at the .001 level of significance.
As can be seen, the precision of item was important in determining the precision of the scores, as hypothesized. The individuals assigned to the top 8.4 per cent of the score distribution were not as variable in ability level when assigned by a test with items having an $r_{bis}$ of .79 as when assigned by a test with items having an $r_{bis}$ of .45.

The fourth null hypothesis was that there would be equal variances of ability level scores for the individuals ranked in the middle 10 per cent of the score distribution by each of the five tests of different precision level. As can be seen in Table 11, the null hypothesis was rejected at the .001 level of significance. The results were in the direction hypothesized: the more precise tests had smaller variance of ability levels. However, it should be noted that for the middle 10 per cent of the score distribution, the variances were 8.22 and 2.56 for the .45 and .79 tests, respectively. The one variance is 3.21 times larger than the other. For the top 8.4 per cent of the score distribution the variances were 5.88 and 1.77 for the .45 and .79 tests, respectively. The larger variance is 3.32 times the other. Greater differences at the top than at the middle of the score distribution were contrary to what had been expected.

Table 12 gives the means and variances of rank scores assigned to each ability level by the five tests of different precision. The means for category 8 individuals were always the same. However, the mean rank scores assigned to category 1 individuals were lower as the precision of the item increased. This was especially noticeable at the lower precision levels. The variances of the test scores for each ability level decreased with the precision of item, as was hypothesized. Also, it should be noted that the variances of extreme scores were much lower than the variances of the middle value scores.

The discrimination indices are reported in Table 13.
(Only one-half of the score distribution is tabled because the two halves are symmetrical about the mean.) The higher the value, the better the discrimination. The test consisting of the most precise items had the highest discrimination index. The test was more discriminating for the extremes in ability than it was for the other ability values. However, the other values of the discrimination index were remarkably close to each other for all ability levels other than the extremes. This was what had been hoped for with the sequential test.

TABLE 12

THE MEANS AND VARIANCES OF RANK SCORES ASSIGNED TO EACH ABILITY LEVEL BY FIVE TESTS OF DIFFERENT PRECISION

[Columns: Ability Level; Mean of Rank Scores for Different Precision Levels (.45, .60, .71, .75, .79); Variances of Rank Scores for Different Precision Levels (.45, .60, .71, .75, .79). Tabled values illegible.]

TABLE 13

THE DISCRIMINATION INDICES BETWEEN ADJACENT ABILITY LEVELS FOR THE INPUT OF A NORMAL DISTRIBUTION OF ABILITY INTO TESTS OF DIFFERENT PRECISION

Between               Precision Level of Test
Ability Levels    .45     .60     .71     .75     .79
1 and 2           .42     .58     .89     .70     .81
2 and 3           .22     .30     .42     .43     .50
3 and 4           .22     .33     .43     .47     .51
4 and 5           .24     .34     .43     .50     .54
5 and 6           .23     .34     .46     .52     .57
6 and 7           .24     .38     .46     .52     .59
7 and 8           .24     .35     .48     .52     .58

IV. ERRORS IN THE SEQUENTIAL TEST PARAMETER ESTIMATES

Errors in estimating the difficulty level of items and errors in estimating the precision of items were investigated. Four different tests with errors in estimates of difficulty were constructed, and two tests with errors in precision were built. All tests used the "least squares" difficulties as the base for comparison. The results of investigating these two types of errors will be discussed separately.

Errors in Estimating Difficulty

Of the four tests with errors in estimates of item difficulty, two had the error at the second item encountered and two had the error at the fifth item encountered.

Second item error.--The first null hypothesis was that the number of people in each set of score categories would be independent of whether the people were classified by an "error free" test or one which had items too far from the mean at the second stage. The distributions are reported in Table 27 of the Appendix. The number of individuals at 12 selected categories, the expected values from an independence assumption, and the chi-square value are reported in Table 14. The null hypothesis had to be accepted. There were more people at the middle values, as hypothesized, but the differences were not significant if 1000 people were assumed to have taken the test. It can be concluded that the effects of second-item errors are small.

The second null hypothesis was that the number of people in each set of score categories would be independent of whether the people were classified by an "error free" test, or by a test which had the second item encountered too near the mean value. The distribution is reported in Table 27 in the Appendix. The number of individuals at 12 selected categories, the expected values from an independence assumption, and the chi-square value are reported in Table 15.
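The expected values "from an independence assumption" and the chi-square statistic can be recomputed from the observed counts. The sketch below uses the counts given in Table 14 for rank scores 33 to 64 and mirrors them to supply scores 1 to 32, following the table note that the two halves are symmetrical. The mirroring step and all names in the code are assumptions of this illustration, and the one partly illegible count is taken as 86, the value implied by the tabled expected frequencies.

```python
# Observed counts for rank-score bands 64, 58-63, 54-57, 46-53, 40-45, 33-39.
second_item_extreme = [59, 97, 50, 106, 86, 100]
error_free = [65, 102, 65, 86, 86, 87]


def mirror(half):
    """Assume the bands for scores 1-32 repeat the tabled bands in reverse."""
    return half + half[::-1]


def chi_square(rows):
    """Pearson chi-square for a contingency table under independence."""
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand = sum(row_totals)
    # Expected count = row total * column total / grand total.
    expected = [[rt * ct / grand for ct in col_totals] for rt in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(rows, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, dof, expected


stat, dof, expected = chi_square(
    [mirror(second_item_extreme), mirror(error_free)]
)
print(round(expected[0][0], 2), round(stat, 2), dof)
```

Run as written, this gives an expected frequency of about 62.44 for the first cell and a chi-square near 10.6 on 11 degrees of freedom, in line with the values tabled for the first comparison.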
The null hypothesis had to be accepted. However, there were more people at the extreme categories in the modified test than in the "error free" test, as was hypothesized. The differences were not significant due to the assumption of 1000 individuals.

TABLE 14

DISTRIBUTION OF INDIVIDUALS BY TWO TESTS: ONE TEST WITH SECOND ITEM DIFFICULTIES FARTHER FROM 50 PER CENT LEVEL THAN THE "ERROR FREE" TEST*

                                        Rank Scores
Test             64        58-63      54-57     46-53     40-45     33-39
2nd Item         (62.44)   (100.20)   (57.91)   (96.68)   (86.61)   (94.16)
extreme          59        97         50        106       86        100
"Error           (61.56)   (98.80)    (57.09)   (95.32)   (85.39)   (92.84)
Free"            65        102        65        86        86        87

$\chi^2$ = 10.624    d.f. = 11

TABLE 15

DISTRIBUTION OF INDIVIDUALS BY TWO TESTS: ONE TEST WITH SECOND ITEM DIFFICULTIES NEARER TO 50 PER CENT LEVEL THAN THE "ERROR FREE" TEST*

                                        Rank Scores
Test             64        58-63      54-57     46-53     40-45     33-39
2nd Item         (67.24)   (104.14)   (73.81)   (82.40)   (88.47)   (85.94)
near 50          68        104        81        77        89        83
"Error           (65.76)   (101.86)   (72.19)   (80.60)   (86.53)   (84.06)
Free"            65        102        65        86        86        87

$\chi^2$ = 10.624    d.f. = 11

*NOTE: The rank scores are broken to make approximately equal intervals on the ability scale. The scores 1-32 are not reported in the table but are symmetrical about 32.5. All values were used in the calculations of chi-square. Expected cell frequencies are given in parentheses.

Fifth item error.--The first null hypothesis was that of equal variances of the ability level scores for the individuals ranked in the top 8.4 per cent of the score distribution by the "error free" difficulty test and the tests which had the fifth item too far and too near the mean value. The variances of the ability level scores for the top 8.4 per cent in each of the tests are reported in Table 16. The variances were not significantly different from each other. However, the "error free" test did not have the smallest variance, as had been hypothesized. The test with
151 items nearer to 50 per cent level of difficulty had least variance of ability represented in the top 8.4 per cent of the score distribution. Rationale was not supported here. The second null hypothesis was that there would be equal variances of test scores for individuals in ability category 15 on the ”error free" test and the tests which had the fifth item too far and too near the mean value. The results in Table 17 show that the null hypothesis must be accepted. The largest variance was for the test with items nearer the 50 per cent level of difficulty as was hypothe— sized even though the results were not significant due to the assumptions of only 1000 individuals. The third null hypothesis was that there would be equal variances of ability level scores for the individuals ranked in the middle 10 per cent of the score distribution by the three tests. The results of these tests are shown in Table 16. The test with the items at the fifth stage near the 50 per cent level of difficulty had lower variance than other tests, but not significantly so. It was hypothesized from Lawley's work on the cumulative that the test with the difficulties away from the mean would have had the smallest variance. Rationale was not supported. The fourth null hypothesis was that there would be equal variances of test scores for individuals in ability category 8 on all three tests. Again the null hypothesis had to be accepted. The lowest variance was for the test with 152 TABLE 16 ANALYSIS OF THE VARIANCE OF ABILITY LEVEL SCORES FOR INDIVIDUALS AT SPECIFIED SCORE LEVELS FOR ONE "ERROR FREE” TEST AND TWO “ERROR IN DIFFICULTIES 0F FIFTH ITEMS" TESTS 5th Items 5th Items Significance "Error Free" Nearer Away from . Of Score Level Test 50% 50% Differences Top 8.4 % 2.30 2.25 2.33 n.s. Middle 10% 3.08 3.06 3.30 n.s. 
the fifth item nearer the 50 per cent level of difficulty. It had been hypothesized that the "least squares" test would have the smallest variance.

TABLE 17

ANALYSIS OF THE VARIANCE OF RANK SCORES FOR INDIVIDUALS AT SPECIFIED ABILITY LEVELS FOR ONE "ERROR FREE" TEST AND TWO "ERROR IN DIFFICULTIES OF FIFTH ITEMS" TESTS

                "Error Free"    5th Items      5th Items        Significance
Ability Level      Test         Nearer 50%     Away from 50%    of Differences
15               148.08         173.34         129.01             n.s.
8                  5.57           4.75           6.67             n.s.

Errors in Estimating Precision

Two tests were built to examine the error of estimating precision: one with rbis equal to .71 items at the second stage of the "least squares" (rbis = .75) test, and the other with rbis equal to .71 items at the fifth stage. These are discussed separately.

Errors at the second stage.--The first null hypothesis was that there would be equal variances of test scores for individuals in ability category 15 for the "error free" test and the test where the precision was lowered at the second stage. The variances of the "error free" and "error" tests were 5.57 and 5.94, respectively, for ability category 15. The F ratio was 1.06 and thus the null hypothesis had to be accepted. The variance increased with error as was expected, but not to a significant degree if only 1000 individuals were assumed to have taken the test.

The second null hypothesis was that there would be equal variances of test scores for individuals in ability category 8 for the "error free" test and the test where precision was lowered at the second stage. The variance of the "error free" test was 148.08 and that of the "error" test was 149.09. The F ratio was 1.01; again the variance increased as was hypothesized, but not significantly so if an N of 1000 was assumed.
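The F ratios quoted for the second-stage precision errors are quotients of the two reported variances, larger over smaller. The following is a minimal sketch for rechecking that arithmetic; the function name `f_ratio` and the dictionary labels are introduced here for illustration and are not from the thesis. The quotient 5.94/5.57 comes to roughly 1.066, which the text reports as 1.06.

```python
def f_ratio(var_a: float, var_b: float) -> float:
    """Return the larger variance divided by the smaller one."""
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    return max(var_a, var_b) / min(var_a, var_b)


# Variances of test scores as reported in the text for the "error free"
# test and the test with lowered precision at the second stage.
reported = {
    "ability category 15": (5.57, 5.94),
    "ability category 8": (148.08, 149.09),
}

# Hypothetical helper structure: one F ratio per ability category.
ratios = {label: f_ratio(*pair) for label, pair in reported.items()}

for label, value in ratios.items():
    print(f"{label}: F = {value:.3f}")
```

Whether a ratio of this size is significant depends on the degrees of freedom, which the text ties to the assumed N; the sketch computes only the ratio and does not attempt the significance test.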
It should be noted that the F ratio for the variances at ability category 15 was greater than the F ratio for the variances at level 8; i.e., errors at the second stage seemed to have a greater effect on extreme scores, as was anticipated.

The third null hypothesis was that there would be equal variances of ability level scores for individuals ranked in the top 8.4 per cent of the score distribution of each of these two tests. The variance of ability level scores for the top 8.4 per cent on the "error free" test was 2.30 and the variance of ability level scores on the "error" test was 2.37. The null hypothesis had to be accepted, but the variance did increase with error in item precision. Again, significance depended upon the value assumed for N.

Errors at the fifth stage.--The first null hypothesis was that there would be equal variances of test scores for individuals in ability category 15 for the "error free precision" test and the test with "error" in precision at the fifth stage. The "error free" test had a variance of test scores of 5.57 and the "error" test had a variance of 5.71 for ability category 15. The null hypothesis had to be accepted, but the variance was larger for the test with errors, as had been hypothesized. (Changes in the assumed N would change the significance test.)

The second null hypothesis was that there would be equal variances of ability level scores for individuals ranked in the top 8.4 per cent of the score distribution by the two tests. The "error free" test had a variance of 2.30 and the "error" test had a variance of 2.40. Again the null hypothesis was accepted, but the variance of the test with the error was larger, as hypothesized.

It had been assumed that at the middle ability level the effects of errors in precision would be slight. The variance for the "error free" test was 148.08 and for the "error" test was 150.69.
The difference between the variances of the two tests is slight, and the F ratio for the middle ability variances is the same as the F ratio of variances for category 15 individuals. This was as expected.

It had also been assumed that errors in precision at the second stage would be more serious than those at the fifth stage. In the variances of scores for high ability individuals, the error in precision at the second stage increased variance more than the error in precision at the fifth stage. (The variances were 5.57, 5.94, and 5.71 for the "error free," error at second stage, and error at fifth stage precision tests, respectively.) However, the variance of scores for middle ability level individuals was higher for the test with error in the fifth stage than for the test with error in the second stage. (The variances were 148.08, 149.09, and 150.69 for the "error free," error at second stage, and error at fifth stage precision tests, respectively.) The test with error at the fifth stage also had the highest variance of ability level scores for individuals in the top 8.4 per cent of the score distribution. (The variances were 2.30, 2.37, and 2.40 for the "error free," error at second stage, and error at fifth stage precision tests, respectively.) The hypothesis that errors in the second stage would be more serious than errors in the fifth stage was not confirmed.

General Comparisons

The three areas of general comparisons were effects of difficulty, effects of precision, and effect of pattern of items. There were no hypotheses made about these general comparisons. The information is presented to suggest new hypotheses and to aid in forming tentative conclusions.

Effects of difficulty.--In addition to the hypothesis testing material already reported, examination of Tables 18, 19, and 20 yields information on difficulty. Only the difficulty of the items has changed from column to column within any one of the three tables.
It should be noted that the distribution of difficulties designed to form a normal output of scores yielded the highest mean ability level for the top score, no matter what type of distribution was input. Also, the distribution of difficulties designed to produce a U-shaped output of scores yielded the greatest number of individuals in the extreme score irrespective of the type of distribution that

[Tables 18, 19, and 20: mean ability level and number of individuals (N) at each rank score, by expected pattern of output (normal, rectangular, U-shaped), one table per distribution of ability input. *Rank of the mean ability level of individuals at a score.]