This is to certify that the thesis entitled

An Item-Analysis Technique Based Upon Adjacent-Group Differences

presented by Robert Arthur Jackson has been accepted towards fulfillment of the requirements for the Ed. D. degree in Education.

H. W. Sundwall, Major professor
Date: December 2, 1952

AN ITEM ANALYSIS TECHNIQUE BASED UPON ADJACENT GROUP DIFFERENCES

By
Robert Arthur Jackson

A THESIS

Submitted to the School of Graduate Studies of Michigan State College of Agriculture and Applied Science in partial fulfillment of the requirements for the degree of

DOCTOR OF EDUCATION

Department of Education
1952

ACKNOWLEDGMENTS

To Dr. H. W. Sundwall, the author is very grateful for the patience, encouragement, and direction during the period of dissertation preparation.

To Dr. W. D. Baten is due gratitude for criticism of the statistical portion of the manuscript.

To Dr. M. Muntyan, the author is deeply indebted for his critical examination of the manuscript.

To Dr. V. H. Noll, the author wishes to extend his sincere thanks for his guidance and supervision in the period he directed the Guidance Committee.

AN ABSTRACT

The major purpose of this study was to present and evaluate a new item-analysis technique applicable in situations where the primary interest is in the discrimination between the members of two or more groups, rather than discriminating between the members within groups.
This type of problem occurs frequently in assigning letter grades and in selection work where individuals are to be divided into two groups (that is, those who may be expected to succeed and those who may be expected to fail in a particular situation). A procedure for computing the adjacent-group item-validity indices was presented. This adjacent-group technique resulted in a maximized ratio of between-groups variance to total variance.

It was assumed in this investigation that the test-score distribution should be the one best fitting the need of the particular situation where the test is to be used. In the case of two groups, the most discriminating test was found to be one that yields a score distribution with a point of partition at the abscissa of the minimal ordinate between the two group modes. A theoretical examination of the score distribution for two groups showed that a non-overlapping bimodal distribution may be obtained by selecting a sufficient number of appropriate items. In the theoretical comparison of the bimodal test-score distribution with a normal test-score distribution, it was demonstrated that the bimodal distribution resulted in fewer chance errors than the normal distribution. In the case of more than two groups, the distribution should have points of partition at the abscissa of the minimal ordinate between any two adjacent-group modes.

The empirical findings of this study indicated that the selected test items tended to be stable under cross-validation. The empirical studies on the error of measurement of the multimodal test-score distribution also showed this error to be minimal at the points of partition separating two adjacent groups; the tendency seemed to be that the error of measurement approaches zero at the points of partition. These findings were interpreted in terms of the small number of cases involved in the data.

Since test items discriminating perfectly between two adjacent groups are difficult to obtain, it is quite apparent that the adjacent-group technique for the selection of items is used to greater advantage in situations where a large source of test items is available. However, the adjacent-group technique can also be applied in situations where intra-group comparisons are to be made and the source of items is more limited. This technique was found to be as satisfactory as Horst's more laborious technique of maximizing function in the selection of the most valid items in terms of an external criterion. It was found to be superior to the technique of Flanagan.

TABLE OF CONTENTS

CHAPTER
I. INTRODUCTION
     Item analysis with a continuous score
     Item analysis with groups
     Item analysis to maximize validity
     Use of item analysis data
     Purpose of this study
II. THE THEORETICAL SOLUTION OF THE PROBLEM
     Procedure for item selection
     Theoretical analysis of the score distribution in the case of two groups
     Comparison of the bimodal distribution and the normal distribution
III. DATA RELATED TO THE PROBLEM
     Stability of the selected items under cross-validation
     Comparison of the adjacent-group technique and two other techniques
IV. SUMMARY AND CONCLUSIONS
BIBLIOGRAPHY

LIST OF TABLES

TABLE
I. Means, Ranges, and Frequencies for the Two Groups on the 59-Item Test
II. Frequency Distribution of Each Sample on the 12-Item Test
III. Original Test Scores for the Individuals Incorrectly Placed on the 12-Item Test
IV. Original Test Scores for the Individuals Incorrectly Placed on the 28-Item Test
V. Means, Variances, and Frequencies for the Five Groups on Three Tests
VI. Reliability Estimates for the Four Tests
VII. Test of the Significance of the Difference Between the Validity Coefficient Estimates of the Two 31-Item Tests
VIII. Test of the Significance of the Difference Between the Validity Coefficient Estimates of the Two 59-Item Tests
IX. Reliability Estimates for the Two 30-Item Tests
X. Validity Estimates for the Two 30-Item Tests
XI. Test of the Significance of the Difference Between the Validity Coefficient of Form I and the Sub-test Selected from Form I by the Adjacent-Group Method

LIST OF FIGURES

FIGURE
1. The Number of Items Required to Make a Discrimination Between Two Groups as a Function of the Test Reliability and the Number of Alternatives for Each Item
2. The Frequency Distribution of the Total Score and the Magnitude of the Means of the Means of the Squared Differences Throughout the Total Score Range

CHAPTER I

INTRODUCTION

In the construction of a test for measurement purposes, the test writer is confronted with a twofold problem. He must adjust the length of the test to stay within the amount of time available and to avoid fatiguing the individuals taking the test. At the same time, he must make certain that the items in the test constitute an adequate sampling of all possible items related to the trait being measured. A basic purpose of a test is to place individuals along a scale for measurement of a given trait in accordance with real differences.
This means that a test used in the measurement of a trait must possess discriminative power; and since tests are made up of individual items, each item should contribute to this discrimination.

The original construction of items, which are to represent the theoretical pool of possible items, depends upon the skill and judgment of the test writer. Since the personal judgment of an individual is subject to error, many statistical processes, called item analysis techniques, have been utilized to evaluate each of the test items. All item analysis techniques are subject to certain limitations: (1) no item analysis technique can by itself turn poor items into good items or operate satisfactorily without a reliable criterion; (2) the results obtained by item analysis techniques must be understood before they may be used efficiently; (3) item analysis results from one experimental group may not be exactly parallel for another group; and (4) the item analysis data should supplement, not supplant, subjective opinion.

The test items are classified as satisfactory or unsatisfactory by examining two statistical characteristics for each item: (1) the difficulty of each item (percent of the students failing to succeed on the item); and (2) an index of discrimination (degree to which the item is effective in differentiating between those who are high and those who are low in respect to the trait being measured). A satisfactory item would not be failed or passed by all of the students; it would be passed by students who possess the trait to a high degree more often than students who possess the trait to a low degree. An item which was not satisfactory would be passed by the lower individual more often than the higher one. Unsatisfactory test items occur when the item and the general criterion are not measuring the same trait. Satisfactory test items should function well; they should have a firm theoretical basis.
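
The two item characteristics just named can be illustrated with a minimal sketch. It assumes items scored 1 (pass) or 0 (fail) and a numeric criterion score for each student; the function names and the data are hypothetical, and the half-split discrimination index is only one simple choice among the many indices reviewed in this chapter, not a procedure taken from the sources cited.

```python
from statistics import mean


def item_difficulty(item):
    """Percent of students failing the item (difficulty as defined in the text)."""
    return 100.0 * sum(1 for x in item if x == 0) / len(item)


def discrimination_index(item, criterion):
    """Pass rate in the higher-scoring half minus pass rate in the lower-scoring half.

    This half-split index is an illustrative assumption, not a method from the text.
    """
    order = sorted(range(len(item)), key=lambda i: criterion[i])
    half = len(item) // 2
    low = [item[i] for i in order[:half]]
    high = [item[i] for i in order[-half:]]
    return mean(high) - mean(low)


# Hypothetical responses of eight students to one item, with their criterion scores.
item = [1, 1, 1, 0, 1, 0, 0, 0]
criterion = [40, 38, 35, 33, 30, 25, 22, 18]

print(item_difficulty(item))                  # percent failing
print(discrimination_index(item, criterion))  # positive when high scorers pass more often
```

A positive index corresponds to the satisfactory case described above; a negative index corresponds to an item passed more often by the lower students.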

It seems desirable to point out that a fundamental assumption, underlying all item analysis techniques, is that the items differ from one to another in respect to difficulty and discrimination. Merrill stated:

     If the items are heterogeneous with respect to validity, one can say with some confidence that the most valid items in one sample will in general be the most valid in any other sample, and the use of good items for predictive purposes is therefore justified. In the event, however, that there is a strong probability of the items being homogeneous, there is no justification for any selection.¹

¹ W. W. Merrill, "Sampling Theory in Item Analysis," Psychometrika.

A great variety of procedures have been employed to determine which items should be selected for a test. These procedures yield statistical data to be used as a guide in assembling the final form of the test, and they do not take the place of ability in item construction. Useful surveys of indices of validity or consistency have been provided by Lentz, et al.,¹ Lindquist and Cook,² Zubin,³ Long and Sandiford,⁴ Guilford,⁵ Swineford,⁶ and Davis.⁷

When item selection techniques are applied, two major types of situations are encountered. In the first type, we are relating the performance of the individuals on an item to their performance on some type of continuous measure. This continuous measure is usually the total score on the test, but it may be some external criterion. The second type is one in which the individuals' performance on the item is related to a dichotomous grouping on the criterion variable.

¹ T. F. Lentz, B. Hirshstein, and J. H. Finch, "Evaluation of Methods of Evaluating Test Items," Journal of Educational Psychology, XXIII, 344-350, 1932.
² E. F. Lindquist and W. W. Cook, "Experimental Procedures in Test Evaluation," Journal of Experimental Education, I, 163-185, 1933.
³ J. Zubin, "The Method of Internal Consistency for Selecting Test Items," Journal of Educational Psychology, XXI, 345-356, 1934.
⁴ J. A. Long and P. Sandiford, The Validation of Test Items. Toronto: Department of Educational Research, University of Toronto, Bulletin No. 3, 1935, pp. 126.
⁵ J. P. Guilford, Psychometric Methods. New York: McGraw-Hill, pp. 428-437, 1936.
⁶ F. Swineford, "Validity of Test Items," Journal of Educational Psychology, XXVII, 68-78, 1936.
⁷ F. B. Davis, Chapter 9, "Item Selection Techniques," in E. F. Lindquist, et al., Educational Measurement. Washington: American Council on Education, pp. 266-328, 1951.

Item Analysis With a Continuous Score

When the variable with which the item is being analyzed is continuously distributed, two statistical approaches are possible. In one, the degree of relationship between success on the item and the criterion score is determined by using $r_{bis}$, $r_p$, $\eta$, and $V_{mlb}$.¹ All of these indices are dependent upon the proportion of the group answering the item correctly, the standard deviation of the criterion scores of the entire group, and the difference between the mean criterion score of the students answering the item correctly and the mean criterion score of the students answering the item incorrectly. The second approach is dependent upon the difference in the criterion scores of the individuals passing the item and those failing. Two statistical techniques used to indicate whether there is a significant relationship between the performance of the students on the item and the criterion are the standardized difference between the means and the F-ratio.
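
As a hedged illustration of how the three quantities named above (the proportion passing, the standard deviation of the criterion scores, and the difference between the two mean criterion scores) combine into a single index, the following sketch computes the standard point-biserial coefficient. The function name and the data are hypothetical, and the sketch does not claim to reproduce the exact formulas given by Long and Sandiford.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(item, criterion):
    """Standard point-biserial correlation of a 0/1 item with a continuous criterion."""
    passed = [c for x, c in zip(item, criterion) if x == 1]
    failed = [c for x, c in zip(item, criterion) if x == 0]
    p = len(passed) / len(item)   # proportion answering the item correctly
    q = 1.0 - p                   # proportion answering the item incorrectly
    sigma = pstdev(criterion)     # standard deviation of the entire group
    return (mean(passed) - mean(failed)) / sigma * sqrt(p * q)


# Hypothetical data: four students, one item, criterion scores 4, 3, 2, 1.
print(point_biserial([1, 1, 0, 0], [4, 3, 2, 1]))
```

The value agrees with the ordinary product-moment correlation computed between the 0/1 item scores and the criterion scores, which is the sense in which such indices measure "degree of relationship."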

The simple difference between the mean criterion score of the individuals answering the item right and those answering it wrong, or the overlapping methods derived from the proportion of the individuals failing the item whose criterion scores exceeded the median scores of those passing the item, yield rough indications of difference.²

¹ Long and Sandiford, op. cit., pp. 24-29.
² Ibid.

Item Analysis With Groups

Some simplified item analysis procedures have been developed for use when the criterion scores are treated as a dichotomy. When the criterion variable is a natural dichotomy these techniques must be used; these procedures may also be used when a continuous criterion is divided into a dichotomy for ease of computation. In treating a continuous variable as a dichotomy, an arbitrary dividing line is set up for the continuous score; those individuals falling below the dividing line constitute one group and those with scores greater than the dividing score constitute the other. Most of the techniques used with a dichotomous criterion are computed from the cell entries of a fourfold table. The techniques applicable in this situation are either correlation-methods or difference-methods. The degree of relationship between success on the item and success on the criterion can be measured by a tetrachoric coefficient of correlation, a coefficient of colligation, or a phi coefficient.¹ The other methods depend upon the percent of the upper and the percent of the lower groups getting the item right. The simple difference between the two percentages is the easiest to compute but a chi-square comparison is preferable because it indicates the significance of a difference.²

The selection of test items with either a continuous score or a dichotomy requires a considerable amount of time for the computation. To reduce this computational time, extreme groups are used for item selection purposes.
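
The fourfold-table indices mentioned above can be sketched as follows. The sketch uses the standard phi coefficient and the uncorrected chi-square with one degree of freedom; the function name, argument names, and cell counts are hypothetical and are not drawn from the studies cited.

```python
from math import sqrt


def phi_and_chi_square(upper_pass, upper_fail, lower_pass, lower_fail):
    """Phi coefficient and chi-square (1 d.f., no continuity correction) from a fourfold table."""
    a, b, c, d = upper_pass, upper_fail, lower_pass, lower_fail
    n = a + b + c + d
    phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
    chi_square = n * phi ** 2  # identity relating chi-square to phi for a 2 x 2 table
    return phi, chi_square


# Hypothetical item: 8 of 10 in the upper group pass, 3 of 10 in the lower group pass.
phi, chi_square = phi_and_chi_square(8, 2, 3, 7)
print(phi, chi_square)
```

The chi-square value, unlike the simple difference in percentages, can be referred to a table of significance levels.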
These short-cut methods economize on the time by sacrificing the quantitative nature of a continuous test score distribution. If the relationship of item score to test score is linear, so that the percentage of successes on the items increases as the total score increases, the differences on a single item between the upper and lower groups will be sharpened by taking extreme groups. However, the increased sharpness of discrimination is somewhat offset by a loss of information which results from excluding some cases in the middle of the test score distribution.

¹ P. E. Vernon, "Indices of Item Consistency and Validity," British Journal of Psychology, Statistical Section, I, 152-66, 1948.
² Ibid.

The use of extreme groups necessitates balancing the sharpness of the discrimination and the stability of the indices. Kelley¹ has presented the mathematical proof that the upper and lower 27 percent of a sample are the optimum groups to use, provided the difference in the criterion scores among the members of each group is not utilized. The 27 percent maximizes the critical ratio based upon the difference between the means of the two groups. Each item and the criterion score are regarded as normally distributed and continuous variables. Kelley² also outlined a procedure for estimating a product-moment correlation coefficient between the item and the criterion score, excluding the item in question.

Three techniques have been developed that are applicable where the two extreme groups each constitute 27 percent of the total group. A summary description of these methods follows.

(1) Biserial r ($r_{bis}$) approximation by Flanagan's method³

A table containing the values of the correlation coefficients in a normal bivariate surface corresponding to various combinations of proportions in the upper and lower 27 percent of the group was constructed. A normal bivariate surface assumes a normal distribution underlying both the dichotomous item response and the criterion variable; it also assumes rectilinearity of regression. To use this technique, it is necessary to determine the number of the high group that answered the item correctly and express this as the proportion of the high group; a similar number is obtained for the low group. These two proportions are looked up in the table and the approximate correlation coefficient is found.

¹ T. L. Kelley, "The Selection of Upper and Lower Groups for the Validation of Test Items," Journal of Educational Psychology, XXX, pp. 17-24, 1939.
² Ibid.
³ J. C. Flanagan, "General Considerations in the Selection of Test Items and a Short Method of Estimating the Product-Moment Coefficient from the Data at the Tails of the Distribution," Journal of Educational Psychology, XXX, pp. 674-80, 1939.

(2) z method¹

Since equal increments in $r_{bis}$ do not represent equal increments in discriminating power, Davis transformed the r's into z's and converted z to a scale with a mean of fifty and a standard deviation of twenty-one. The z values may be added, subtracted, or averaged. A chart is provided from which one may read off the difficulty (expressed in sigma units) and the discrimination indices corresponding to various values of the upper and lower proportions of success. To use this chart the criterion scores and the percent knowing the correct answer to an item are corrected for chance.

(3) Probable error of percent difference

Votaw² and Arnold³ gave formulae and nomographs for reading off the probable error of the percent difference in the upper and lower 27 percents.

¹ F. B. Davis, Item-Analysis Data: Their Computation, Interpretation and Use in Test Construction (Harvard Education Papers, No. 2). Cambridge: Graduate School of Education, Harvard University, 1946.
² D. F. Votaw, "Graphical Determination of Probable Error in Validation of Test Items," Journal of Educational Psychology, XXVI, 682-86, 1935.
³ J. N. Arnold, "Nomogram for Determining Validity of Test Items," Journal of Educational Psychology, XXVI, 151-53, 1935.

The difference-methods subsumed under the rubric dichotomous groups¹ are applicable using the extreme 27 percents of the criterion group. The level of significance for the difference-method indices is dependent upon the number of cases in the groups; this makes it impossible to compare, by a statistical test, indices based on extreme groups with those based on dichotomous groups.

Other techniques of item validation have been proposed. One of these, the double tetrachoric index,² is computed by dividing the criterion into three groups on the basis of the thirty-third and the sixty-sixth percentiles and averaging the tetrachoric correlations obtained by the two splits. A simplified formula for the product-moment correlation coefficient of a dichotomous variable with a multiple-categoried variable, when the criterion is coded 2, 1, 0, -1, -2 to yield a rectangular distribution, is

$$r = \frac{2a + b - d - 2e}{\sqrt{2k(n - k)}}$$

where n = number of persons, k = number of persons selecting the correct response, and a, b, c, d, and e respectively denote the frequency of correct choices by the five coded groups.³

Although many item selection techniques have been presented that are based upon the relationship between the item and the criterion, when one considers the computational time and the stability of the discrimination indices, Flanagan's method appears to be the most satisfactory one to use.

¹ P. 4.
² Vernon, loc. cit.
³ D. C. Adkins and H. A. Toops, "Simplified Formulas for Item Selection and Construction," Psychometrika, II, 165-171, 1937.
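
The first step shared by the extreme-group methods described in this section, namely forming the upper and lower 27 percent groups on the criterion and finding the proportion of each that passed the item, can be sketched as follows. The function name, the rounding rule for the group size, and the data are assumptions made for illustration; the sketch stops at the two proportions and does not reproduce Flanagan's table or Davis's chart.

```python
def extreme_group_proportions(item, criterion, fraction=0.27):
    """Proportion passing the item in the upper and lower `fraction` of the criterion."""
    n = len(item)
    size = max(1, round(fraction * n))  # assumed rounding rule for the group size
    order = sorted(range(n), key=lambda i: criterion[i])
    lower = [item[i] for i in order[:size]]
    upper = [item[i] for i in order[-size:]]
    p_upper = sum(upper) / size
    p_lower = sum(lower) / size
    return p_upper, p_lower, p_upper - p_lower


# Hypothetical data: eleven students with criterion scores 1 through 11.
criterion = list(range(1, 12))
item = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]
print(extreme_group_proportions(item, criterion))
```

The two proportions returned are the quantities that would be looked up in a table such as Flanagan's; their difference is the simple difference index.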

Item Analysis to Maximize Validity

In most cases it is probable that the selection of items to increase reliability will increase validity; however, it has been demonstrated that it is possible to increase reliability and decrease validity or increase validity while decreasing reliability.¹ Thus it is found that mere selection of items correlating highest with an external criterion does not necessarily produce the most valid test.

The ideal test is one composed of items which correlate highly with an external criterion and poorly with one another. Theoretically, if suitable external criterion scores were available, correlation coefficients between each item and the criterion could be obtained; the intercorrelations between test items could also be obtained. A multiple regression weight could be computed for each item and those items having regression weights significantly different from zero at a specified level of confidence could be selected for the final form of the test. Since the computations necessary to determine which combination of items would yield the largest multiple correlation coefficient are laborious, many approximation methods to this multiple regression problem have been suggested. The method of successive residuals² and the L-method³ depend on building up successive composites of the most valid items; they require fewer item intercorrelations than the multiple regression solution, but the task is a lengthy one. A presentation of these two techniques is given by the authors.

¹ H. E. Brogden, "Variation in Test Validity with Variations in the Distribution of Item Difficulties, Number of Items and Degree of Their Intercorrelations," Psychometrika, XI, 197-214, 1946.
² A. P. Horst, "Item Analysis by the Method of Successive Residuals," Journal of Experimental Education, II, 254-63, 1934.
³ H. A. Toops, "The L-Method," Psychometrika, VI, 249-66, 1941.

Richardson and Adkins¹ presented a simple approximation to the multiple correlation procedure that compared favorably with the L-method.² This formula is

[equation illegible in the source copy]

where y = criterion variable, x = test variable, and u = any test item.³

Flanagan⁴ adapted the method of solving for regression coefficients by means of successive approximations to provide an item selection method. A second technique proposed by Horst,⁵ the maximizing function, is dependent on the ratio of the validity of the item to the consistency of the item. Items are selected that correlate highly with the external criterion and poorly with the test score.

¹ M. W. Richardson and D. C. Adkins, "A Rapid Method of Selecting Test Items," Journal of Educational Psychology, XXIX, 547-52, 1938.
² H. A. Toops, loc. cit.
³ Richardson and Adkins, op. cit., p. 549.
⁴ J. C. Flanagan, "A Short Method for Selecting the Best Combination of Test-Items for a Particular Purpose," Psychological Bulletin, XXXIII, 603-4, 1936. (Seen in abstract only.)
⁵ A. P. Horst, "Item Selection by Means of a Maximizing Function," Psychometrika, I, 229-44, 1936.

Use of Item Analysis Data

In the construction of a test, the distribution of test scores may be predetermined, within limits, by the proper selection of items with certain difficulty indices. The following formulae show that the sample mean and variance are functions of the item difficulty indices and the interaction between items.

$$\bar{X} = \sum_{i} p_i, \qquad \sigma^2 = \sum_{i} p_i q_i + 2 \sum_{i<j} \left( p_{ij} - p_i p_j \right)$$

where $p_i$ = the proportion of the individuals in the sample passing item i, $q_i = 1 - p_i$, and $p_{ij}$ = the proportion of the individuals in the sample passing both items i and j.

The symmetry or asymmetry of the distribution of test scores, the skewness, kurtosis, or modality are also functions of the item difficulty indices and the item interactions.
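
The dependence of the score mean and variance on the item indices can be checked numerically. The sketch below assumes the formulae are read as: the mean equals the sum of the $p_i$, and the variance equals the sum of the $p_i q_i$ plus twice the sum over item pairs of $(p_{ij} - p_i p_j)$. The function name and the response matrix are hypothetical.

```python
from itertools import combinations
from statistics import mean, pvariance


def score_moments_from_items(responses):
    """Mean and variance of total scores computed only from item statistics.

    `responses` is a list of rows, one per person, each a list of 0/1 item scores.
    """
    n = len(responses)
    k = len(responses[0])
    p = [sum(row[i] for row in responses) / n for i in range(k)]
    mean_score = sum(p)
    variance = sum(p_i * (1 - p_i) for p_i in p)
    for i, j in combinations(range(k), 2):
        p_ij = sum(row[i] * row[j] for row in responses) / n  # proportion passing both items
        variance += 2 * (p_ij - p[i] * p[j])
    return mean_score, variance


# Hypothetical data: four persons, three items.
responses = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]
totals = [sum(row) for row in responses]

print(score_moments_from_items(responses))
print(mean(totals), pvariance(totals))  # the same values, computed directly from total scores
```

Both lines give the same mean and variance, which illustrates why the score distribution can be shaped, within limits, by the choice of items.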

Since the score distribution properties are dependent upon the item indices, items should be selected which will yield a score distribution best serving the purpose for which the test is to be used. No one frequency distribution exists which would be ideal for all testing situations. In objective testing, extensive use has been made of the normal test score distributions because: (1) the ability being measured was assumed to be normally distributed, and (2) the statistical methods applied in the theory of measurement are based on normal probability theory. The first reason is meaningless unless the ability is given an operational definition; normality is usually assumed on a philosophical basis. The second reason is not relevant since the test was constructed for a specific purpose other than the application of statistical methods to the obtained data.

There are certain testing situations where the primary purpose is to classify individuals into two or more groups; no attempt is made to identify differences between the individuals within a particular group. Some personality testing is undertaken to measure the presence or the absence of a trait, with no attempt being made to measure the intensity of the trait. A test best serving the purpose of this situation would have sufficient accuracy at the critical point of partition to insure that the classification of a student into one of the two groups was not a result of chance fluctuations. In the measurement of interest, a test is desired which would classify individuals into certain interest groups. The test would need accuracy of distinction between groups rather than within groups. In achievement testing for the purpose of assigning grades on the basis of a specified point scale, a test should distinguish students receiving one grade from students receiving other grades.
Test items selected by the item analysis techniques now in use generally result in a test score distribution not significantly different from a normal distribution. Consequently, various mathematical transformations are applied to the raw score data to obtain the critical points of partition between groups.

Purpose of This Study

This study was undertaken to develop and test a new technique for the selection of those items most applicable in testing situations where it is desired to place individuals into a number of mutually exclusive groups. A technique was desired which would select the test items yielding a maximum discrimination between the groups. To obtain this maximum difference between the mean scores of the groups, it is necessary to maximize the ratio of the between groups variance to the total variance.

The theoretical aspects of the problem consist of the presentation of the technique and an analysis of its effectiveness. To test the worth of this technique an analysis of the resulting test score distribution is necessary; it is also necessary to compare the resulting distribution with a normal test score distribution to see which is more efficient in differentiating between two groups of individuals.

It is possible that a theoretical proof is true even though such a technique does not work in an actual situation. For this reason, some empirical data are necessary in order to determine the practical value of the technique. The primary need of any statistical technique is that it is stable and consistent when reapplied in another situation; to investigate the stability of the new technique it is necessary to use cross-validation procedures. Empirical data are necessary to investigate whether the new technique results in a score distribution with minimum error at the critical score points between two adjacent groups. Also, a comparison with other techniques seems desirable.
Since it is not practical to compare the new procedure with all the present procedures, it will be compared with Flanagan's technique, which is based on item-criterion relationship only, and the method of maximizing function,¹ which is one of the better techniques based upon both the item-criterion and the item-item relationship.

¹ A. P. Horst, loc. cit.

CHAPTER II

THE THEORETICAL SOLUTION OF THE PROBLEM

The assumption underlying the placing of individuals within a specified grouping arrangement should be that there exists a real difference between the members of different groups and that the members of any one group are fairly homogeneous with respect to the trait being measured. Any test utilized for grouping purposes should yield an array of scores for a particular group that does not overlap the score distribution of any of the other groups.

In the ideal case, with perfect items, individuals could be placed into m categories with m - 1 items. Item one would be failed by group 1 and passed by groups 2, 3, ..., m; item two would be failed by groups 1 and 2 and passed by groups 3, 4, ..., m; item m - 2 would be failed by groups 1, 2, 3, ..., m - 2 and passed by groups m - 1 and m; the last item, m - 1, would be failed by all groups except group m. These m - 1 perfect items would yield a score distribution where all the individuals in group 1 received a score of zero, those in group 2 a score of one, those in group m - 1 a score of m - 2, and those in group m a score of m - 1. It is not practical to utilize a single item because (1) a single item is subject to fluctuation in response from trial to trial, and (2) the correlation between an item and the criterion being predicted by the total test is so low that an item curve is comparatively flat and not representative of the total test discrimination. The many factors that operate to reduce the efficiency of a test result in a greater likelihood of error in predicting when a single item is used than when the total test is used.
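
The ideal case of m - 1 perfect items can be made concrete with a small sketch. It assumes groups numbered 1 (lowest) through m (highest) and items numbered 1 through m - 1, with item j failed by groups 1 through j and passed by all higher groups; the function name is hypothetical.

```python
def perfect_item_scores(m):
    """Total score earned by each of m groups on m - 1 perfect items."""
    scores = {}
    for group in range(1, m + 1):
        # Item j is failed by groups 1..j and passed by groups j+1..m.
        responses = [1 if group > j else 0 for j in range(1, m)]
        scores[group] = sum(responses)
    return scores


print(perfect_item_scores(5))  # each group g receives the score g - 1
```

With five groups and four perfect items, the scores are 0, 1, 2, 3, and 4, so the score distributions of the groups do not overlap.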
Since perfection is not likely attainable, a number of items that function within chance limits of a perfect item might serve the same purpose as one perfect item. To place students into m categories we would need m - 1 groups of items, with each group of items approximating a perfect test item functioning about a particular critical point separating two adjacent groups.

A technique is proposed for the selection of test items when the trait being measured is on a continuum and the groups are separated by a specified number of critical points along the continuum. The application of this technique in the situation where the test is measuring traits not on a continuum but rather on two or more continua would possibly result in a single item's functioning at a critical point on more than one of the continua. In this situation it is necessary to score the tests on the basis of sub-parts identifiable with one of the continua; this could result in a single item's being included in the scoring arrangement of more than one sub-test. It should be noted that if a single trait is being measured, an item would function at only one critical point; an individual's performance on the test would be indicated by a single score. In the case of two or more continua, an item may function at one critical point on one or more continua; an individual's performance must be indicated by more than one score.

Procedure for Item Selection

According to the above discussion, the procedure for the selection of test items where individuals are to be classified into m groups would be as follows:

1. Classify the individuals into the proper one of the m groups on the basis of either an external or internal criterion.
2. Select a sample of size $n_i$ from group i. The calculations will be simplified if all $n_i$ are equal.
3. For each test item determine the number of each group that answered the item correctly.
4. On an a priori basis obtain the theoretical frequency of group i.
This is obtained by assuming a perfect test item, so that all $n_i$ people of a group would pass the item if that group were above the critical point at which the item was discriminating, and all $n_i$ people of a group below the critical point would fail the item.
5. The observed frequency is obtained by examining the number of each group that answered the item correctly. If the group is above the critical point, the observed frequency is equal to the number of the group that answered the item correctly. The number of the group that failed the item we will denote by $e_i$. The observed frequency of successful predictions by the item for group $i$ is then equal to $n_i - e_i$. If the group is below the critical point, we would predict that all the members of the group would fail the item. The number of errors is thus equal to the number of the group that answered the item correctly. If we also designate this number by $e_i$, the observed frequency of successful predictions is also equal to $n_i - e_i$.
6. Using the above theoretical and observed frequencies, we use chi-square to test whether the observed frequencies deviate significantly from the theoretical frequencies.
7. By specifying the chi-square limits of acceptance and rejection, we can identify the items that are acceptable at the various critical points.
8. If we restrict ourselves to chance deviations from the theoretical frequencies, the value of $e_i$ is equal to $q n_i$, where $q$ is the probability of getting an item right on the basis of chance alone. For a test item of $a$ alternatives, $q$ is equal to $1/a$. Using these limits we have an acceptance point for chi-square equal to $\sum_i q^2 n_i$.

The chi-square value is obtained by the formula

$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$,

where $f_o$ = observed frequency and $f_e$ = expected frequency.

An example of the procedure in the case of five groups is presented for clarification. Let I denote the highest group and V the lowest, and let all $n_i$ equal 10.
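Before the tabulated example, steps 4 through 8 can be sketched in Python. This is a minimal illustration, not the author's own computing routine: the function names and the argument `groups_above` are hypothetical, groups are assumed to be listed from highest to lowest, and the acceptance point is read as $\sum_i q^2 n_i$, which follows from setting $e_i = q n_i$ in the chi-square formula. The counts used are those of the five-group example.

```python
def adjacent_group_chi_square(n, correct, groups_above):
    """Chi-square of one item against a perfect item at one critical point.

    n            -- group sizes n_i, ordered from the highest group to the lowest
    correct      -- number in each group answering the item correctly
    groups_above -- how many of the leading (highest) groups lie above the
                    critical point and are therefore expected to pass
    """
    chi_square = 0.0
    for index, (n_i, right) in enumerate(zip(n, correct)):
        if index < groups_above:
            e_i = n_i - right   # errors: members expected to pass who failed
        else:
            e_i = right         # errors: members expected to fail who passed
        f_e = n_i               # theoretical frequency under a perfect item
        f_o = n_i - e_i         # observed successful predictions
        chi_square += (f_o - f_e) ** 2 / f_e
    return chi_square


def chance_acceptance_point(n, alternatives):
    """Acceptance point sum(q**2 * n_i) with q = 1/a (assumed reading of step 8)."""
    q = 1.0 / alternatives
    return sum(q ** 2 * n_i for n_i in n)


sizes = [10, 10, 10, 10, 10]
right = [10, 8, 4, 4, 1]
value = adjacent_group_chi_square(sizes, right, groups_above=2)
print(round(value, 2))
```

With the critical point placed between groups II and III, the sketch reproduces the chi-square value of 3.70 obtained in the example. An item would be retained at that critical point when its value falls below the chosen acceptance point.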
Groups (from high to low)                              I     II    III   IV    V
Number in each group                                   10    10    10    10    10
Number in each group answering the item correctly      10     8     4     4     1

This item tends to separate groups I and II from groups III, IV, and V. For a perfect test item operating at this critical point we would expect all the individuals of groups I and II to pass and the other individuals to fail. The chi-square value is calculated as follows:

Groups (from high to low)    I     II    III   IV    V
f_e                          10    10    10    10    10
f_o                          10     8     6     6     9
|f_o - f_e|                   0     2     4     4     1
(f_o - f_e)^2                 0     4    16    16     1
(f_o - f_e)^2 / f_e         .00   .40  1.60  1.60   .10

$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} = 3.70$

It is unlikely that we can find a sufficient number of items which will satisfy the chance deviation limits; we may then use some other chi-square value, based on predetermined levels of significance, for the acceptance point. A satisfactory item has a chi-square value less than the acceptance value.

Theoretical Analysis of the Score Distribution in the Case of Two Groups

Let us denote the higher group by 1 and the lower group by 2. Consider a test consisting of $k$ items with $a$ alternatives, and assume that the credit given is either 1 or 0 depending on whether the response is correct or incorrect. Let us further assume that for each item the probability of success is $(a-1)/a$ for group 1 and $1/a$ for group 2. If we denote the probability of success for group 1 on item $i$ by $p_i$, we consider the case of $k$ trials with different probabilities of success $p_i$, where $i = 1, 2, \ldots, k$. Aitken1 has shown that this type of distribution has

(1) $\sigma_1^2 = \sum_{i=1}^{k} p_i q_i$, where $q_i = 1 - p_i$.

Let $\bar{p}$ represent the mean probability of success for group 1,

$\bar{p} = \frac{1}{k}\sum_{i=1}^{k} p_i$;

the variance of the probability in the $k$ trials is

(2) $\sigma_p^2 = \frac{1}{k}\sum (p_i - \bar{p})^2$.

We have

(3) $\sum p_i q_i = k\bar{p} - k\bar{p}^2 - \sum (p_i - \bar{p})^2$.

Substituting the value for the last term on the right from equation (2), we have

(4) $\sum p_i q_i = k\bar{p} - k\bar{p}^2 - k\sigma_p^2$.

Simplifying equation (4), we obtain

(5) $\sum p_i q_i = k\bar{p}\bar{q} - k\sigma_p^2$, where $\bar{q} = 1 - \bar{p}$.

Substituting from equation (5) in equation (1), we have

(6) $\sigma_1^2 = k\bar{p}\bar{q} - k\sigma_p^2$.

It is apparent that $\sigma_1^2$ will be a maximum for a given mean probability when $\sigma_p^2 = 0$. If we let $\sigma_p^2 = 0$, it follows that

(7) $\sum (p_i - \bar{p})^2 = 0$.

Hence $p_i = \bar{p}$ for all $i$. In a similar manner it may be shown that

(8) $\sigma_2^2 = k\bar{p}'\bar{q}' - k\sigma_{p'}^2$,

where $\bar{p}'$ is the mean probability of group 2, and $\sigma_2^2$ is a maximum when $p_i' = \bar{p}'$ for all $i$.

1 A. C. Aitken, Statistical Mathematics, 2nd Edition, Interscience Publishers Inc., New York, N. Y., 1942, pp. 50-51.

In the theoretical analysis of the problem we will choose $p_i$ so that the maximum variance is obtained, which means that $p_i = (a-1)/a$, $i = 1, 2, \ldots, k$, for group 1 and $p_i' = 1/a$, $i = 1, 2, \ldots, k$, for group 2. The distribution of scores for group 1 is characterized by

(9) $\bar{x}_1 = \frac{k(a-1)}{a}$, $s_1^2 = \frac{k(a-1)}{a^2}$.

The mean and variance for group 2 are

(10) $\bar{x}_2 = \frac{k}{a}$, $s_2^2 = \frac{k(a-1)}{a^2}$.

It is now necessary to determine whether the observed score distributions, identified by equations (9) and (10), classify the individuals into one of two groups with a small probability of error. The exact amount of error could be determined if the true scores for all individuals were known. Since true scores are unattainable, reasonable limits for the difference between the true scores of the individuals must be expressed in terms of the observed scores. To derive the relationship between true score differences and observed score differences, it is necessary to make some assumptions regarding the relationship between observed scores and true scores. The relationship between the observed scores, true scores, and error scores is assumed to be

(11) $x_i = t_i + e_i$,

where $x_i$ = observed deviation score of individual $i$, $t_i$ = true deviation score of individual $i$, and $e_i$ = error deviation component of individual $i$. All errors are assumed to be random errors and are such that

(12) $\bar{e} = 0$, $r_{te} = r_{e_i e_j} = 0$.

In the derivation of the relationship between true score and observed score differences, the summation index is omitted for ease of presentation. The summation is over $i$ unless otherwise stated.

The relationship between the variance of the true, the error, and the observed scores is determined. From equation (11), we have

(13) $x = t + e$.

Squaring and summing gives

(14) $\sum x^2 = \sum t^2 + \sum e^2 + 2\sum et$.

Dividing both sides by $N$, we have

(15) $s_x^2 = s_t^2 + s_e^2 + 2 r_{te} s_t s_e$.

From (12), we see that the last term of equation (15) is zero, and we have

(16) $s_x^2 = s_t^2 + s_e^2$.

We may solve for the variance of true scores in terms of the reliability of the test and the variance of the observed scores. The correlation between two parallel tests is defined, from elementary statistics, as

(17) $r_{12} = \frac{\sum x_1 x_2}{N s_1 s_2}$,

where $x_1$ and $x_2$ are the scores of an individual on tests 1 and 2, and $s_1$ and $s_2$ are the standard deviations on tests 1 and 2. From (13) we may express the numerator on the right side of (17) as follows:

(18) $\sum x_1 x_2 = \sum (t_1 + e_1)(t_2 + e_2)$.

Expanding equation (18) gives

(19) $\sum x_1 x_2 = \sum t_1 t_2 + \sum e_1 t_2 + \sum e_2 t_1 + \sum e_1 e_2$.

From the definitions of (12) we see that the last three terms of equation (19) are each zero. Since we have parallel tests, the true score on 1 and the true score on 2 are equal. Therefore equation (19) becomes

(20) $\sum x_1 x_2 = \sum t^2$.

We may divide both sides of equation (20) by $N$, and from the definition of a variance we see that

(21) $\frac{\sum x_1 x_2}{N} = s_t^2$.

Substituting equation (21) in equation (17) gives

(22) $r_{12} = \frac{s_t^2}{s_1 s_2}$.

Since tests 1 and 2 are parallel, $s_1 = s_2$ and we see that

(23) $s_t^2 = r_{12} s_x^2$,

where $s_x = s_1 = s_2$.
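The result in (23), that the correlation between two parallel tests equals the ratio of true-score variance to observed-score variance, can be checked numerically. The sketch below is only an illustration under assumed conditions: normally distributed true scores with standard deviation 2, independent errors with standard deviation 1 on each form, and helper names (`variance`, `correlation`) that are not part of the original text.

```python
import random


def variance(values):
    """Population variance (divisor N), as used in the derivation."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def correlation(xs, ys):
    """Product-moment correlation with divisor N, matching equation (17)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (variance(xs) ** 0.5 * variance(ys) ** 0.5)


rng = random.Random(0)                                # fixed seed, repeatable run
N = 20000
true = [rng.gauss(0.0, 2.0) for _ in range(N)]        # assumed true scores
test1 = [t + rng.gauss(0.0, 1.0) for t in true]       # parallel form 1
test2 = [t + rng.gauss(0.0, 1.0) for t in true]       # parallel form 2

r12 = correlation(test1, test2)
ratio = variance(true) / variance(test1)              # s_t^2 / s_x^2
print(round(r12, 2), round(ratio, 2))
```

With these assumed variances the theoretical value of both quantities is 4/5; the two sample values should agree with it and with each other to within sampling error.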
Since the reliability of a test, $r_{xx}$, is defined as the correlation between two parallel tests, we have

(24) $s_t^2 = r_{xx} s_x^2$.

Next we may solve for the error variance by substituting equation (24) in equation (16), obtaining

(25) $s_x^2 = s_e^2 + r_{xx} s_x^2$.

Solving equation (25) for $s_e^2$ gives

(26) $s_e^2 = s_x^2 (1 - r_{xx})$.

It is necessary to determine the standard error of the difference between two scores, $x_i - x_j$. To write the formula for this error, we use equation (11) and write

(27) $x_i - x_j = t_i - t_j + (e_i - e_j)$.

The term in the parentheses indicates the error. The variation of the observed difference from the true difference is denoted by

(28) $\sum (e_i - e_j)^2 = \sum e_i^2 + \sum e_j^2 - 2\sum e_i e_j$.

From (12), we see that the last term of equation (28) is zero. Substituting equation (26) in equation (28), we have

(29) $\sum (e_i - e_j)^2 = 2 N s_x^2 (1 - r_{xx})$.

Dividing by $N$ and taking the square root, we have

(30) $s_{(e_i - e_j)} = s_x \sqrt{2}\sqrt{1 - r_{xx}}$.

It should be noted that in the development of the equation for the standard error of a difference, no assumptions were made regarding the distribution of errors. However, in order to utilize this error to obtain reasonable limits for the value of the difference between true scores, some assumption regarding the frequency distribution of errors is necessary. Let us make the usual assumption that the distribution of errors is normal. For two individuals with a given score difference $x_i - x_j$, reasonable limits for the difference of true scores, $t_i - t_j$, may be taken as

(31) $x_i - x_j + 3\sqrt{2}\, s_x \sqrt{1 - r_{xx}} > t_i - t_j > x_i - x_j - 3\sqrt{2}\, s_x \sqrt{1 - r_{xx}}$.

If the above limits include zero, there is no significant difference between $t_i$ and $t_j$, since $t_i - t_j$ may be zero.
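The limits in (31) can be computed directly. The following Python sketch assumes the standard error of a difference is $s_x\sqrt{2}\sqrt{1 - r_{xx}}$, as in (30); the function names and the numerical values (scores of 60 and 45, $s_x = 5$, $r_{xx} = .92$) are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math


def difference_limits(x_i, x_j, s_x, r_xx):
    """Three-standard-error limits for t_i - t_j, following (30) and (31)."""
    s_diff = s_x * math.sqrt(2.0) * math.sqrt(1.0 - r_xx)
    observed = x_i - x_j
    return observed - 3.0 * s_diff, observed + 3.0 * s_diff


def is_significant(x_i, x_j, s_x, r_xx):
    """True when the limits exclude zero, so t_i - t_j cannot be zero."""
    low, high = difference_limits(x_i, x_j, s_x, r_xx)
    return low > 0.0 or high < 0.0


low, high = difference_limits(60.0, 45.0, s_x=5.0, r_xx=0.92)
print(round(low, 2), round(high, 2))
```

Here the standard error of the difference is 2 score points, so an observed difference of 15 gives limits of about 9 and 21, which exclude zero; an observed difference of 5 with the same test would not.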
Since true differences were assumed to exist between the individuals of the different groups, that is, $t_i - t_j > 0$, it follows that both of the limits of the above inequality must be positive and

(32) $t_i - t_j > x_i - x_j - 3 s_x \sqrt{2}\sqrt{1 - r_{xx}} \ge 0$.

Hence it is necessary that

(33) $x_i - x_j - 3 s_x \sqrt{2}\sqrt{1 - r_{xx}} \ge 0$

to be certain that true score differences exist between the individuals of group 1 and those of group 2.

If our selected test items were perfect for group 1 and chance operated for group 2, the groups would have the following means and variances:

(34)                Group 1    Group 2
     $\bar{x}$      $k$        $k/a$
     $s^2$          0          $(a-1)k/a^2$

It is quite apparent that no individual would be placed in the wrong group on the basis of chance errors.

The $k$ test items were selected so that the two groups have the means and variances given in (9) and (10). Since we have assumed that a real difference exists between the individuals of group 1 and the ones in group 2, we are saying that the true score of the individual at the low extreme of the group 1 distribution is greater than the true score of the highest scoring individual in the other group. If we let $i$ represent the individual from the high group and $j$ the one from the low group, we can express our assumption as $t_i - t_j > 0$. For our distributions we may be certain that

(35) $x_i \ge \frac{(a-1)k}{a} - \frac{3}{a}\sqrt{k(a-1)}$, and $x_j \le \frac{k}{a} + \frac{3}{a}\sqrt{k(a-1)}$.

Taking the maximum value for $x_j$ and the minimum value for $x_i$ and substituting these values in (33), with $s_x = \sqrt{k(a-1)}/a$ as the standard deviation of either group, we have

(36) $\frac{(a-1)k}{a} - \frac{3}{a}\sqrt{k(a-1)} - \frac{k}{a} - \frac{3}{a}\sqrt{k(a-1)} - \frac{3}{a}\sqrt{k(a-1)}\,\sqrt{2}\sqrt{1 - r_{xx}} \ge 0$.

Multiplying both sides of the inequality by $a$ and combining like terms, we have

(37) $(a-2)k \ge \sqrt{k(a-1)}\,\left[6 + 3\sqrt{2}\sqrt{1 - r_{xx}}\right]$.

Squaring both sides and simplifying, we have

(38) $k \ge \frac{(a-1)\left[6 + 3\sqrt{2}\sqrt{1 - r_{xx}}\right]^2}{(a-2)^2}$.

Considering the quantity $\sqrt{1 - r_{xx}}$ as one variable and $k$ as the other variable, the inequality (38) yields boundary values that represent a family of parabolas dependent upon the parameter $a$. The value of $k$ must always lie in the positive quadrant because positive square roots are taken.
Since $0 \le r_{xx} \le 1.00$, we are concerned only with the half parabolas. The value of $k$ which will meet the equality of (38) for a fixed $a$ lies on the curve defined by (38); all $k$ greater than this value of $k$ will satisfy (38).

Figure 1 indicates the number of test items needed for various combinations of test reliabilities and the number of alternatives for each test item. An example is given to illustrate the method of reading Figure 1. A four-choice item test with a reliability of .20 would require 72 items to efficiently separate the individuals into two groups; a three-choice item test would need a reliability of 1.00 to accomplish the same end. From Figure 1 it appears that the number of alternatives each item has is an important factor in the number of items required to perform the discrimination between the two groups; however, the test will usually have a reliability greater than .50, so it is apparent that the number of alternatives is not of great importance when the number of alternatives is five or more.

[Figure 1 not reproduced. Axes: $r_{xx}$ (reliability) and $k$ (number of items).]

FIGURE 1. The Number of Items Required to Make a Discrimination Between Two Groups as a Function of the Test Reliability and the Number of Alternatives for Each Item.

It follows that for any test of reliability $r_{xx}$, we may be reasonably certain, probability of 9,987.5 in 10,000, that $t_i - t_j > 0$ by selecting $k$ large enough so that (38) is satisfied. It is apparent that if the items were more homogeneous for group 1, the variance of the group would become smaller and the mean would approach $k$.

Comparison of the Bimodal Distribution and the Normal Distribution

When the probability of success of the group is a constant for each item, the scores for the group will form a Bernoulli distribution. The theoretical relative frequencies for the dichotomous situation are given by the terms of $(p + q)^k$. This type of distribution is characterized by the following functions:1

(40) $\bar{x} = kp$, $\sigma^2 = kpq$, $\alpha_3 = \frac{q - p}{\sqrt{kpq}}$, $\alpha_4 = \frac{1 - 6pq}{kpq} + 3$.

The skewness is positive for $p < 1/2$, negative for $p > 1/2$, and zero for $p = 1/2$. As $k$ approaches infinity, $\alpha_3$ tends to zero and $\alpha_4$ tends to 3.

1 Aitken, op. cit., pp. 19-50.

In the comparison let the range of the distributions be from 0 to $k$, and let the bimodal distribution be such that $\bar{x}_1 - 3\sqrt{k(a-1)}/a = \bar{x}_2 + 3\sqrt{k(a-1)}/a$, which means that the score distributions for the two groups intersect at a point. This point is the weighted average of their respective means; when the size of the two groups is equal, the point of intersection is equidistant between the two means. For the normal distribution we have $\bar{x} = k/2$ and $s_x = k/6$. Let us assume that for the bimodal distribution we have a standard error of measurement equal to the standard deviation of the group, or assume the reliability of the test is zero for the group. Let us also assume the entire distribution of each of the two groups in the bimodal case lies in the interval $\bar{x}_i \pm 3 s_i$ ($i = 1, 2$). For the normal distribution nearly all the cases lie in the interval $\bar{x} \pm 3 s_x$.

If we define a critical region about the critical point of partition between the two groups, we would have a certain percentage of the cases of each bimodal group within this band. For our assumed error of measurement equal to the standard deviation of one of the groups, the area under the curve for group 1 that is within this critical region is equal to the area under the normal probability curve between the $t$ values of 2 sigma and 3 sigma. This area is equal to .023 of the total area, and it follows that 2.3% of the individuals in group 1 lie within this critical region. The percentage of the individuals in group 2 that are within this region is also equal to 2.3% of the group. For the combined total, 2.3% of all of the individuals lie within this error band.
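The functions listed in (40) can be checked against a direct enumeration of the terms of $(p + q)^k$. The Python sketch below is an illustration only; it assumes (40) gives the mean $kp$, variance $kpq$, skewness $(q - p)/\sqrt{kpq}$, and kurtosis $3 + (1 - 6pq)/(kpq)$, and the function names and the values $k = 10$, $p = .25$ are hypothetical.

```python
import math


def binomial_functions(k, p):
    """Mean, variance, skewness and kurtosis from the assumed formulas of (40)."""
    q = 1.0 - p
    mean = k * p
    var = k * p * q
    alpha3 = (q - p) / math.sqrt(k * p * q)
    alpha4 = 3.0 + (1.0 - 6.0 * p * q) / (k * p * q)
    return mean, var, alpha3, alpha4


def enumerated_functions(k, p):
    """The same four quantities computed directly from the terms of (p + q)^k."""
    q = 1.0 - p
    probs = [math.comb(k, x) * p ** x * q ** (k - x) for x in range(k + 1)]
    mean = sum(x * w for x, w in enumerate(probs))

    def central(r):
        # r-th central moment of the score distribution
        return sum((x - mean) ** r * w for x, w in enumerate(probs))

    var = central(2)
    return mean, var, central(3) / var ** 1.5, central(4) / var ** 2


print(binomial_functions(10, 0.25))
print(enumerated_functions(10, 0.25))
```

The two lines of output should agree to within rounding error. With $p = .25$ the skewness is positive, in line with the statement that the skewness is positive when $p$ is less than 1/2.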
For the normal distribution to have as small a percentage of the cases within a critical region, it is necessary that

(41) $\int_{c-\sigma_e}^{c+\sigma_e} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt = .023$,

where $c$ is the critical point depending upon the proportions in each of the two parts of the normal curve. It is apparent from the Table of the Normal Curve and the standard error of measurement of a standard score1 that the reliability of the test necessary to obtain this accuracy is dependent upon the point of partition between the two groups. If the point of partition is in the tail of the normal distribution at a $t$ value of 2.00, this accuracy is attainable only if the reliability of the test is greater than .91. As the point of partition approaches the mean of the normal distribution, the reliability of the test must increase to maintain the same degree of accuracy; in the limiting case with a $t$ value of .00, the reliability of the test must be at least equal to .9991.

1 Standard error of measurement of a standard score: $\sigma_e = \sqrt{1 - r_{xx}}$.

If we assume that the two distributions have equal errors of measurement, we have the following areas under the curve in the critical region:

(42) Bimodal case: $\int_{-3.00}^{-2.00} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt + \int_{2.00}^{3.00} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt$

     Normal case: $\int_{c-s_e}^{c+s_e} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt$

where $c$ is the point of partition between the two groups. From the properties of the normal curve, it is immediately seen that -200 3-09 - 3 C995 2 ,1... -(x, ~59,“ _(X;‘Yz)& , €- ‘3."an HORNE/e 2” x' 621'??? e 2“ XZ