AN EMPIRICAL COMPARISON OF THREE DISTRIBUTIONS OF ITEM DIFFICULTY WITH RESPECT TO THE RELIABILITY AND VALIDITY OF THE RESULTING MEASURES

Thesis for the Degree of Ph.D.
MICHIGAN STATE UNIVERSITY
Alfred J. Reynolds
1965

This is to certify that the thesis entitled "An Empirical Comparison of Three Distributions of Item Difficulty with Respect to the Reliability and Validity of the Resulting Measures," presented by Alfred J. Reynolds, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Ed. Psych.

[signed] Major professor
Date: July 22, 1965

ABSTRACT

AN EMPIRICAL COMPARISON OF THREE DISTRIBUTIONS OF ITEM DIFFICULTY WITH RESPECT TO THE RELIABILITY AND VALIDITY OF THE RESULTING MEASURES

by Alfred J. Reynolds

The Problem

There is a discrepancy between the practice of test constructionists and that advocated by test theorists. Most test theorists advocate that item difficulties be concentrated near the mean ability level of the examinees whenever it appears likely that item intercorrelations are low. In practice, however, test constructionists continue to use items with a wide range of difficulty. It is the purpose of this study to determine which of three distributions of item difficulty, used in the construction of academic achievement tests, is most effective in terms of the homogeneity of test scores and their validity for grading purposes.

Procedure

Existing data from achievement tests were used to investigate the problem. Items from three term-end examinations were pooled. Items were selected from these pools to construct three 50-item experimental tests which represented a "Peaked," a "Rectangular," and a "Multimodal" distribution of item difficulties for each of two subject areas. Reliabilities were computed, and validities were determined, first by correlating total test scores with total instructor grades, and second by comparing the abilities of the tests to discriminate among criterion group means. Reliabilities and validities were compared statistically where possible and rationally where statistical comparisons did not seem appropriate.

Findings

1. The "Peaked Tests" tended to have larger reliabilities than the "Rectangular Tests" or the "Multimodal Tests."
2. The "Rectangular Tests" tended to have larger reliabilities than the "Multimodal Tests."
3. When the validating criterion was total instructor grade, the "Peaked Tests" had larger validities than either the "Rectangular" or "Multimodal" tests.
4. The "Rectangular Tests" correlated higher with total instructor grades than the "Multimodal Tests."
5. The "Peaked Tests" discriminated most effectively between criterion groups when there was moderate interitem correlation.
6. When interitem correlations are high, a greater spread of item difficulties will produce larger validities.
7. The "Rectangular Tests" were better discriminators than the "Multimodal Tests."

AN EMPIRICAL COMPARISON OF THREE DISTRIBUTIONS OF ITEM DIFFICULTY WITH RESPECT TO THE RELIABILITY AND VALIDITY OF THE RESULTING MEASURES

by Alfred J. Reynolds

A THESIS
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
College of Education
1966

ACKNOWLEDGMENTS
The author wishes to acknowledge his indebtedness to Dr. Joseph L. Saupe, thesis director and committee chairman, without whose counsel, guidance, and assistance this investigation could never have been completed. Appreciation is also extended to Dr. Willard G. Warrington, guidance committee member, for his helpful suggestions, and also to him and his staff at Evaluation Services, Michigan State University, who made possible the collection of data. The investigator is also indebted to both Dr. John E. Hunter and Dr. Edward B. Blackman for counsel, advice and willingness to serve on the guidance committee.

A special word of gratitude is due to the author's wife, Bette, for her assistance, encouragement, and patience, and also to Nancy, Becky, Al and Amy, who still love their father.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES

CHAPTER
I. INTRODUCTION
     Review of Related Literature
     The College Achievement Test
     Theoretical Implications of Previous Research
     The Setting
     The Problem
     Prospectus
II. PROCEDURES
     General Procedure
     Selection of Tests
     Selection of Students
     The Experimental Tests
     The Peaked Test
     The Rectangular Test
     The Multimodal Test
     Summary
III. RESULTS
     Reliabilities of Experimental Tests
     Validities of the Experimental Tests
     Comparison of Group Means
     Summary
IV. SUMMARY, CONCLUSIONS AND RECOMMENDATIONS
     Summary
     Limitations
     Conclusions
     Recommendations
BIBLIOGRAPHY
APPENDIX

LIST OF TABLES

I. Frequency Distributions of Item Difficulties for Natural Science and Social Science Item Pools
II. Frequency Distributions of Item Difficulties for Experimental Tests in Natural Science
III. Frequency Distributions of Item Difficulties for Experimental Tests in Social Science
IV. Mean Difficulties, Mean Discrimination Indices and Source of Items for Experimental Tests
V. Percentages of Students Receiving Each Instructor Grade for Natural Science
VI. Percentages of Students Receiving Each Instructor Grade for Social Science
VII. Percentages of Students Receiving Instructor Grades and Number of Items Needed at Each Level of Difficulty for the Natural Science Multimodal Test
VIII. Percentages of Students Receiving Instructor Grades and Number of Items Needed at Each Level of Difficulty for the Social Science Multimodal Test
IX. Range of Item Difficulty in Each Category for the Multimodal Tests
X. Means, Standard Deviations, and Reliabilities of the Experimental Tests
XI. Reliability Coefficients for Original and Experimental Tests Adjusted for Length
XII. Validities, Differences Between Them and t's for These Differences for the Experimental Tests in Natural Science and Social Science
XIII. Correlations Between Experimental Tests and Instructor Grades and Correlations Between Experimental Tests, Used in Computing "t's" for Significant Differences Between Validity Coefficients for Experimental Tests
XIV. Validity Coefficients for Original and Experimental Tests Adjusted for Length
XV. Numbers in Each Criterion Group, Group Means, and Group Standard Deviations for the Experimental Tests in Natural Science and Social Science
XVI. Differences Between the Means of Adjacent Criterion Groups on the Experimental Tests for Natural Science and Social Science
XVII. z's for the Differences Between Criterion Groups for the Experimental Tests in Natural Science and Social Science
XVIII. "z" Values for the Differences Between the "z's" of Group Differences of the Experimental Tests in Natural Science and Social Science

LIST OF FIGURES

1. Criterion Group Means for the Experimental Tests in Natural Science
2. Criterion Group Means for the Experimental Tests in Social Science
3. z's Between Adjacent Criterion Groups for the Experimental Tests in Natural Science
4. z's Between Adjacent Criterion Groups for the Experimental Tests in Social Science

CHAPTER I
INTRODUCTION

There is a discrepancy between the practice of test constructionists and that advocated by test theorists. Cronbach and Warrington (1952), Gulliksen (1945), Richardson (1936) and Ebel (1959) are generally agreed that maximum precision of measurement will result from homogeneous item difficulties, concentrated near the mean ability level of the examinees. Myers (1962:565) says, "those who produce standardized tests continue to follow a tradition that items selected for a test should represent a wide range in difficulty." Noll (1957) states that "a test with adequate range of difficulty should include items ranging from quite easy to fairly difficult." The Office of Evaluation Services at Michigan State University (1963) states that "it is generally desirable that item difficulty values vary from .20 to .80 so that an examination will discriminate at all levels." It is the purpose of this study to investigate this apparent paradox.

Review of Related Literature

The psychometric literature presents numerous articles dealing with the determination of appropriate distributions of item difficulty for specific purposes. Methods employed in investigation of this problem include rational analysis, empirical investigations with real data, and empirical studies using hypothetical data.

Thurstone (1932) analyzed real data resulting from the construction and administration of numerous diagnostic spelling tests. The results indicated that tests composed of items concentrated at the 50% difficulty level would yield results most meaningful in the diagnosis of spelling difficulties. Davis (1963:310), relying on the rational approach, said that "we can see intuitively, however, that since many kinds of test items have low intercorrelations, a distribution of item difficulties clustered around the 50% level would often approximate the distribution required to obtain maximum discrimination throughout the range of scores."

Cronbach and Warrington (1952) analyzed hypothetical data in an attempt to determine the effect that spread of item difficulty would have on screening efficiency for various degrees of item reliability. Consideration was also given to the most appropriate level of item difficulty for maximum screening validity of a multiple choice test. The data consisted of conditional probability matrices mathematically manipulated to yield validity coefficients.
The results indicated that the spread of item difficulty should vary directly as the intercorrelations of the items so that validity would be maximized over the whole range of scores. Validity at the extremes could be increased by either an increase in the spread of item difficulty or an increase in item unreliability. Cronbach and Warrington (1952) assumed that most tests of mental ability have low intercorrelations between items and therefore recommended that "constructors of educational and psychological tests would be wise to make item difficulty constant in most of their tests, since this lowers validity only for persons having extremely high or low ability." (p. 147)

Ebel (1959) used a theoretical model and constructed hypothetical tests to analyze the relationship between item difficulty and examinee scores. He pointed out that it is not necessary, for effective measurement, to widen the spread of item difficulties as the range of ability of the examinees increases. Elsewhere (1954) he stated that "where item intercorrelations are low selection of items whose difficulty is near 50% tend to flatten the score distribution, to increase the dispersion of scores and thus to improve the discrimination power of the test as a whole."

Richardson (1936) suggested that if a certain percent of the examinees are to be sorted out, tests composed of items of the same difficulty will be more valid than those made of items which vary in difficulty. He also indicated that these items should have a difficulty level which corresponds to the ability level of the examinees in the group that are to be accepted. Cronbach and Warrington (1952) also supported this position when they stated that "in order to design a test which rejects the poorest F percent of the men tested, items should on the average be located at or above the threshold for men whose true ability is at the Fth percentile." (p. 147)

The literature referred to above reveals that test theorists have been concerned chiefly with two types of problems. One was to build tests for selection purposes, which would divide a group into two parts, i.e., those to be accepted versus those to be rejected. The other problem was to discriminate among the examinees over the entire continuum of the specified behavioral characteristic.

A third problem is reflected in the following literature. It is concerned with the need for achievement tests which will separate a group of examinees into subgroups for marking purposes. Davis (1950:511) says that "the assignment of marks (which calls for the division of a group into several parts) demands maximum accuracy of measurement at the several dividing points scattered along the range of scores."

In regard to the construction of tests for various purposes, Gulliksen (1945:91) states that "whether it is actually best to concentrate all items at one difficulty level, --- or to distribute items over a difficulty range in accordance with present test practice, can be determined only by experiments such as those reported by Thurstone (1932) and Richardson (1936)."

Jackson (1952) engaged in such a study in an effort to develop a practical method for selecting items for a test which would separate examinees into subgroups for marking purposes so that there would be a minimal error of measurement at the critical division points. He developed an item analysis procedure which employed the use of chi-square for selecting items which discriminated between adjacent groups.
This procedure was validated with the use of data from achievement tests given at Michigan State University. The results indicated that the "adjacent group technique" proved a satisfactory method for selecting test items under the conditions of this experiment. Two limitations are apparent in this study. First, there was no real external criterion to use in computing a validity coefficient. Second, the ratio of items selected to those needed was too low (less than 7%) for this technique to be seriously considered as a procedure for practical test construction.

Myers (1962) attempted to ascertain which of two types of item difficulty distributions would produce greater reliability and validity. He used the items from a 150-item test designed to predict the scholastic aptitude of college students to construct four tests of 24 items each. Two "Peaked Tests" consisted of items whose difficulty was between 40% passing and 70% passing. The other two tests were "U shaped" and consisted of items whose difficulty was outside the range of those on the "Peaked Tests." Samples from twelve colleges were selected. Reliabilities were computed for each sample by correlating the two 24-item tests of each type. Validities resulted, in each sample, from correlating the test scores of freshmen with their average college grades. The results failed to show any statistically significant difference between the validities for the two types of distributions. The hypothesis that "Peaked Tests" yield the best reliability was given tentative support.

Several factors may have contributed to the inconclusiveness of Myers' (1962) study. The criterion used to compute validities consisted of the average grades received by freshmen. These students were selected from twelve different liberal arts colleges which differed widely, both academically and geographically. The sample size from each institution varied from 57 to 592. The samples were also selected in different ways. Reliabilities and validities were computed for the test results for each sample and were then compared for the two tests by using Wilcoxon's matched-pairs signed-ranks test. Myers considered each sample to be a matched pair since the same individuals in each sample responded to both tests. The assignment of equal weights, in the significance tests, to the results of samples which differed so much in size appears to be questionable. Other limitations of the study are the short, 24-item tests used to compute reliability and, finally, the fact that the items were chosen solely on the basis of difficulty indices, with only a reference, not clearly defined in the report, to a discrimination index.

The College Achievement Test

The college objective achievement test is becoming increasingly popular as a means for determining grades for students. The Committee on Measurement and Evaluation of the American Council on Education emphasizes this importance. It states that "test data frequently are not the sole determiners of course grades, but normally they make a relatively major contribution. --- the final examination may carry equal or greater weight than other work." (1959)
The objective standardized test has grown in popularity among college personnel to the extent that Educational Testing Service is now in the process of developing "Course Examinations, intended to measure end-of-course achievement in widely taught undergraduate courses, including technical and professional subjects offered for college credit." (1965) Innovations such as educational television, independent study, programmed teaching, team teaching, etc., have forced attention to the use of achievement examinations as evaluative devices.

Larger universities often use the scores from objective achievement tests to assist in assigning final course grades, or to give credit in lieu of taking specified basic courses within the university. Michigan State University is one which uses objective achievement test scores to determine 50% of the letter grade for thousands of students in Natural Science, Social Science, Humanities, and American Thought and Language courses.

Theoretical Implications of Previous Research

The psychometric literature indicates that the appropriate item difficulty distribution is a function of the use that is to be made of the test scores. Three types of item difficulty distribution are indicated for using achievement tests in the assignment of grades.

First, a "Rectangular" distribution results from the common sense approach. It is assumed that there is a rather wide range of ability among examinees and that these various levels of ability require appropriate levels of item difficulty for proper discrimination. Items must range from the very easy to the very difficult and include every level of ability. The frequency distribution of these item difficulties appears rectangular in shape, since there are few items in each category but there are many categories corresponding to the many levels of ability.

The "Peaked" distribution has received most theoretical and experimental support. It is assumed that if item intercorrelations are low, an item of 50% difficulty will make more discriminations than an item of any other difficulty. By concentrating all of the test items as near this level of difficulty as possible, it is assumed that maximum discrimination will result over the whole range of ability levels.¹

¹ A "normal" distribution of item difficulties would be intermediate between rectangular and peaked. Hence, by interpolation, the results of this study may be tentatively extended to other test types.

If the test is to be used to divide a group into well defined subgroups, discrimination among individuals near the critical division points is imperative (Davis 1965; Jackson 1952). Test theory implies that item difficulty be concentrated at these points, according to Richardson (1936) and Davis (1965).

A threefold problem is thus presented to the test constructor who wishes to build a test which will function in this manner. First, he must determine the position of the critical division points on the continuum of test scores. Second, he must determine the appropriate number of items to concentrate at each level. Finally, the difficulty levels of these items must be determined.

A solution to the first problem can be found in the assumption that a certain percentage of the examinees will receive a particular letter grade. If the desired percentages of students receiving each grade can be determined, then these percentages indicate the appropriate points of division on the score continuum.

The second problem is solved by the following reasoning. It is commonly accepted that the reliability of a test is a function of test length, provided that all items are somewhat equally effective.
Discrimination among a larger number of scores concentrated about a given score should require greater precision of measurement than would be the case where discrimination is necessary among a smaller number of scores concentrated in a given segment of the score continuum. It follows, then, that the number of items concentrated at each level must be proportional to the number of scores expected near this level. The proportion of each letter grade indicates the proportion of students to be retained in each score category. Knowing the number of items to be included in a particular test, the appropriate number of items to be selected for each level can be computed from equation (1):

    (1)    G x I = N

where:
    G = the percent of students receiving each letter grade,
    I = the total number of items included on the test, and
    N = the number of items desired for a particular grade level.

A solution to the problem of the appropriate item difficulty to concentrate at each division point is suggested by Lord (1952), Cronbach and Warrington (1952) and Davis (1965). The consensus is that if a given percent of a group is to be selected, item difficulty should be near that corresponding to the percentage of examinees to be retained. The percentage of students receiving a particular letter grade indicates the percentage of students to be retained in each category. But students are also to be retained for all categories above the one in question. Therefore, the total percentage of students retained at each division point can only be determined by summing the percentages of students in all categories above that point. This percentage will also indicate the difficulty of the items to be concentrated at that particular point, assuming that items of a given degree of difficulty discriminate most effectively at a corresponding degree of ability.

The commonly accepted phenomenon known as "regression toward the mean" implies that true item difficulties for a group of examinees will be located in the direction of the mean from the actual item difficulties computed from a sample group (Hays, 1963). Most effective discrimination for groups should result, therefore, from a limited range of item difficulty focused about the various division points, but skewed toward the mean.

A theory has been developed here which incorporates principles of both concentration and spread of item difficulty. Groups of items are concentrated, but the groups are of different size and are spread out in order to discriminate more adequately at different levels of ability. The application of this theory will result in the construction of a "Multimodal" test.
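The allocation rule can be made concrete with a short program. This is an illustrative sketch added for clarity, not part of the original study; the grade percentages are the Natural Science averages reported later in Table V, and the assumption that the last category absorbs the remaining items is inferred from the counts reported in Table VII.

```python
# Illustrative sketch of equation (1) and the cumulative-percentage rule.
# Grade percentages are the Natural Science averages from Table V;
# a 50-item test is assumed.

grade_percents = [("A", 11.7), ("B", 29.0), ("C", 40.8), ("D", 14.1)]
total_items = 50  # I in equation (1)

assigned = 0
cumulative = 0.0
for i, (grade, pct) in enumerate(grade_percents):
    if i < len(grade_percents) - 1:
        n_items = int(pct / 100 * total_items + 0.5)  # N = G x I, rounded
    else:
        n_items = total_items - assigned              # remainder to the last category
    assigned += n_items
    cumulative += pct   # percent retained at or above this division point
    print(f"{grade}: {n_items:2d} items concentrated near {round(cumulative)}% passing")
```

Run as written, the sketch reproduces the item counts (6, 15, 20, 9) and difficulty levels (12, 41, 82, 96) reported in Table VII.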
All students enrolled in a particular course for a given quarter are tested with the same instru- ment. Number grades are assigned solely on the basis of. the scores received on these tests. These number grades are then averaged with a number grade assigned by the instructor to determine the final letter grade assigned in a particular course. The instructor grades are assigned completely inde- pendently of the test scores. They are based on the student's performance with regard to his instructor's assignments, tests, recitations, etc. The instructor num— ber grade is assigned on a 15 point scale. It may be con- verted into a letter grade by use of the following code: 14. l, 2, or 3 equal F; 4, 5, or 6 equal D; 7, 8, or 9 equal 0; 10, 11, or 12 equal B; 13, 14, or 15 equal A. mm Since the objective achievement test plays an es- sential role in the determination of college student's grades, it is important that these tests be constructed in such a way as to make their results function most efficient- 1y. It has been shown that a significant factor in deter. mining the efficiency of a test, for a specified purpose, was the distribution of item difficulty. It is, therefore, the purpose of this study to com. ' pare the effectiyeness of using three different distribu- tions of item difficulty, in the construction of academic achievement tests, in terms of the homogeneity of the scores and their validity for grading purposes. This study used achievement test data, available from University College term.end examinations at Michigan State University, to investigate the problem. Three ex. perimental tests were constructed for each of two subject areas. These tests represented three different types of item difficulty distributions, namely, (1) ”Peaked", (2) “Rectangular", and (5) "Hultimodal". The relative effec- tiveness of these tests was Judged by: l. The level of internal consistancy as determined by Kuder Richardson #20, 16. 2. The degree of correlation with instructor grades, and 3. The ability to discriminate among instructor- grade groups. The item difficulties used in this study were based on a stratified sample of fifty students in the upper twenty seven percent of the distribution of total test scores and a stratified sample of fifty students in the lower twenty seven percent of the distribution of total test scores. The samples of fifty students were chosen so that they possessed approximately the same distributions of scores as the larger groups from which they were chosen. Item difficulties consisted of the total preportion of students answering the item correctly in both the upper and lower sample groups. W The following chapter outlines the general plan of the eXperiment. The initial test data and subjects are dis- cussed and the ”Peaked“, ”Rectangular” and 'Multimodal' experimental tests are described. Chapter III presents and discusses the analyses of the results of the experi- mental tests. Reliabilities are compared rationally and validities are statistically compared both as to correla— tion with instructor grades and as to ability to discrim— inate among grade groups. The final Chapter summarizes the procedure and findings of the investigation. It also points out the limitations of the study and offers conclu— sions and recommendations. CHAPTER II PROCEDURES The purpose of this chapter is to outline the general plan of the eXperiment. The initial test data and subjects used in the study will be described. 
Finally, the experimental tests developed for, and used in, the study will be discussed.

General Procedure

Available data from achievement tests were used to investigate the relative effectiveness of the three item difficulty distributions in separating examinees into groups for grading purposes. Term-end examinations from two subject areas were selected. A rather large pool of items was needed in order to build an experimental test of the required item difficulty distribution and also of a satisfactory length. The items from three term-end tests, given in sequence in each subject area to the same students, were therefore combined to form item pools. Items for the experimental tests were taken from these pools. The tests selected were those which had been given in Natural Science and in Social Science for three successive terms, i.e., Fall 1963, Winter 1964 and Spring 1964. Students normally take the three courses in sequence, starting in the Fall and finishing in the Spring.

Item analyses were obtained for each of these tests. Difficulty and discrimination indices were taken from these analyses. The difficulty indices were used in selecting items for the experimental tests. The discrimination indices were used to reject undesirable items and to keep the items of each test as similar as possible in regard to this statistic.

The item discrimination index available on the items used in this study was determined by use of the table prepared for this purpose by Flanagan (1939). This index is an estimate of the product-moment correlation coefficient between an item and the total test score. The proportions of successes in both the lowest and highest 27 percents of stratified random samples of examinees were used in entering Flanagan's table.

The students used in the study were those who had taken all three terms of each subject in the proper sequence (Fall 1963, Winter 1964 and Spring 1964) and had also received an instructor grade for each term. Answer sheets for these students were obtained and rescored for each of the three experimental tests. The scores from each type of experimental test for all three terms were combined to yield a total score.

Reliability coefficients were computed for each of the experimental tests by using the Kuder-Richardson #20 formula. Rational comparisons were made between these reliability coefficients. They were also corrected for length and compared with the reliabilities of the original tests.

Instructor grades for each student for each of the three terms were obtained and added together for a total instructor grade. These composite instructor grades served as the criterion for comparing the validities of the experimental tests. Validities of a first type were estimated by the product-moment correlation coefficient computed between total instructor grade and total test score. Statistical tests were used to compare these validities.

As a second validity analysis, groups of students who had received an average instructor grade of A, B, C, or D for all three terms were identified. Test score means were computed for each of these groups for each of the three experimental tests. Adjacent group means were compared statistically in order to determine which experimental test most adequately discriminated among instructor-grade groups.
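The item statistics just described can be illustrated with a brief sketch. This is an illustration added for clarity, not the study's actual computation; all names are hypothetical, and Flanagan's (1939) chart itself is not reproduced, so the sketch only prepares its two inputs.

```python
# Illustrative computation of the item statistics described above.
# `upper` and `lower` are 0/1 response vectors for one item, from
# stratified samples of 50 students drawn from the upper and lower
# 27 percent of the total-score distribution.

def item_statistics(upper, lower):
    p_upper = sum(upper) / len(upper)   # proportion correct, upper 27% sample
    p_lower = sum(lower) / len(lower)   # proportion correct, lower 27% sample
    # Difficulty index used in the study: total proportion correct in the
    # combined upper and lower samples.
    difficulty = (sum(upper) + sum(lower)) / (len(upper) + len(lower))
    # Flanagan's (1939) chart converts (p_upper, p_lower) into an estimate
    # of the item-total product-moment r; the chart is not reproduced here.
    return difficulty, p_upper, p_lower

# Example with fabricated responses for a single item:
upper = [1] * 38 + [0] * 12   # 38 of 50 high scorers answered correctly
lower = [1] * 19 + [0] * 31   # 19 of 50 low scorers answered correctly
d, pu, pl = item_statistics(upper, lower)
print(f"difficulty = {d:.2f}, p_upper = {pu:.2f}, p_lower = {pl:.2f}")
```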
The series consisted of the term-end examin- ations for Fall 1963, Winter 1964 and Spring 1964 for both courses. The Natural Science examinations each contained 125 items for a total pool of 375 items. The Social Sci- ence tests contained 100 items in the Fall, 110 in the Winter, and 120 for the Spring quarter. This gave a total of 330 items for the Social Science item pool. The aver— age reliability as determined by Kuder Richardson #20, was .87 for the Natural Science tests and .80 for the Social Science tests. The average validity, which represented a correlation with instructor grades, was .71 for the Natural Science tests and .62 for the Social Science tests. After elimination of items in the Natural Science item pool which had discrimination indices of less than .25; 250 items remained. The difficulty indices of these items ranged from .09 to .97. The frequency distribution of the item difficulties for these items is given in . Table 1. Social Science items were eliminated from the item pool if they had a discrimination index less than .20. This reduced the pool to 251 items. The item difficulties of these items ranged from .08 to .98. The frequency dis- tribution of these items difficulties is also given in Table l. TABLE I. Frequency Distributions of Item Difficulties for Natural Science and Social Science Item P0018. W Difficulty Natural Science Social Science .ss;.9e 19 :55 .76—.85 45 43 .66—.75 52 ‘ 47 .56-.65 58 45 .46..55 43 38 .36—.45 31 31 .26..35 16 14 .16—.25 13 3 0-.15 4 3 TOTAL "'2'5'6 “'23? . Both the Natural Science and Social Science exam. inations were essentially power tests. Even though a time limit was imposed, most of the examinees responded to every item. Both tests were of the multiple choice type. The Natural Science test had five choices for each item. The Social Science items each had four choices. Answers were recorded on IBM form I.T.S., 1000 B 4701. Test papers were carefully checked for marking more than one answer per item. These were excluded from the study. The remaining answer sheets were then scored on an IBM 21. scoring machine. The score on-a test was the total num— ber of correct responses since no correction was made for guessing. WQLW Students who were enrolled in the Natural Science sequence; N.S. 181 - Fall 1963, N.S. 182 - Winter 1964, and N.S. 183 - Spring 1964, at Michigan State University comprised the subjects for part of this study. The remain. ing subjects were those students who enrolled in the Social Science sequence; 8.8. 231 - Fall 1963, 8.8. 232 - Winter 1964, and 8.5. 233 - Spring 1964.1 Students who did not take all three examinations, who did not receive an instruc- tor grade for all of the courses in the sequence, or who used Form B on the Fall term—end examination in both sub. ject areas, were eliminated from the study. Most of the students were college freshman and sOphomores. There were 5168 students who took the Fall Natural Science examination, 4408 took the Winter test, and 3371 were tested at the end of the Spring quarter. A total of 1423 students were available who had taken all three Nat. ural Science examinations, received an instructor grade for each course and used Form A on the Fall term examina— tion. 1 Some students may have been in both courses. 22. The term-end examinations in Social Science were taken by 3189 students in the Fall, by 2698 students in the Winter, and by 2295 students in the Spring. Of these, 909 students were available, who had taken all three exam. 
inations received instructor grades for each course, and used Form A in the Fall term examination. Eng Experimental Test: Three experimental tests of 50 items each were developed for use in both the Natural Science and the Social Science areas. The items for these tests were taken from the respective item pools which resulted from combining the items of the term—end examinations in these two subjects. Items were chosen which had the largest discrimination index and the appropriate difficulty level for the test being developed. Item discrimination indices were balanced as much as possible. Attention was also given to balancing the number of items taken from the Fall, Winter, and Spring term examinations. Tables gl,and 11; present the resulting distribution of item difficulties for the three tests in each area. Table I! shows the mean item difficulties, item discrimination indices, and indicates the number of items taken from the Fall, Winter, and Spring term—end examina— tions. 23. TABLE II. Frequency Distributions of Item Difficulties for Experimental Tests in Natural Science. Difficulty Peaked Rectangular Multimodal Range Test Test Test .86-1.00 - 7 9 .76- .85 - 6 20 .66. .75 — 6 - .56- .65 25 7 .. .46— .55 25 7 - .36— .45 - 6 15 .26. .35 - 6 .16. .25 - 6 2 0— .15 - - 4 TOTAL ‘35 ‘33 “5'5 The. £24m its}; It was possible to construct 'Peaked“ tests for both subject areas from the item pools.' The items ranged in difficulty from .46 to .63 for the Natural Science test and from .48 to .63 for the Social Science test. The composition of these tests with respect to item difficulty, discrimination index and the designation of the test from which the item came, are given in the Appendix. TABLE III. Frequency Distributions of Item Difficulties for EXperimental Tests in Social Science. Difficulty Peaked Rectangular Multimodal Range Test Test Test .86—l.00 - 6 10 .76- .85 - 8 22 .66- .75 — 5 - .56- .65 25 7 - .46- .55 25 9 - .36— .45 - 7 10 .26. .35 — 6 3 .16. .25 - 2 2 0- .15 - - 3 TOTAL "'55 ”5'6 "'58 The Rggtangglar Test Sufficient items were available in the item pools to construct a ”Rectangular Test" for each subject area. Few items having the same difficulty index were used in either test. The range of difficulty for the Natural Sci— ence test was from .23 to .93. For the Social Science test it was from .22 to .92. The composition of the tests with respect to item difficulty, discrimination index, and source of items is given in the Appendix. 25. TABLE IV. Mean Difficulties, Mean Discrimination Indices and Source of Items for EXperimental Tests. Test Mean .Mean Fall Winter Spring Difficulty Discrim. NATURAL SCIENCE Peaked .53 .48 17 16 17 Rectangular .56 .46 16 16 18 Multimodal .64 .44 16 16 18 80 CIAL SCIEN CE Peaked .55 .39 16 16 18 Rectangular . 58 . 40 17 15 18 Multimodal .64 .37 16 14 20 Ing_Mg1§;modgl 22g; Consistent with the theory for constructing a ”Multimodal Test“ (p. 9-12), a test of this nature was constructed for each subject area. The instructor grades used in the construction of these tests were assigned independently of the term-end examinations in the basic college courses. They were reported on a 15 point scale. A small percentage of the grades other than these, i.e.,‘ deferred or incomplete were excluded from the study. These percentages were averaged for the three quarters involved in the study and these data are given in Table,! and‘ll. 26. TABLE V. Percentage of Students Receiving Each Instruc- tor Grade for Natural Science. 
Grade    Fall    Winter    Spring    Average
A        12.0     12.2      11.0       11.7
B        28.3     29.9      28.8       29.0
C        39.6     41.6      41.3       40.8
D        14.2     12.9      15.1       14.1
F         4.1      2.7       2.9        3.2

TABLE VI. Percentages of Students Receiving Each Instructor Grade for Social Science.

Grade    Fall    Winter    Spring    Average
A         9.2      9.1       8.8        9.0
B        24.8     24.5      26.9       25.4
C        44.6     45.2      43.5       44.4
D        15.6     16.0      16.5       16.3
F         4.9      4.5       3.4        4.3

Tables VII and VIII present the number of items and the respective difficulty level needed to construct a 50-item "Multimodal Test" for Natural and Social Science. The average percentages of instructor grades were taken from Tables V and VI. The number of items needed in each category was computed from equation (1) (p. 11).

TABLE VII. Percentages of Students Receiving Instructor Grades and Number of Items Needed at Each Level of Difficulty for the Natural Science Multimodal Test.

Instructor    Percentage    Number of    Difficulty
Grade         Receiving       Items        Level
A                11.7            6            12
B                29.0           15            41
C                40.8           20            82
D                14.1            9            96

TABLE VIII. Percentages of Students Receiving Instructor Grades and Number of Items Needed at Each Level of Difficulty for the Social Science Multimodal Test.

Instructor    Percentage    Number of    Difficulty
Grade         Receiving       Items        Level
A                 9.0            5             9
B                25.4           13            34
C                44.4           22            79
D                16.3           10            95

In the actual construction of the tests, the distribution of item difficulties was skewed toward the mean. The resulting ranges of item difficulty for each indicated level are given in Table IX. The data concerning the actual items included in both tests are given in the Appendix.

TABLE IX. Range of Item Difficulty in Each Category for the Multimodal Tests.

NATURAL SCIENCE                 SOCIAL SCIENCE
Indicated     Actual Range      Indicated     Actual Range
Difficulty    of Difficulty     Difficulty    of Difficulty
.96            .89-.97           .95           .91-.95
.82            .78-.82           .79           .75-.79
.41            .41-.45           .34           .35-.39
.12            .12-.20           .09           .08-.22

Summary

In this phase of the investigation the plan of the experiment was considered. The term-end examinations used in the study were discussed, and the development and characteristics of the three experimental tests, "Peaked," "Rectangular," and "Multimodal," were described.

CHAPTER III
RESULTS

The purpose of this chapter is to present and discuss the results of the analyses described earlier, from using the three experimental achievement tests in two different subject areas. Each of these tests represents a different type of item difficulty distribution, namely, (1) "Peaked," (2) "Rectangular," and (3) "Multimodal." Each test was constructed and scored from data available on term-end achievement examinations at Michigan State University.

Reliabilities of Experimental Tests

Reliabilities for the experimental tests were computed from the formula developed by Kuder and Richardson and reported by Gulliksen (1962). This formula is

    r_xx = [K / (K - 1)] [1 - (Σ s_g^2) / s_x^2]

where r_xx is the reliability coefficient of the test, K is the number of items in the test, s_g^2 is the variance of item g (equal to p_g(1 - p_g), where p_g is the proportion getting the item correct), and s_x^2 is the test variance.
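The computation can be sketched directly from these definitions. The sketch below is an illustration added here, not the study's program, and the numbers in the example are placeholders rather than values from the study.

```python
# Minimal Kuder-Richardson #20 sketch using the quantities defined above:
# `difficulties` holds the proportion passing each item (p_g), and
# `test_variance` is s_x^2.

def kr20(difficulties, test_variance):
    k = len(difficulties)                              # K, number of items
    item_var = sum(p * (1 - p) for p in difficulties)  # sum of s_g^2 = p_g(1 - p_g)
    return (k / (k - 1)) * (1 - item_var / test_variance)

difficulties = [0.50, 0.55, 0.60, 0.48, 0.52]  # illustrative p_g values
print(round(kr20(difficulties, test_variance=2.8), 2))
```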
The use of these item difficulties in the Kuder Richardson #20 formula seems justified if it can be assumed that the method used to compute them is defensi- ble, that the average ability levels of the three groups used in these computations are approximately equal to that of the bxperimental group, and that variance of item dif- ficulties is not seriously affected. Flanagan (1939) defended the method used in compu- ting the item difficulties (see page 15) for he said that, “In practice it appears that frequently it is satisfactory to use the values obtained from this chart together with an index of difficulty found by averaging the difficulties for the upper and lower groups“. Although a number of students with low ability failed to complete the sequence of courses used in this in- vestigation, a number of those with superior ability also elected not to complete the entire sequence, since they passed examinations in lieu of taking the final courses in the sequence. Attrition at both ends of the ability 31. continuum is also indicated in Tables gland‘ll. It can be seen that there is a decline in the percentage of both A‘s and F's received, from Fall term to Spring term in both subject areas. This attrition of the tap and low ability groups did not greatly affect the average ability of the groups and, therefore, for the purpose of this study, the average abilities of the three groups were con- sidered to be the same. It is evident from the Kuder Richardson #20 formula that test reliability is dependent upon the variance of item difficulties and Tucker (1949) has indicated that while the mean item difficulty might change the variance proba- bly would not. He said that “Estimates can be made of item variance from experimental forms or that it might even be possible to guess a practical value of item vari- ance from editorial judgement." Since the assumptions regarding method of computa- tion of the original difficulty indices, equality of the groups involved and variance of item difficulties did not appear'to be seriously violated, item difficulties used in constructing the experimental tests were used in the Kuder Richardson #ZO-reliability formula. The resulting relia- bility coefficient appear in Table E, The test means, and standard deviations for the experimental tests are also given in Table 22. 32. TABLE X. Means, Standard Deviations, and Reliabilities of the Experimental Tests. Test Mean Standard Deviation Reliability NATURAL SCIENCE Peaked 27.74 7.53 .80 Rectangular 26.89 5.41 .68 Multimodal 31.77 4.64 .63 SOCIAL SCIENCE Peaked 27.97 6.14 .69 Rectangular 28.78 5.69 .70 Nultimodal 32.39 4.41 .59 Evidence that the item difficulties Operated as expected can be ascertained by an inspection of the eXperi— mental test score means as they appear in Table X and. -a comparihg them with the average difficulties of these tests as they appear in Table,1!. (page 25) The means of the tests tend to descent in order of magnitude from a high in the 'Multimodal Test“ to a low in the 'Peaked Test“. 'This tendency is a reflection of the fact that the average item difficulty for the tests vary in the same direction. The "Multimodal Test” was easiest with a mean item difficulty of .64 while the other tests were more difficult, having mean item difficulties in the low .50's. Although there is a reversal in this tendency in the Natural Science area involving the “Peakedfand the "Rectangular“ tests, this 35. 
reversal is undoubtedly only apparent since there is no significant difference, at the .05 level for a two-tailed test, between the means of the scores for these two tests. ( t . .014; d.f. = 1422) The reliability coefficients appearing in Table 5 are indices of internal consistency, computed on the same sample, and the author is aware of no statistical pro« cedure for determining whether or not the differences among them'are statistically significant. A rational an. alysis of data pertaining to the reliabilities of the ex- perimental tests will, therefore, be presented. Inspection of Table §_reveals a general tendency for the reliabilities to descend in order of magnitude from the ”Peaked Test“ to the "Multimodal Test“ with the relia- bility of the “Rectangular Test“ falling between these two. The pattern of the standard deviations of the experimental tests supports this observation, since they descend consis- tently, for both subject areas, in the same order suggested by the reliabilities. This is even true in the Social Sci- ence area where the standard deviation of the 'Peaked Test' exceeds that of the 'Rectangular Test“ even though the mag- nitude of the reliability coefficients is reversed thus reflecting the fact that total item variance for the 'Peaked Test'I was also greater. The rank order of the size of the standard deviations of the experimental tests supports the 34. hypothesis that the 'Peaked Tests” were most reliable since it is generally true that a test which spreads out examinees farthest on the score continuum, is most precise in measuring the amount of the behavioral char- acteristic being assessed. (Saupe 1961) There is one discrepancy in the general pattern of the reliability coefficients. In the Social Science area the IRectangular Test" has a larger reliability coefficient than the I'Peaked Test". This difference is email however, being only .01. The actual difference in favor of the 'Peaked Test“ in the Natural Science area is twelve times as large as the difference in favor of the “Rectangular Test“ in the Social Science area. An analysis of the function that discrimination indices have in determining test reliability is also relevant to the interpretation of the discrepancy in the Social Science Area. Gulliksen (1962:379) has shown that “the reliability of the test can be increased only by making the average item variance smaller or the aver. age item reliability index larger“, and has presented the following formula showing the relationship: 35. (“7L _ — , where K is the number of test items, (s3) is the average item variance, and ;_—3— is the average item reliability xg 3 index. Table ll,(p. 25) reveals that the average discrim. ination index for the "Rectangular Test" in the Social Science areas is .01 larger than that of the 'Peaked Test”. It seems reasonable to conclude that this increase in the average discrimination index would result in an increase in the average reliability index. According to Gulliksen, this increase in the aver- age reliability index would function to increase the reli- ability of the “Rectangular Test” over that of the ”Peaked Test". The average item variance was also smaller for the "Rectangular Test“ thus it too functioned to increase the reliability of this test. These two variables both cperated in the same direction and produced an increase of only .01 in the reliability of the ”Rectangular Test" in the Social Science area. 
The meager influence of the small differences in average discrimination index and average item variances were reflections of the fact that 36. average item variance was free to vary only from near 0 to .25, average item standard deviation from near 0 to .5, and average item discrimination index from .20 to .68 making it possible to have average item reliability indices somewhere between .10 and .34. It was, therefore, concluded that small changes in parameters could have but little influence on total test reliability as long as the number of items remained constant. It could also be con- cluded that although the average discrimination index of the ”Peaked Test”, in the Natural Science area, was .04 higher than the "Rectangular Test", this would not.func- tion to account for the .12 difference in the reliabili- ties of these tests as shown in Table E, Comparison of the reliabilities of the eXperimental tests with the average reliabilities of the original tests presents difficulties beyond the lack of the statistical test indicated earlier. Items with low indices of dis- crimination were systematically eliminated from the exper- imental tests and this fact alone should have caused them to have higher reliability coefficients. Statistical com. parison is also hampered by the differences in length between the eXperimental tests and the originals. In . spite of these limitations, a rational comparison between them is indicated in order that the original tests might serve as bench marks for evaluation of the experimentalmsts. 37. In order that this comparison be made as mean- ingful as possible, the reliabilities of the eXperimental tests were adjusted for length by using the Spearman— Brown formula developed for this purpose and reported by Cronbach (1960). This formula is as follows: r; n :1 lFF(n - l) r where rn is the reliability of the lengthened test, I r is the reliability of the original test, and n is the ratio of the new test length to that of the original test. For the Natural Science experimental tests, I'n“ became 2.5 since the original tests each contained 125 items while the experimental tests consisted of 50 items. “n“ was set equal to 2.2 for the Social Science experi— mental tests, since the average length of the original tests was 110 items, while the experimental tests contain- ed 50 items each. Reliabilities resulting from these computations are given in Table §1_along with the average reliabilities of the original tests. The statistics given in Table 31, indicate that only the reliability of the ”Peaked Test“ exceeded that. of the original test in the Natural Science area. In ,2 38. In the Social Science area both the 'Peaked Test” and the "Rectangular Test" had reliabilities of greater magni— tude than the original tests. TABLE XI. Reliability Coefficients for Original and Experimental Tests Adjusted for Length. Original Peaked Rectangular Multimodal Test Test Test Test NATURAL SCIENCE .87 .91 .84 .81 SOCIAL SCIENCE .80 .83 .84 .76 Eslissiias.sf.tthEassziasaislulsaia One of the criteria for judging which of the three experimental tests discriminates most effectively, is the correlation with instructor grades. Pearson product-moment correlation coefficients were computed between total instruc— tor number grade and total test score for each of the three experimental tests. (see p. 18) These coefficients are given in Table £LI_along with the differences between them. TABLE XII. 
validities, Differences Between Them and t's for these differences, for the Experimental Tests in Natural Science and Social Science. P a n (P - a) t (P - n) t (a - M) ' t N.S. .81 .75 .61 .06 6.46 .20 20.42 .14“ 12.3s 9.9. .65 .59 .49 .06 3.26 .16 9.29 .10 4.67 39. A statistical test for the significance of dif- ferences between correlation coefficients, has been re- ported by Lindquist (1940). This test is appropriate where two or more tests have been correlated for the same group of subjects with the same variable. In the present study the three eXperimental tests were all cor— related with the same student's instructor grades. This test of significance was, therefore, applied to the dif- ferences between the validity coefficients of the experi- mental tests used in this investigation. The "t's" given in Table fig; were used to test the null hypotheses that the pOpulation correlations between instructor grades and test scores were the same for each pair of eXperimental tests. Following is the formula UBede t a (r12 - rlfil Il-§2 \llfL r23 \/ 2 \/1 .. r52 - £3 - r35 + 2612) (’13) (1‘23) where r is the correlation between two variables, and n is the number of cases. Data used in the computation of these t's is given in Table £111. Significance was determined by referral to the “Table of t” in Edwards (1955). The degrees of freedom, appropriate to this procedure, are equal to (n - 3). Since the null hypotheses required a two tailed test, and since the level of significance was set at .05, the table shows that with 1420 degrees of freedom, a I't" value equal to or greater than 1.96 is required in order that the null hypotheses be rejected. TABLE XIII. Correlations Between Experimental Tests and Instructor Grades and Correlations Between Experimental Tests, Used in Computing 't's“ for Significant Differences Between alidity Coefficients for Experimental Tests. 1‘ r“ rml 1‘ 1‘ 1‘ Pl pr pm mr NATURAL SCIENCE .81 .75 .61 .77 .69 .66 SOCIAL SCIENCE .65 .59 .49 .51 .53 .46 Inspection of the 't's' in Table £2;_reveals that all of the ”t” values exceed the value of 1.96 necessary for rejecting the null hypotheses. Since the correlation . coefficients descend in level of magnitude from the ”Peaked Test“ to the "Multimodal Test", and since all of the dif- ferences in the sizes of the validities are significant, it may be concluded that the ”Peaked Test” is a better dis- criminator than either the “Rectangular Test" or the “Mul— timodal Test". It also follows that the “Rectangular Test” is better than the ”Nultimodal Test“ in this respect. A comparison of the validities of the eXperimental 41. tests with those of the original tests should give some indication of these tests' relative ability to discrimi— nate between examinees on the behavioral characteristic being measured.‘ A statistical comparison between these validities is not possible since no test is known to the author which can be used to determine whether or not sig- nificant differences exist among them. In order to make the rational comparison as meaningful as possible the validities of the eXperimental tests were adjusted for length. This was accomplished by using the following formula, reported by Thorndike (1963). 1‘ r,,,,- A (1 - ) va;+ where ron is the validity of the lengthened test, r is the validity of the original test, 01 r11 is the reliability of the original test, and, n is the ratio of the length of the length— ened test to that of the original test. The original Natural Science tests were each com. posed of 125 items. 
Since the experimental tests contain. ed 50 items each, ”n” was equated to 2.5 for computing the adjusted validity coefficients of the experimental test scores in this area. 42. In the Social Science area the three original tests contained 100, 120, and 110 items respectively, for an average of 110 items. "n" was, therefore, set equal to 2.2 for computing the adjusted validities of the Social Science experimental tests, since each contained 50 items. The validities of the eXperimental tests corrected for length appear in Table 2911. The average validities of the original tests used to supply items for the experi— mental tests were also given in Table 131. TABLE XIV. Validity Coefficients for Original and Experi- mental Tests Adjusted for Length. Original Peaked Rectangular Multimodal Test Test Test . Test NATURAL SCIENCE .71 .86 .83 .65 SOCIAL SCIENCE .62 , .72 .64 .56 It is assumed that the items needed to increase the length of the experimental tests would be similar to the existing items of the experimental tests. The average re— liability index for the items of the experimental tests would be larger than those of the original tests since items with a low index of discrimination were systematical- ly eliminated from the experimental tests. According to Gulliksen (1962) this would tend to decrease the validity 43. of the experimental tests since the validity of a test is equal to the ratio of its average validity index to its average reliability index. It can, therefore, be concluded from Table $11 that the validity coefficients for the ”Peaked' and “Rectangular” tests are greater, in both areas, than are those of the original tests. iwmmm As stated in Chapter I, one of the two methods by which validities were to be compared in this investigation is to determine which of three distributions of item dif— ficulty, used in constructing an achievement test, will most effectively separate a group of examinees into sub. groups for grading purposes. In order to answer this question the subjects used in this investigation were sep- arated into criterion groups according to the sum of the numerical grades assigned by their instructors for the three terms being considered. Significance tests were performed to determine which of the experimental tests pro- duced scores best able to discriminate between adjacent groups. Students were assigned to criterion groups on the basis of total instructor grades as follows (see p. 18). 44. www WM 38 thru 45 29 thru 37 20 thru 28 11 thru 19 3 thru 10 endow» The results of this procedure are shown in Table ,E!, The numbers of individuals in each group are listed under N in this table. No 'F” group appears in the table, since only one individual was assigned to this group and that was in the area of Social Science. This result was expected since students rarely continue through the en- tire three course sequence if they fail the first one or two courses of that sequence. Table L! also lists the criterion group score means for each of the eXperimental testes Figures 1 and 2 present these criterion group means in graphic form. The relative slopes of these lines as well as the relative sIOpes of the short segments connect- ing the mean score points of each group give an indication of the relative distances between the means. If it is assumed that group standard deviations, are not signifi- cantly different for each criterion group, then the slopes of these lines and also the slapes of the line segments should indicate the ability of the corresponding tests to 45. 
discriminate between groups.

TABLE XV. Numbers in Each Criterion Group, Group Means, and Group Standard Deviations for the Experimental Tests in Natural Science and Social Science.

Criterion              PEAKED           RECTANGULAR        MULTIMODAL
Groups        N      Mean   s.d.       Mean   s.d.        Mean   s.d.

NATURAL SCIENCE
A             96     39.30  4.22       34.66  3.77        37.30  3.79
B            459     32.41  5.35       30.38  3.68        34.47  2.85
C            725     25.02  5.64       24.80  4.12        30.36  4.05
D            143     18.75  4.54       21.20  3.98        26.68  3.76

SOCIAL SCIENCE
A             54     37.24  4.33       37.70  4.91        38.52  3.90
B            272     31.92  4.76       31.78  4.04        34.72  3.05
C            475     25.76  4.94       26.94  5.09        30.91  3.99
D            107     23.13  4.21       24.84  3.66        29.53  3.76

Inspection of Figure 1 reveals that the "Peaked Test" has both an over-all greater slope and also steeper line segments between each criterion group than the other experimental tests in the Natural Science area. These facts give tentative support to the hypothesis that the "Peaked Test" was the best discriminator between criterion groups.

Fig. 1 Criterion Group Means for the Experimental Tests in Natural Science (mean score plotted against criterion group D, C, B, A; P = Peaked Test, R = Rectangular Test, M = Multimodal Test)

Fig. 2 Criterion Group Means for the Experimental Tests in Social Science (mean score plotted against criterion group D, C, B, A; P = Peaked Test, R = Rectangular Test, M = Multimodal Test)

It is also of interest to note that the segment from "C" to "B" for the "Peaked Test" has a greater slope than any other segment on the chart. Examination of Table XV reveals that these two groups also have larger standard deviations than any of the other groups. Whether or not these larger dispersions of scores within the groups, and hence greater overlap between them, will seriously affect the ability of the test to discriminate between the criterion groups can only be determined by a significance test. This test will follow this discussion of Figures 1 and 2.

Examination of Figure 2 reveals that apparently the "Peaked" and "Rectangular" tests in the Social Science area are both better discriminators among the criterion groups than the "Multimodal Test", since both have over-all greater slopes. The "Peaked Test" seems to be better in differentiating "C's" from "B's", while the "Rectangular Test" seems to discriminate more effectively between group "A" and group "B". These observations can only be tentative, since a statistical procedure using the variances of group test scores is necessary in order to determine whether or not these differences are actually significant. A test is also desirable in order to determine which differences are significant.

The statistical tests indicated above were performed in order to determine whether or not criterion group mean differences were significant. These differences between adjacent criterion group score means were computed and are listed in Table XVI. These differences were converted to z's by using the following formula:

z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{N_1 s_1^2 + N_2 s_2^2}{N_1 N_2}}}

where \bar{X}_1 and \bar{X}_2 are the means of two groups, N_1 and N_2 are the numbers of individuals in groups 1 and 2, and s_1^2 and s_2^2 are the variances of the scores of groups 1 and 2. The values for \bar{X}_1 - \bar{X}_2 were taken from Table XVI. Table XV lists the values for N. The values of s^2 can also be computed from the s.d.'s in this table. The values for the resulting z's are listed in Table XVII.
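Because Table XV supplies every quantity in this formula, the z's of Table XVII can be checked directly. The following minimal Python sketch is not part of the original study; it simply re-applies the formula above to one row of Table XV:

```python
import math

def adjacent_group_z(mean1: float, mean2: float,
                     n1: int, n2: int,
                     sd1: float, sd2: float) -> float:
    """Critical ratio for the difference between two adjacent
    criterion-group means, using the formula given above."""
    se = math.sqrt((n1 * sd1**2 + n2 * sd2**2) / (n1 * n2))
    return (mean1 - mean2) / se

# Natural Science, Peaked Test, groups A and B (Table XV):
z_ab = adjacent_group_z(39.30, 32.41, 96, 459, 4.22, 5.35)
print(round(z_ab, 2))  # 11.87, matching the first cell of Table XVII
```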
The null hypothesis being tested is that the population mean of a group's test scores is equal to the population mean of an adjacent group's test scores. If the level of significance is set at .05 and a one-tailed test is performed, the table of the normal curve indicates that a "z" value equal to or larger than 1.65 will permit the rejection of the null hypothesis. An examination of the "z" values in Table XVII reveals that all of them are larger than this value. It may be concluded, therefore, that significant differences exist among the population means of adjacent letter group test scores on all of the experimental tests.

TABLE XVI. Differences Between the Means of Adjacent Criterion Groups on the Experimental Tests for Natural Science and Social Science.

Criterion       Peaked    Rectangular    Multimodal
Groups           Test         Test           Test

NATURAL SCIENCE
A - B            6.89         4.28           2.83
B - C            7.39         5.58           4.11
C - D            6.27         3.60           3.68

SOCIAL SCIENCE
A - B            5.32         5.92           3.80
B - C            6.16         4.84           3.81
C - D            6.27         3.60           3.78

1 The more precise t-test could have been used in making these comparisons. The z's were used, however, because they were the values that were computed in the following analysis, because the numbers of cases were generally large, and because the z's were so large that it was clear the more precise test would lead to the same conclusion.

TABLE XVII. z's for the Differences Between Criterion Groups for the Experimental Tests in Natural Science and Social Science.

Criterion       Peaked    Rectangular    Multimodal
Groups           Test         Test           Test

NATURAL SCIENCE
A - B           11.87        10.19           8.32
B - C           22.39        23.25          18.68
C - D           12.54         9.73          10.22

SOCIAL SCIENCE
A - B            7.60         9.40           7.92
B - C           16.65        13.44          13.61
C - D           12.06         6.92           9.00

Figures 3 and 4 are visual representations of the critical ratios for criterion group differences on each experimental test and were taken from Table XVII. The z's were plotted on the ordinate and, assuming that all other group parameters were equal, the highest ordinate for each group difference indicated the corresponding test best able to discriminate between those particular groups. In keeping with this rationale, Figure 3 reveals that for the area of Natural Science the "Rectangular Test" discriminated best between the "B" and "C" groups, and that the "Peaked Test" discriminated most effectively "A's" from "B's" and "C's" from "D's".

In the Social Science area, Figure 4 shows the "Peaked Test" to differentiate best between "C's" and "D's" and between "B's" and "C's". The "Rectangular Test" discriminated best between group A and group B. These observations were based on the assumption that group sizes, variances, and covariances were equal. Since this assumption was violated, a statistical procedure was necessary in order to determine which experimental test discriminated most effectively between adjacent criterion groups.

A procedure for this purpose was developed by Saupe (1965). He called it an "approximate, large sample test for comparing the ability of two measures to discriminate between two groups". This test may be expressed by the formula:

z_3 = \frac{z_1 - z_2}{\sqrt{\,2 - 2\,\dfrac{N_B\,C(X_{1A},X_{2A}) + N_A\,C(X_{1B},X_{2B})}{\sqrt{(N_B s_{1A}^2 + N_A s_{1B}^2)(N_B s_{2A}^2 + N_A s_{2B}^2)}}\,}}

where z_1 and z_2 are critical ratios of the difference to the standard error of the difference between the mean scores of two adjacent groups, N is the number of individuals in each group, s^2 is the variance of the scores of a group on one of the measures, and C(X_{1A}, X_{2A}) is the covariance of the scores on the two measures within a group.
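Saupe's paper is unpublished, and the formula above is reconstructed here from a badly degraded printing, so the following Python sketch should be read only as an illustration of the general logic (two correlated critical ratios compared through their estimated correlation) under that reconstruction, not as Saupe's authoritative procedure. In the example call, the z's, group sizes, and standard deviations are taken from Tables XV and XVII for the Natural Science A-B comparison; the within-group covariances c_a and c_b are invented, since the thesis does not print them.

```python
import math

def saupe_z3(z1: float, z2: float,
             na: int, nb: int,
             s1a: float, s1b: float, s2a: float, s2b: float,
             c_a: float, c_b: float) -> float:
    """Approximate large-sample comparison of two measures' ability to
    discriminate between groups A and B (reconstruction of Saupe, 1965).

    z1, z2     -- critical ratios of the A-B mean difference on measures 1 and 2
    s1a ... s2b -- standard deviations of each measure within each group
    c_a, c_b   -- covariances of the two measures within groups A and B
    """
    # Estimated correlation between the two critical ratios.
    num = nb * c_a + na * c_b
    den = math.sqrt((nb * s1a**2 + na * s1b**2) * (nb * s2a**2 + na * s2b**2))
    rho = num / den
    return (z1 - z2) / math.sqrt(2.0 - 2.0 * rho)

# Hypothetical illustration; the result depends on the invented covariances.
print(saupe_z3(z1=11.87, z2=10.19, na=96, nb=459,
               s1a=4.22, s1b=5.35, s2a=3.77, s2b=3.68,
               c_a=12.0, c_b=15.0))
```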
Fig. 3 z's Between Adjacent Criterion Groups for the Experimental Tests in Natural Science (z's plotted against the group pairs C-D, B-C, A-B; P = Peaked Test, R = Rectangular Test, M = Multimodal Test)

Fig. 4 z's Between Adjacent Criterion Groups for the Experimental Tests in Social Science (z's plotted against the group pairs C-D, B-C, A-B; P = Peaked Test, R = Rectangular Test, M = Multimodal Test)

The values obtained from substituting the appropriate values in this formula are given in Table XVIII. Saupe (1965) has assumed that these z_3 values are normally distributed. A two-tailed test was used to test the null hypotheses that the experimental tests, in both subject areas, were equally effective in discriminating between adjacent criterion groups. The significance level was set at .05, and the table of the normal curve indicated that a value of 1.96 or larger, or -1.96 or smaller, was necessary in order to reject the null hypotheses. An asterisk appears above and to the right of the values in Table XVIII that are beyond these limits. It may be concluded that those values having an asterisk represent differences between critical ratios which are statistically significant. The corresponding experimental tests can be assumed to be the best discriminators for the groups and areas indicated.

An examination of Table XVIII shows that of the twelve comparisons of the "Peaked Test" with the other experimental tests, eight proved to be significantly in favor of the "Peaked Test". Of the four comparisons which were not significant, two were in favor of the "Rectangular Test", and one of these approached significance.

When compared only with the "Rectangular Test", three of the six comparisons were significantly in favor of the "Peaked Test". One other was in that direction, but not significant. The other two comparisons favored the "Rectangular Test", and one of these approached significance.

TABLE XVIII. "z_3" Values for the Differences Between the "z's" of Group Differences of the Experimental Tests in Natural Science and Social Science.

Criterion       z_p - z_r    z_p - z_m    z_r - z_m
Groups

NATURAL SCIENCE
A's - B's         1.67         3.48*         1.46
B's - C's         -.89         3.28*         4.13*
C's - D's         2.40*        2.11*         -.46

SOCIAL SCIENCE
A's - B's        -1.94         -.26          1.96*
B's - C's         2.63*        2.53*         -.14
C's - D's         4.02*        2.59*        -6.06*

The "Rectangular Test" proved to be a better discriminator than the "Multimodal Test" in two cases. Only in one case was the "Multimodal Test" significantly better than the "Rectangular Test", and in no case was it statistically superior to the "Peaked Test".

The general pattern of the z's involving "Peaked Tests" in Table XVIII seems to support the findings of the Cronbach and Warrington study (1952). They concluded that a "Peaked Test" should be more valid except for examinees of extremely high or low ability, assuming low interitem correlations. This would imply a rise in the ability of the "Peaked Tests" to discriminate most effectively among the middle groups. This ability is indicated by Table XVIII and Figure 4, for the "Peaked Tests", since the measures (z's) of ability to discriminate do rise for the middle instructor grade groups. Figure 3 failed to reveal this trend. No "F" groups were available for this study and, therefore, it can only be inferred that if this theory can account for the results of the "Peaked Tests", then the z's for the "D - F" groups should decline in magnitude.
The only z involving a "Peaked Test" which tends to negate this theory is that for discriminating between the B and C groups in the Natural Science area. In this instance a greater actual difference occurred between the group mean scores of the "B - C" groups on the "Peaked Test" than on the "Rectangular Test". However, when these differences were converted to critical ratios, the critical ratio of the "Rectangular Test" was larger, although not significantly so. An explanation for this phenomenon is indicated by the fact that the variances of these two groups of test scores are roughly twice as large for the "Peaked Test" as for the "Rectangular Test". Those for the "B" groups are 28.66 and 13.53, while those for the "C" groups are 31.88 and 17.04 for the "Peaked" and "Rectangular" tests respectively. These large group variances indicate a high degree of overlap for the scores of the adjacent criterion groups. This functioned to decrease the significance of the difference between the means of these groups.

Since the test variance is equal to the square of the sum of the item reliability indices (Gulliksen, 1962), the larger variances for groups B and C imply that for these groups in the Natural Science area the items of the "Peaked Test" had relatively large interitem correlations. This increase in the homogeneity of the items also indicates that the "Peaked Test" became much more reliable for the two groups, with the result that there was an accompanying decrease in the validity for these criterion groups. The explanation for this apparent paradox is that in practice, as the reliability of a test increases, the validity also increases up to a certain point; then, as reliability continues to increase, validity decreases. Since validity is usually computed from a complex criterion, it should not appear strange that a test having a high degree of item homogeneity should be a poor predictor for a criterion heterogeneous in nature. In reality, as a test becomes more reliable the specificity of measurement increases and usually the number of factors being measured decreases. If the validating criterion consists of only those factors being measured by the test, then an increase in reliability could be expected to bring about an accompanying increase in validity. On the other hand, if an increase in reliability results in measuring fewer of the factors relevant to the validating criterion, an improvement in validity cannot be expected. Apparently this latter case is what happened to the "Peaked Test" for the "B" and "C" groups in the Natural Science area.

The Cronbach and Warrington position has taken this phenomenon into account. It advocated a widening of the range of item difficulty when high correlations existed among items. This position would, therefore, account for the lack of validity of the "Peaked Test" in the Natural Science area, for the "B" and "C" criterion groups, on the basis that the high interitem correlations for these groups required a greater spread in item difficulty. This position would also explain the greater validity of the "Rectangular Test" for these groups in this area. If it can be assumed that the interitem correlations for this test were similar to those of the "Peaked Test", then the greater dispersion of item difficulty in the "Rectangular Test" would be expected to produce greater validity.
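The Gulliksen relation invoked in this explanation can be illustrated numerically. The sketch below is not from the thesis; the item difficulties and the constant item-test correlation are hypothetical, chosen only to show how difficulties concentrated near .5 enlarge item standard deviations, and with them the test variance, when discrimination is held constant.

```python
import math

def test_variance(difficulties, item_test_corrs):
    """Gulliksen (1962): sigma_x^2 = (sum of item reliability indices)^2,
    where an item's reliability index is r_ix * s_i and s_i = sqrt(p(1-p))."""
    total = sum(r * math.sqrt(p * (1 - p))
                for p, r in zip(difficulties, item_test_corrs))
    return total ** 2

r = [0.5] * 10  # hypothetical constant item-test correlation
peaked = [0.50, 0.52, 0.48, 0.55, 0.45, 0.50, 0.53, 0.47, 0.51, 0.49]
spread = [0.10, 0.25, 0.35, 0.45, 0.50, 0.55, 0.65, 0.75, 0.90, 0.95]

print(test_variance(peaked, r) > test_variance(spread, r))  # True
```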
The evidence in this section indicated that the "Peaked Test" was more effective in discriminating among criterion groups than the "Rectangular Test" in three of the four comparisons involving groups in the middle range of ability. It was assumed that these results were due to moderate interitem correlations for the criterion groups involved. There was some indication that the "Rectangular Test" was a better discriminator between adjacent criterion groups than the "Peaked Test" for the other comparison involving groups in the middle range of ability. This result was assumed to be a reflection of high interitem correlations for the groups involved. Of the two comparisons of groups available at the extremes of ability, one favored the "Peaked Test" while the other favored the "Rectangular Test". Neither comparison revealed a significant difference. The "Multimodal Tests" apparently were the poorest discriminators of the experimental tests.

Summary

In this phase of the investigation data gathered from the experiment were presented and analyzed. The results were compared statistically where possible and rationally where no statistical test was available.

A rational examination of the reliability coefficients of the experimental tests indicated that the "Peaked Test" was most reliable in the Natural Science area. No real difference between the reliabilities of the "Peaked Test" and the "Rectangular Test" was apparent in the area of Social Science. The "Multimodal Test" apparently was the poorest of the three experimental tests in regard to reliability. The "Peaked Test" had an adjusted reliability coefficient larger than that of the original test in the area of Natural Science. Both the "Peaked" and the "Rectangular" tests had larger adjusted reliabilities in the Social Science area than the original tests.

Statistical evidence indicated that the "Peaked Test" produced the highest validity coefficients of the experimental tests when instructor grades were used as the validating criterion. The "Rectangular Test" was next, and the "Multimodal Test" was last in this respect. The validities of the experimental tests were adjusted for length, and it was assumed that items added would have characteristics similar to those of the existing items. These adjusted validities of both the "Peaked" and the "Rectangular" tests exceeded the validities of the original tests in both subject areas.

When instructor-grade groups were used as the validating criteria, the "Peaked Test" was found to be the best discriminator between most of the adjacent criterion groups in the middle of the ability range. Groups were available only for the upper extremes of ability, and comparisons between test score means reveal one critical ratio in favor of the "Peaked Tests" and one in favor of the "Rectangular Test". The "Multimodal Test" apparently was the poorest discriminator of the experimental tests.

CHAPTER IV

SUMMARY, CONCLUSIONS AND RECOMMENDATIONS

Summary

The purpose of this investigation was to determine which of three distributions of item difficulty, used to construct achievement tests, would be most effective in terms of the homogeneity of their scores and their validities for grading purposes. Achievement test data for two subject areas were available at Michigan State University. These data were used to investigate the problem. The items from three of these tests, given in sequence to the same students, were combined to form an item pool for each of the two subject areas.
Item analyses were available for these tests and provided item difficulties and discrimination indices, which were used in the construction of 50-item experimental tests. A "Peaked Test", in which item difficulty ranged from .46 to .63, a "Rectangular Test", with a range in item difficulty from .22 to .93, and a "Multimodal Test", with items concentrated at four levels of difficulty, were constructed for each subject area.

Reliability coefficients were determined for each test by use of the Kuder-Richardson #20 formula. These coefficients were compared with each other rationally. The "Peaked Test" was found to be most reliable in the Natural Science area. No real difference was apparent between the reliabilities of the "Peaked" and the "Rectangular" tests in the Social Science area, although both had higher reliability coefficients than that of the "Multimodal" test.

The reliabilities of the experimental tests were corrected for length and compared with those of the original tests. It was acknowledged that items with low discrimination indices on the original tests were not included in the experimental tests and that this would tend to increase the reliabilities of the experimental tests. Only the "Peaked Test" had an adjusted reliability coefficient higher than that of the original test in the Natural Science area. Both the "Peaked" and the "Rectangular" tests were more reliable in the Social Science area. The "Rectangular Test" in this area had a reliability coefficient only slightly larger than that of the "Peaked Test".

The validity of the experimental tests was determined in two ways. First, the total test score for each of the experimental tests was correlated with a criterion which consisted of the total instructor grade for a three-course sequence. The resulting validity coefficients were compared statistically with each other and were also adjusted for length to correspond with the lengths of the original tests and then compared rationally. The statistical comparisons showed that the "Peaked Tests" produced higher validity coefficients than any of the other experimental tests. The "Rectangular Test" was superior to the "Multimodal Test" in this respect.

Rational comparisons of adjusted experimental test validities with original test validities assumed that the items added to make the experimental tests as long as the original tests were similar to the existing items, and that the elimination from the experimental tests of items with low discrimination indices would tend to decrease their validity. It can therefore be concluded that validity coefficients for the "Peaked" and "Rectangular" tests were greater in both areas than were those of the original tests.

The second means of determining validity was to compare the relative abilities of the experimental tests to discriminate between adjacent instructor grade groups. This was accomplished by computing critical ratios for the differences between adjacent grade groups for each experimental test. A statistical comparison of these critical ratios indicated that in a majority of the comparisons the "Peaked Test" discriminated most effectively between adjacent criterion groups in the middle of the range of ability. Of two comparisons for the upper extremes of ability, one favored the "Peaked Test" and one favored the "Rectangular Test". The "Multimodal Test" apparently was the poorest discriminator of the experimental tests.
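The Kuder-Richardson #20 coefficient named in this summary is computed from scored (0/1) item responses as shown in the following Python sketch. The sketch and its four-examinee response matrix are illustrative inventions, not data from the study.

```python
def kr20(responses):
    """Kuder-Richardson formula #20.
    `responses` is a list of examinee records, each a list of 0/1 item scores."""
    n_items = len(responses[0])
    n_people = len(responses)
    # Sum of item variances p_i * q_i, from the item difficulties p_i.
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(person[i] for person in responses) / n_people
        sum_pq += p * (1 - p)
    # Variance of the total scores.
    totals = [sum(person) for person in responses]
    mean = sum(totals) / n_people
    var = sum((t - mean) ** 2 for t in totals) / n_people
    return (n_items / (n_items - 1)) * (1 - sum_pq / var)

# Hypothetical scored responses for four examinees on five items:
data = [[1, 1, 1, 0, 1],
        [1, 0, 1, 0, 0],
        [0, 1, 0, 0, 1],
        [1, 1, 1, 1, 1]]
print(round(kr20(data), 3))
```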
Limitations

The main purpose of this study was to determine by empirical means which of three distributions of item difficulty, used in the construction of achievement examinations, would result in the most useful instruments for grading purposes. Conclusions and recommendations from this study are tempered by a number of limitations which should be pointed out.

This investigation was performed with data available on academic achievement tests in courses at the college level. These data were compiled under practical conditions as they existed in the actual construction of college achievement examinations for large numbers of students. Item statistics were estimated from a sample taken from the tails of the distribution of test scores (a sketch of this kind of procedure follows at the end of this section). While this procedure provides statistics which can be used in practical situations with some justification, its use in this study necessitates limiting the inferences of the results to similar situations.

The construction of 50-item experimental achievement tests, whose items had the desired characteristics, required a large supply of items that could be accumulated only by the pooling of items from three examinations. This procedure resulted in the attrition of students of low ability and in the complete loss of an "F" instructor grade group. Some students of very high ability also were lost to the study. The assumption was made that the ability levels of the groups were not greatly affected by these losses. However, lack of statistical evidence of this fact also limits the inferences which can be made from the results of this study, since item statistics for each course were computed for the different groups rather than for a total group which could be shown to have an ability level comparable to that of the experimental group.

The pooling of results from three courses also resulted in the combining of instructor grades. In some cases students may have been taught by the same instructor for all three terms, and in other cases a different teacher may have conducted a student's class each term. The large numbers of students involved in these three courses required the services of a large number of instructors, whose subjective evaluations and personalities certainly had some influence on the grades which they assigned. Whether or not this had some biasing effect on the actual instructor grades used in this investigation is not known.

Comparison of test results was hampered in the case of the reliability coefficients, since there was a lack of appropriate statistical tests which would lead to more definitive conclusions regarding these results. The comparison of adjacent grade group means was made by an approximate test, which makes acceptance of the results somewhat tentative in nature.

A further limitation was the lack of enough items with the precise characteristics desired in order to construct each experimental test in complete accordance with its respective theory. However, in practice this condition also exists, since it is difficult to obtain a plentiful supply of useful items.
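For readers unfamiliar with the tail-sampling procedure mentioned in this section: the thesis does not reproduce its exact computing routine, so the Python sketch below shows only one common variant in the tradition of Kelley (1939) and Flanagan (1939), using upper and lower 27 percent scoring groups. The 27 percent fraction and the upper-minus-lower discrimination index are assumptions for illustration, not the study's documented method.

```python
def tail_item_stats(responses, totals, fraction=0.27):
    """Estimate item difficulty and discrimination from the upper and
    lower tails of the total-score distribution.

    responses -- list of 0/1 item-score lists, one per examinee
    totals    -- total test score for each examinee (same order)
    """
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    k = max(1, int(round(fraction * len(totals))))
    lower, upper = order[:k], order[-k:]
    n_items = len(responses[0])
    stats = []
    for item in range(n_items):
        p_u = sum(responses[i][item] for i in upper) / k
        p_l = sum(responses[i][item] for i in lower) / k
        # Difficulty estimated as the mean of the tail proportions;
        # discrimination as the upper-lower difference (the D index).
        stats.append(((p_u + p_l) / 2, p_u - p_l))
    return stats
```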
Conclusions

The conclusions of this study are dependent upon the assumption that item characteristics other than item difficulty, such as discrimination indices, reliability indices, and validity indices, were the same for the different types of tests.

In so far as the techniques employed in this investigation may be justified, the following conclusions seem defensible:

1. College achievement tests which have items concentrated within a small range of difficulty, somewhere near the mean ability level of the group, have a tendency to produce larger reliability coefficients than either tests which have items covering a wide range of difficulty levels or tests whose items are concentrated at four different ability levels. A major factor accounting for this is the greater item variance for items near the mean ability level of the group. This results in a larger item standard deviation and hence a greater reliability index, provided that discrimination indices are held constant, and the result is a larger reliability coefficient.

2. College achievement tests which have a wide range of item difficulties have a tendency to be more reliable than those whose items are concentrated at four different difficulty levels not including the middle range. This conclusion results from the fact that omission of items near the mean level of difficulty tends to decrease item reliability indices, which are dependent upon item standard deviations as well as discrimination indices. If item discrimination indices are held constant, a decrease in item standard deviation results in a decline in the reliability index and, hence, in a smaller reliability coefficient.

3. Validity coefficients for college achievement tests (correlations with independently assigned instructor grades) will be larger for tests whose items are concentrated near the level of mean group ability than for those tests whose items are concentrated at four different levels, or for tests with a wide range of item difficulty, when the validating criterion is instructor grades. This conclusion assumes that interitem correlations are similar to those used in this study.

4. College achievement tests composed of a wide range of item difficulties will correlate higher with instructor grades than will tests whose item difficulties are concentrated at four different levels, assuming that interitem correlations are similar to those used in this investigation.

5. College achievement tests with a small range of item difficulties concentrated near the average ability level of the examinees, and with moderate interitem correlations, have a tendency to discriminate more effectively among instructor grade groups than those tests whose item difficulties either have a wide range or are concentrated at four levels. This is especially true for the groups in the middle ranges of ability and is dependent upon interitem correlations being similar to those used in this study in the area of Social Science.

6. College achievement tests having a relatively high degree of interitem correlation will discriminate more effectively when the variance of item difficulties is greater than when items are concentrated near the mean level of difficulty. Test validity is a function of the sum of the variance of item unreliability and the variance of item difficulties. Validity is maximum for one score when this sum is small, but validity increases for a wider range of scores as this sum of variances increases, up to a certain point. Therefore, when interitem correlations are high (low unreliability), a compensatingly larger variance of item difficulty is necessary in order to increase the sum of variances and hence improve the validity of the test for a number of scores.
7. College achievement tests whose item difficulties cover a wide range discriminate more effectively among groups than those whose items are concentrated at four ability levels other than near the mean level of ability. This results from the fact that validity is dependent upon the ratio of average validity index to average reliability index. Since the average reliability indices of the two tests are assumed to be equal, the magnitudes of the validities of the tests are dependent upon the corresponding sizes of the average validity indices. The major discernible difference between the two tests is the inclusion of items in the middle range of difficulty on one test and their omission on the other test. It is, therefore, concluded that the test containing items near the mean level of difficulty has a higher validity due to the influence of these items in increasing the average validity index.

Recommendations

In so far as the techniques employed in this study may be valid, the following recommendations seem justified:

1. If it can be assumed that items available for achievement tests have item characteristics similar to the items used in this investigation, it is recommended that in the construction of achievement tests as many of the items as possible be located near the mean ability level of the examinees.

2. The validities of the "Peaked Tests" were clearly superior when individual test scores were correlated with instructor grades. The results of comparing criterion group means presented some evidence that they were also the best discriminators among some of these groups. There was also an indication that the "Rectangular Tests" were somewhat effective in this respect. A study is, therefore, recommended which would determine empirically whether or not a distribution of item difficulties intermediate between "Peaked" and "Rectangular", "Normal" in shape, would be more effective than either the "Peaked" or the "Rectangular" distribution in discriminating among the whole range of criterion groups.

3. This study was designed to investigate the relationship which exists between item difficulties and the criteria which instructors use for grading purposes. The assumption was made that item difficulties were not related to item content. The extent to which this assumption is invalid will affect the accuracy of the results of this investigation. It is, therefore, recommended that a study be undertaken to investigate the extent and nature of any relationship which may exist between item content and item difficulty.

BIBLIOGRAPHY

Brogden, H. E. "Variations in Test Validity with Variation in the Distribution of Item Difficulties, Number of Items, and Degree of Their Intercorrelation." Psychometrika, 1946, 11, 197-214.

Brogden, H. E. "On the Interpretation of the Correlation Coefficient as a Measure of Predictive Efficiency." Journal of Educational Psychology, 1946, 37, 65-76.

Committee on Measurement and Evaluation. College Testing. Washington: American Council on Education, 1949, p. 40.

Cronbach, L. J. and Warrington, W. G. "Efficiency of Multiple-Choice Tests as a Function of Spread of Item Difficulties." Psychometrika, June 1952, 17, No. 2, 127-147.

Cronbach, L. J. Essentials of Psychological Testing. New York: Harper and Brothers, 1960, p. 131.

Cook, W. W. "The Functions of Measurement in the Facilitation of Learning." In E. F. Lindquist (ed.), Educational Measurement, Washington: American Council on Education, 1963, 3-46.
Davis, F. B. "Item Selection Techniques." In E. F. Lindquist (ed.), Educational Measurement, Washington: American Council on Education, 1963, 119-158.

Dressel, P. L. Evaluation in Higher Education. Boston: Houghton Mifflin Co., 1961.

Edwards, A. L. Statistical Methods for the Behavioral Sciences. New York: Rinehart and Co., 1955, p. 501.

Flanagan, J. C. "A Table of the Values of the Product Moment Coefficient of Correlation in a Normal Bivariate Population Corresponding to Given Proportions of Successes."

Flanagan, J. C. "General Considerations in the Selection of Test Items and a Short Method of Estimating the Product Moment Coefficient from Data at the Tails of the Distribution." Journal of Educational Psychology, 1939, 30, 674-680.

Ebel, R. L. "Procedures for the Analysis of Classroom Tests." Educational and Psychological Measurement, 1954, 14, p. 359.

Ebel, R. L. "Item Difficulty and Test Effectiveness." Presented to the AERA, Atlantic City, Feb. 1959.

Gulliksen, H. "The Relation of Item Difficulty and Interitem Correlation to Test Variance and Reliability." Psychometrika, 1945, 10, 79-91.

Gulliksen, H. Theory of Mental Tests. New York: John Wiley and Sons, Inc., 1962, p. 223.

Hays, W. L. Statistics for Psychologists. New York: Holt, Rinehart and Winston, 1963, pp. 500-501.

Hull, C. L. Aptitude Testing. Yonkers-on-Hudson: World Book, 1928.

Jackson, R. A. "An Item Analysis Technique Based Upon Adjacent Group Differences." Unpublished Ed.D. Thesis, Michigan State University, East Lansing, Mich., 1952.

Kelley, T. L. Statistical Method. New York: Macmillan, 1923.

Kelley, T. L. "The Selection of Upper and Lower Groups for the Validation of Test Items." Journal of Educational Psychology, 1939, 30, 17-24.

Lindquist, E. F. Statistical Analysis in Educational Research. Boston: Houghton Mifflin, 1940.

Lindquist, E. F. "Preliminary Considerations in Objective Test Construction." In Educational Measurement, Washington: American Council on Education, 1963, 119-158.

Lord, F. M. "The Relation of the Reliability of Multiple-Choice Tests to the Distribution of Item Difficulties." Psychometrika, 1952, XVII, 181-194.

Myers, C. T. "The Relationship Between Item Difficulty and Test Validity and Reliability." Educational and Psychological Measurement, 1962, 22, 565-571.

Noll, V. H. Introduction to Educational Measurement. Boston: Houghton Mifflin Co., 1957, p. 1 0.

Richardson, M. W. "Relation Between the Difficulty and the Differential Validity of a Test." Psychometrika, 1936, I, 35-49.

Saupe, J. "A Significance Test for Comparing the Ability of Two Measures to Discriminate Between Two Groups." Unpublished paper. East Lansing, Mich., 1965.

Saupe, J. "Technical Considerations in Measurement." In P. L. Dressel (ed.), Evaluation in Higher Education, Boston: Houghton Mifflin Co., 1961, p. 444.

Thurstone, T. G. "The Difficulty of a Test and Its Diagnostic Value." Journal of Educational Psychology, 1932, XXIII, 335-343.

Thorndike, R. L. "Reliability." In E. F. Lindquist (ed.), Educational Measurement, Washington: American Council on Education, 1963, p. 608.

Traxler, A. E. Measurement and Evaluation in the Improvement of Education. Washington: American Council on Education, 1951.

Tucker, L. R. "A Note on the Estimation of Test Reliability by the Kuder-Richardson Formula #20." Psychometrika, 1949, XIV, 117-119.

Tyler, R. W. Constructing Achievement Tests. Columbus: Ohio State University, 1934.

Annual Report. Princeton: Educational Testing Service, 1963.

"Item Analysis Summary of University College Examinations." Unpublished Bulletin. East Lansing, Mich.: Office of Evaluation Services, Michigan State University, July 1, 1964.

APPENDIX
NATURAL SCIENCE

Peaked Test

    Item No.  Diff.  Disc.        Item No.  Diff.  Disc.
 1. S105      .46    .45      26. S26       .56    .60
 2. S54       .46    .52      27. W102      .56    .76
 3. '2        .46    .49      28. F10       .56    .49
 4. F69       .47    .58      29. W59       .57    .47
 5. W24       .46    .33      30. F75       .57    .43
 6. S103      .47    .58      31. S92       .57    .47
 7. W44       .49    .57      32. W30       .59    .44
 8. S104      .50    .44      33. F33       .59    .48
 9. F3        .50    .37      34. F43       .59    .44
10. S79       .51    .27      35. W96       .59    .72
11. F34       .51    .50      36. F82       .60    .66
12. F91       .52    .40      37. W115      .60    .63
13. F99       .52    .33      38. W117      .60    .58
14. W16       .52    .29      39. S88       .60    .58
15. S63       .52    .33      40. S48       .60    .66
16. S74       .52    .48      41. F80       .60    .54
17. S89       .53    .42      42. S87       .61    .61
18. W32       .53    .42      43. W99       .61    .53
19. S60       .53    .42      44. W116      .61    .61
20. S24       .54    .53      45. W98       .62    .55
21. F2        .54    .41      46. S73       .63    .67
22. F56       .54    .40      47. S44       .63    .58
23. F111      .54    .33      48. W13       .65    .38
24. F90       .55    .43      49. W79       .64    .39
25. F87       .55    .35      50. F104      .63    .63

NATURAL SCIENCE

Rectangular Test

    Item No.  Diff.  Disc.        Item No.  Diff.  Disc.
 1. 2109      .23    .25      26. F5        .56    .53
 2. 969       .17    .65      27. 953       .57    .39
 3. S51       .20    .30      28. 7109      .59    .69
 4. 941       .21    .53      29. F107      .59    .44
 5. F78       .22    .40      30. 9121      .61    .45
 6. 9106      .25    .34      31. 9123      .63    .33
 7. W71       .27    .29      32. F54       .65    .47
 8. F63       .29    .41      33. W54       .67    .43
 9. 964       .30    .33      34. W97       .69    .66
10. wee       .31    .35      35. F77       .72    .45
11. 993       .33    .53      36. 912       .72    .70
12. F74       .35    .42      37. W51       .74    .73
13. 999       .37    .37      38. 917       .75    .59
14. F81       .39    .40      39. 97        .77    .56
15. F70       .41    .36      40. W106      .79    .45
16. W34       .43    .36      41. we        .81    .59
17. 995       .44    .45      42. F37       .83    .56
18. 931       .45    .51      43. W27       .85    .53
19. 992       .46    .45      44. 970       .84    .63
20. F31       .46    .41      45. W12       .86    .40
21. 997       .47    .31      46. 951       .87    .50
22. 992       .49    .39      47. 961       .88    .49
23. 9105      .52    .67      48. 9122      .89    .46
24. 959       .54    .33      49. W93       .90    .30
25. W18       .55    .51      50. 6114      .93    .25

NATURAL SCIENCE

Multimodal Test

    Item No.  Diff.  Disc.        Item No.  Diff.  Disc.
 1. F114      .12    .40      26. S3        .78    .34
 2. S82       .12    .48      27. W104      .79    .62
 3. W60       .14    .31      28. F112      .79    .38
 4. F42       .16    .36      29. S96       .80    .51
 5. F113      .20    .51      30. W114      .80    .61
 6. F95       .20    .44      31. W53       .80    .44
 7. W101      .41    .81      32. W15       .80    .37
 8. W55       .42    .37      33. F67       .80    .68
 9. F16       .42    .25      34. W118      .81    .59
10. F47       .43    .35      35. S9        .81    .42
11. S19       .43    .51      36. S11       .81    .42
12. S32       .43    .31      37. S36       .81    .35
13. S95       .43    .47      38. F25       .82    .32
14. S109      .43    .43      39. F50       .82    .58
15. S98       .44    .37      40. W63       .82    .58
16. S71       .44    .60      41. W107      .82    .40
17. S22       .44    .53      42. W14       .89    .33
18. S18       .44    .49      43. W93       .90    .30
19. F9        .45    .43      44. S69       .91    .51
20. F106      .45    .54      45. F21       .92    .37
21. F117      .45    .66      46. W42       .92    .49
22. F44       .78    .34      47. S52       .95    .25
23. W49       .78    .40      48. S90       .95    .46
24. W111      .78    .40      49. F38       .95    .25
25. S93       .78    .70      50. W10       .97    .30

SOCIAL SCIENCE

Peaked Test

    Item No.  Diff.  Disc.        Item No.  Diff.  Disc.
 1. 9110      .49    .25      26. F29       .59    .41
 2. 959       .49    .29      27. W49       .56    .33
 3. F1        .49    .29      28. W24       .56    .69
 4. F69       .49    .31      29. 913       .56    .33
 5. 992       .49    .46      30. 9101      .57    .51
 6. F92       .50    .33      31. F50       .57    .43
 7. we        .50    .29      32. 947       .59    .33
 8. 9104      .50    .56      33. 915       .59    .53
 9. 912       .50    .44      34. W105      .59    .33
10. 961       .51    .31      35. W46       .59    .46
11. 97        .51    .22      36. F29       .59    .53
12. W109      .51    .35      37. F23       .59    .44
13. F10       .51    .22      38. W97       .59    .32
14. 969       .53    .35      39. 976       .60    .39
15. W55       .53    .39      40. F91       .60    .30
16. W45       .53    .36      41. F39       .60    .21
17. W90       .53    .36      42. F30       .61    .49
18. F90       .54    .33      43. 994       .61    .45
19. W2        .54    .41      44. 925       .61    .36
20. W93       .54    .25      45. F26       .61    .65
21. 94        .54    .33      46. W103      .62    .39
22. 994       .54    .67      47. W102      .62    .43
23. 91        .55    .23      48. F63       .63    .54
24. W16       .55    .23      49. 922       .63    .41
25. W99       .55    .35      50. F27       .63    .41

SOCIAL SCIENCE

Rectangular Test
    Item No.  Diff.  Disc.        Item No.  Diff.  Disc.
 1. 993       .22    .22      26. F79       .57    .23
 2. 620       .24    .44      27. F24       .59    .21
 3. W61       .29    .31      28. 964       .59    .23
 4. 940       .30    .39      29. F67       .60    .21
 5. 959       .31    .55      30. F20       .63    .33
 6. F45       .32    .56      31. 995       .65    .42
 7. 916       .33    .20      32. F11       .66    .49
 8. 967       .34    .36      33. F6        .67    .53
 9. 9101      .36    .31      34. F99       .69    .56
10. 937       .39    .26      35. W53       .70    .43
11. F49       .39    .32      36. 995       .71    .52
12. S100      .40    .21      37. 919       .73    .56
13. F31       .41    .36      38. 9102      .74    .36
14. F9        .44    .60      39. F57       .75    .45
15. 717       .45    .35      40. 9120      .76    .50
16. F79       .46    .60      41. F14       .77    .30
17. F92       .47    .31      42. 966       .80    .24
18. we        .49    .33      43. W38       .83    .46
19. 92        .49    .46      44. F51       .85    .62
20. W65       .50    .29      45. 9107      .86    .51
21. F2        .51    .39      46. F41       .87    .59
22. 979       .53    .42      47. W63       .88    .57
23. 929       .54    .21      48. 921       .89    .55
24. 999       .55    .43      49. W7        .91    .51
25. W77       .56    .33      50. W12       .92    .23

SOCIAL SCIENCE

Multimodal Test

    Item No.  Diff.  Disc.        Item No.  Diff.  Disc.
 1. F75       .09    .23      26. was       .76    .50
 2. 941       .13    .39      27. F99       .76    .32
 3. 930       .14    .31      28. F97       .77    .42
 4. 942       .16    .21      29. F59       .77    .64
 5. 945       .22    .34      30. F13       .77    .36
 6. F65       .35    .33      31. 936       .77    .36
 7. F93       .35    .42      32. 934       .77    .36
 8. 944       .35    .47      33. 997       .79    .47
 9. 99        .36    .22      34. 996       .79    .40
10. F43       .36    .39      35. 931       .79    .40
11. W56       .36    .49      36. W29       .79    .29
12. was       .39    .30      37. W73       .79    .34
13. W59       .39    .34      38. F35       .79    .29
14. 9111      .40    .42      39. F5        .79    .40
15. F47       .40    .54      40. W22       .79    .45
16. W43       .40    .30      41. F42       .91    .51
17. W54       .40    .34      42. F44       .91    .26
18. F6        .40    .26      43. 999       .92    .37
19. 924       .75    .45      44. 960       .93    .34
20. 949       .76    .39      45. F36       .93    .34
21. 935       .76    .21      46. F32       .93    .46
22. 926       .76    .50      47. 973       .94    .43
23. 713       .76    .44      48. W50       .95    .40
24. W48       .76    .32      49. W5        .95    .25
25. 962       .76    .27      50. F7        .95    .25