THE COMPARATIVE EFFECTIVENESS OF DIFFERENT ITEM ANALYSIS TECHNIQUES IN INCREASING CHANGE SCORE RELIABILITY

Thesis for the Degree of Ph.D.
MICHIGAN STATE UNIVERSITY
LINDA D. MITCHELL
1970

This is to certify that the thesis entitled THE COMPARATIVE EFFECTIVENESS OF DIFFERENT ITEM ANALYSIS TECHNIQUES IN INCREASING CHANGE SCORE RELIABILITY, presented by Linda D. Mitchell, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Education.

Major professor: [signature]
Date: July 1, 1970

ABSTRACT

By Linda D. Mitchell

Four different procedures for selecting items to measure individual change were studied to determine which would result in sets of items with the highest change score reliability. The four methods of item analysis used for these change items were: selection on the basis of change item score variance; selection on the basis of pretest response frequency; selection on Saupe's correlation between change item score and total score; and selection on triserial correlation. The study was specifically undertaken to determine whether these methods of change item analysis could lead to the selection of more reliable subsets of items than could be obtained by randomly choosing items from a pool. Comparisons between the different methods were also made.

The sample used for item analysis and cross-validation was a group of 263 students at Michigan State University who had been tested on the Inventory of Beliefs as freshmen in 1958, and again as juniors in 1961.
Half of this sample was assigned to an initial item analysis group. On the basis of their responses the four item analyses were conducted, and subsets of 15, 30, 60, and 90 items were chosen by each procedure from the original pool of 120 items. In addition, a control procedure of random selection was used to select item subsets. Items were scored on both a one-to-four scale and a zero-one dichotomy. Item analyses were carried out separately for these two scoring procedures.

The items selected by the item analysis methods were then scored for the cross-validation group. Change score reliabilities were calculated based upon these responses. To obtain the best descriptive comparison, all reliability estimates were computed for the entire cross-validation group of 131 students. To test the hypotheses of the study, the cross-validation group was divided into smaller independent samples, and change score reliabilities for item subsets chosen by different methods were computed on these samples. Hypotheses were tested by using a two-way analysis of variance with Tukey post hoc comparisons for mean differences.

The results of the analysis showed that when the items were scored on a one-to-four scale, three methods of item analysis resulted in significantly higher change score reliability than did random selection. Saupe's $r_{dD}$ was the most successful in producing high change score reliability. Selection on the basis of pretest frequency and selection on change score variance were equally effective.

When the items were scored on a zero-one basis, three methods of item analysis resulted in greater change score reliability than did random selection. These were: selection on change variance, selection on pretest frequency, and triserial correlation. In this case, Saupe's correlation was not superior to random selection.
No significant differences were found between the three successful methods of change item analysis.

A THESIS
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY
Department of Educational Psychology
1970

ACKNOWLEDGMENTS

The author expresses her sincere thanks to Dr. William A. Mehrens, Chairman of the Guidance Committee, for his counsel and assistance throughout her doctoral program and in the experimental work and preparation of the manuscript for this study. The helpful suggestions and editorial comments of members of the Guidance Committee--Dr. Andrew Porter, Dr. Leroy Olson, and Dr. Norman Bell--are gratefully acknowledged. Special thanks are also extended to Dr. Irvin J. Lehmann, who generously provided access to the data used in this study.

The financial support of an NDEA Title IV Fellowship enabled the author to carry out her doctoral program of coursework and research at Michigan State University.

TABLE OF CONTENTS

LIST OF TABLES

CHAPTER

I. THE PROBLEM
   Purpose of This Study
   Hypotheses
   Theoretical Rationale
   An Overview

II. REVIEW OF LITERATURE
   Summary

III. DESIGN OF THE STUDY
   The Sample
   The Instrument
   Design
   Item Analysis Procedures
   Testable Hypotheses
   Statistical Analysis
   Summary

IV. RESULTS
   Results for One-to-Four Scoring
   Testing Hypotheses for One-to-Four Scoring
   Results for Zero-One Scoring
   Testing Hypotheses for Zero-One Scoring
   Summary

V. SUMMARY AND CONCLUSIONS
   Summary
   Conclusions
   Discussion
   Implications for Future Research

BIBLIOGRAPHY

APPENDIX

LIST OF TABLES

Change score reliability coefficients computed for the total cross-validation sample using the one-to-four scoring system
Change score reliability coefficients computed for independent cross-validation samples using the one-to-four scoring system
Two-way analysis of variance for the effects of item analysis method and number of items on change score reliability (with the one-to-four scoring system)
Differences between reliability estimates for items chosen by different item analysis methods (one-to-four scoring)
Change score reliability coefficients computed for the total cross-validation sample using the zero-one scoring method
Change score reliability coefficients computed for independent cross-validation samples using the zero-one scoring system
Two-way analysis of variance for the effects of item analysis method and number of items on change score reliability (with zero-one scoring)
Differences between mean change score reliabilities for items chosen by different methods. (Scores are Fisher r-to-Z transforms.)
A.1 Listing of subscales in which each change-item first appeared after item analysis with one-to-four scoring system
A.2 Listing of subscales in which each change-item first appeared after item analysis with zero-one scoring
A.3 Percentage of item overlap for scales chosen by different item analysis methods--15 items, one-to-four scoring
A.4 Percentage of item overlap for scales chosen by different item analysis methods--30 items, one-to-four scoring
A.5 Percentage of item overlap for scales chosen by different item analysis methods--60 items, one-to-four scoring
A.6 Percentage of item overlap for scales chosen by different item analysis methods--90 items, one-to-four scoring
A.7 Percentage of item overlap for scales chosen by different item analysis methods--15 items, zero-one scoring
A.8 Percentage of item overlap for scales chosen by different item analysis methods--30 items, zero-one scoring
A.9 Percentage of item overlap for scales chosen by different item analysis methods--60 items, zero-one scoring
A.10 Percentage of item overlap for scales chosen by different item analysis methods--90 items, zero-one scoring

CHAPTER I

THE PROBLEM

A methodological problem frequently encountered by researchers in education is how to obtain measures of growth or change for subjects over a given period of time. One approach to this problem has been to calculate the change score for each individual, using the formula:

$$D = X - Y, \tag{1}$$

where D is the change score, Y is the score at time 1, and X is the score at time 2. D has also been called a gain score or discrepancy score.

Researchers who have attempted to use such change scores, however, have been plagued by one persistent psychometric problem. These change scores are remarkably unreliable.
Noted measurement experts such as Lord, Horst, Webster, and Bereiter have long recognized this problem (Harris, 1963). When the researcher is primarily interested in measuring change for a group, this problem of low reliability is not too serious; however, if he wishes to make meaningful comparisons between individuals on the basis of their growth or attitude change, then the lack of reliability becomes crucial.

When the traditional formula for the reliability of change scores is examined, two factors seem to be necessary to obtain high change score reliability. This formula, as derived by Gulliksen (1950, p. 353), is:

$$r_{DD} = \frac{\bar{r} - r_{XY}}{1 - r_{XY}}, \tag{2}$$

where $\bar{r}$ is the mean of $r_{XX}$ and $r_{YY}$. From this formula it appears that in order to obtain a high value for $r_{DD}$, the internal consistency of the test at time 1 ($r_{YY}$) and at time 2 ($r_{XX}$) should be high, but the stability coefficient for the test over time ($r_{XY}$) should be somewhat lower. Thus change score reliability can be increased if the test-retest correlation can be reduced while test homogeneity (or internal consistency) is maintained at a high level for each separate administration of the test.

When the reliability of an instrument is unsatisfactory, a common psychometric practice is to construct more items, since longer tests are usually more reliable. The obvious drawback in this procedure is that for many testing situations the number of items must be kept to a minimum for practical considerations of time and economy. When this is the case, item analysis techniques are usually employed to select subsets of the most discriminating items from the original pool so that the test can be shortened without seriously reducing its reliability.

Ordinary item analysis procedures, usually based upon a single test administration, are designed to improve test internal consistency or to yield a test which correlates highly with some criterion. Such methods are not guaranteed to work for change score reliability.
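The trade-off expressed by Formula (2) can be illustrated with a few lines of Python. This is only a sketch: the function name and the coefficient values are hypothetical and are not taken from the study; they are chosen to show that, with internal consistency held fixed, a lower test-retest correlation gives a higher change score reliability.

```python
def gulliksen_change_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Formula (2): r_DD = (r_bar - r_XY) / (1 - r_XY), with r_bar = mean(r_XX, r_YY)."""
    if r_xy >= 1.0:
        raise ValueError("r_XY must be below 1 for the formula to be defined")
    r_bar = (r_xx + r_yy) / 2.0
    return (r_bar - r_xy) / (1.0 - r_xy)


# Hypothetical coefficients: same internal consistency, different stability.
high_stability = gulliksen_change_reliability(0.90, 0.90, 0.80)
low_stability = gulliksen_change_reliability(0.90, 0.90, 0.50)
print(round(high_stability, 2), round(low_stability, 2))  # prints 0.5 0.8
```

With these illustrative values, reducing the stability coefficient from .80 to .50 raises the change score reliability from .50 to .80, which is the pattern the text describes.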
Theorists such as Bereiter (1963), Saupe (1966 and 1961), and Lord (1968, p. 331) have suggested that a researcher who desires to construct an instrument sensitive to individual change should use item analysis techniques suited for that purpose. Several new techniques for such item analyses have recently appeared in the literature.

One of these methods is based upon observing the response changes to items over time (Gruber and Weitman, 1962). Items for which there is a "moderate" change frequency when a group is tested and retested at a later date should be selected. This tends to eliminate those items for which the group exhibited little change in response over time as well as items for which the group displayed a universal change over time. In other words, items for which there is a "moderate" rate of change will be items for which there was variation between subjects in their response changes.

A second method uses response frequency to items on the pretest only. With this method the expected direction of change must be known in advance (Gruber and Weitman, 1962). The experimenter then selects items which had a low percentage of negative responses on the pretest if a high percentage of negative responses is expected on the posttest, or vice versa.

In a third method items are selected which have a high correlation between item response change and total change score. This correlation is determined from a formula derived by Saupe (1966), which is equivalent to the Pearson Product Moment correlation value.

A fourth method of item analysis, which had not appeared in the literature reviewed for this area, was also employed in this study. With this method items are selected if they have high triserial correlation values when the correlation between total change score and trichotomized change in item response is computed.
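The two response frequency methods described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the decision rules used in the study: the function names are hypothetical, the first method is rendered as "keep the k items with the largest change score variance," and the second assumes 0/1 items with an expected shift toward the response scored 1. Rows are persons and columns are items.

```python
from statistics import pvariance


def change_items(pre, post):
    """Item change scores d = x - y (posttest minus pretest), per person and item."""
    return [[x - y for x, y in zip(post_row, pre_row)]
            for pre_row, post_row in zip(pre, post)]


def select_by_change_variance(pre, post, k):
    """First method (sketch): keep the k items whose change scores vary most."""
    d = change_items(pre, post)
    n_items = len(d[0])
    variances = [pvariance([row[i] for row in d]) for i in range(n_items)]
    ranked = sorted(range(n_items), key=lambda i: variances[i], reverse=True)
    return sorted(ranked[:k])


def select_by_pretest_frequency(pre, k):
    """Second method (sketch): assumes 0/1 items and an expected shift toward 1,
    so items with the lowest pretest proportion of 1s are kept."""
    n_items = len(pre[0])
    p_pre = [sum(row[i] for row in pre) / len(pre) for i in range(n_items)]
    ranked = sorted(range(n_items), key=lambda i: p_pre[i])
    return sorted(ranked[:k])


# Hypothetical responses of four persons to three dichotomous items.
pre = [[0, 1, 0], [0, 1, 1], [0, 1, 0], [1, 1, 1]]
post = [[1, 1, 0], [0, 1, 1], [1, 1, 1], [1, 1, 0]]
print(select_by_change_variance(pre, post, 1))  # prints [2]
print(select_by_pretest_frequency(pre, 1))      # prints [0]
```

In this toy data the second item never changes, so neither rule selects it. The two rules can disagree, as they do here, because the pretest rule has no access to the observed changes.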
Because of the relative newness of these item analysis methods there has been little empirical research to determine whether or not they could effect increases in change score reliability. Also, the comparative efficiency and effectiveness of these different procedures is completely unknown. Such information is sorely needed by researchers who face the problem of constructing instruments to reliably measure growth or attitude change for individuals over time (Lord, 1968, p. 331).

To provide further information on this topic, an empirical study was designed to examine these various item analysis procedures and their effects upon change score reliability.

Purpose of This Study

The purpose of this study was to determine whether use of the item analysis methods previously discussed could increase the reliability of change scores on a collegiate attitude survey. Four specific questions of central importance to this issue were raised.

1. Which of these four item analysis methods would result in selecting a subset of items with the highest change score reliability?

2. Which correlational method would result in the higher estimate of change score reliability for selected subsets of items?

3. Would the response frequency method, based upon the variances of response changes from pretest to posttest, result in higher change score reliability than the method which uses only pretest response frequency?

4. Could reliability of change scores for items selected on the basis of pretest response frequency exceed the reliability of an equal number of randomly chosen items?

This fourth question was particularly interesting because of its practical significance for test construction. In many attempts to measure change the experimenter simply does not have time to construct his instrument and run a complete item analysis on test-retest data before he can gather his data. (This is especially true for longitudinal studies.)
Thus, if a method could be developed to eliminate useless items on the basis of pretest characteristics alone, it would be extremely helpful and time-saving for the researcher and his subjects.

Hypotheses

On the basis of reliability and item analysis theory, four general hypotheses were formulated in an attempt to answer the questions under investigation in this study. These hypotheses were:

1. Use of Method III (computing the PPM correlation for total change score and item change response) would result in a subset of items with higher change score reliability than that of item subsets chosen by any other method or by random selection.

2. Method IV (computing the triserial correlation between change scores and change in item response) would result in a subset of items with higher change score reliability than could be obtained for items chosen by response frequency methods or by random selection.

3. Method I (selecting items which showed variance in changes in response frequency over time) would result in a subset of items with higher change score reliability than that of items selected by Method II (selection on the basis of pretest response only).

4. A subset of items could be selected by Method II (pretest response frequency) which would have higher change score reliability than a randomly selected subset of items.

Theoretical Rationale

The idea of attacking the unreliability of change scores at the item level can be credited to Bereiter, who formulated the concept of the change item. The change item was defined in this way:

A single item administered on two occasions yields an item change score which is the difference between the item scores on the two occasions. If the item is scored dichotomously, 1 or 0, on each occasion, then the change item may take any of three values, 1, 0, or -1. (Bereiter, 1963, p. 10)

This definition can be expanded to include items which have more than 0 or 1 as a possible score on each occasion, such as those found on many attitude scales. Change items may thus be scored for both direction and amount of change, and change item scores may be summed, like ordinary item scores, to get a total change score,

$$D = \sum_i d_i, \tag{3}$$

where

$$d_i = x_i - y_i. \tag{4}$$

In this definition $y_i$ is the individual's response to item i at time 1, and $x_i$ is his response to item i at time 2.

Bereiter believed that item analysis procedures could be carried out on the change items to improve change score reliability. Furthermore, he maintained that change score reliability could be adequately defined using a classical definition of reliability. Using change item scores the formula becomes:

$$r_{DD} = 1 - \frac{\sum_i S^2_{d_i}}{\sum_i S^2_{d_i} + \sum_{i \neq j} C_{d_i d_j}}, \tag{5}$$

where $S^2_{d_i}$ is the variance of a change item and $C_{d_i d_j}$ is the covariance for the change item scores of items i and j. Bereiter then hypothesized that increases in change score reliability could be attained by selecting items in such a way as to maximize the change item covariances.

Clearly the two methods of item analysis which use correlations between total change score and item change score as the indices for selecting items are directly based on this line of thought. Change items which have high correlations with total change score must have high intercorrelations with each other.

Further consideration of the two correlational indices for change item analysis reveals that they correspond directly to two popular indices often used for selection of regular dichotomous items. The familiar point-biserial correlation coefficient for dichotomous items is actually a Pearson Product Moment correlation (Magnusson, 1966, p. 199).
In addition, the triserial correlation is derived using the same assumptions as the well-known biserial correlation, and the formulae for these two statistics are identical, except for the inclusion of the parameters for the third category in the triserial expression. Expressions for both biserial and triserial correlations can be derived from the general expression for the multiserial correlation coefficient given by Jaspen (1946). (A more complete discussion of this topic follows in the literature review in Chapter II.)

These similarities should help to answer the question: Which correlational method of change item analysis will result in greater change score reliability? Lord (1968, p. 344) pointed out that when there are ability differences between item analysis and cross-validation groups, the biserial correlation, which is unaffected by the factor of item difficulty, might be better for selecting items with high reliability across groups; however, when the groups are similar, the point biserial method might produce a more reliable test. In this experiment subjects from the same population were randomly assigned to item analysis and cross-validation groups. Since the two groups could be expected to be fairly similar, it was hypothesized that the PPM method (the point biserial method) would result in more reliable change scores than would the triserial index.

The rationale for selecting items on the basis of response frequency is apparent from Formula (5). One way to increase reliability is to increase the item covariance/variance ratio. Assuming item intercorrelations remain constant, this can be accomplished by selecting items which have large variances. As individual item variances are increased, item covariances must also increase, but the total of the item covariances will increase at a faster rate than the total of the item variances, since a set of k items has only k variances but k(k - 1) covariance terms.
The change items with the greatest variances will be those with moderate "difficulty" levels or frequencies of response change. This is clearly illustrated if two extreme cases for change items are considered. An item for which there was no change in response between testings will have a mean change score of 0 and a change variance of 0. Likewise an item for which there was a universal shift for the group from positive to negative response will have a mean change score of 1 and a change variance of 0. Such items can contribute nothing to reliability (Shoemaker, 1969); however, items which have moderate frequencies of response changes will have variances which are larger and can reflect differences in individual changes which are necessary for high change item covariances and, consequently, high change score reliability.

If the direction of change cannot be predicted, it is impossible to choose items on the basis of pretest response frequency to insure that they will have adequate response change. If the direction of the change can be predicted, then items can be chosen which are likely to have response shifts that will produce the desired rate of change. For example, if a shift toward a positive "Agree" response is expected over time, then items which initially have a high proportion of "Disagree" responses will be likely to have a moderate frequency of response changes over time. (Obviously the researcher must hope that a total shift to the "Agree" response does not occur.) This procedure is more risky than selecting items when the actual response changes and their variance can be computed from a complete set of test-retest data.
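The reasoning drawn from Formula (5) can be checked numerically. The sketch below computes Bereiter's expression from a matrix of change item scores (rows are persons, columns are change items) and then appends an item on which every person shifted by the same amount. The function names and the data are hypothetical, and population (divide by n) variances and covariances are assumed.

```python
def covariance_matrix(d):
    """Population covariance matrix of change items; rows of d are persons."""
    n, k = len(d), len(d[0])
    means = [sum(row[i] for row in d) / n for i in range(k)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in d) / n
             for j in range(k)] for i in range(k)]


def bereiter_reliability(d):
    """Formula (5): 1 - sum(S^2_di) / (sum(S^2_di) + sum over i != j of C_didj)."""
    c = covariance_matrix(d)
    k = len(c)
    sum_var = sum(c[i][i] for i in range(k))
    sum_cov = sum(c[i][j] for i in range(k) for j in range(k) if i != j)
    return 1.0 - sum_var / (sum_var + sum_cov)


# Hypothetical change scores (1, 0, or -1) of five persons on two items.
two_items = [[1, 1], [0, 0], [1, 0], [-1, -1], [0, 1]]
# The same data plus a third item on which everyone shifted by exactly 1.
with_constant_item = [row + [1] for row in two_items]

r_two = bereiter_reliability(two_items)
r_three = bereiter_reliability(with_constant_item)
```

Because the added item has zero variance and zero covariance with every other change item, `r_two` and `r_three` are equal: the item lengthens the scale without contributing anything to change score reliability, which is the point made about the two extreme cases.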
It was generally expected that correlational methods would be superior to response frequency methods for item selection because the correlational procedures depend upon both item variances and their intercorrelations, while the frequency methods fail to consider how an individual item covaries with others in the item pool.

An Overview

Further discussion of theoretical works and empirical research studies which are related to the problem of selecting items to measure change and the reliability of change scores is presented in the Review of Literature, Chapter II. An empirical study designed to compare several different change item analysis methods is described in Chapter III. In Chapter IV the method of statistical analysis used to test the hypotheses of this study and the results of that analysis are presented. The conclusions from this study, discussion of the results, and some implications for future research in this area have been summarized in the fifth and final chapter.

CHAPTER II

REVIEW OF LITERATURE

Whenever the problem of measuring change is considered, the researcher must be careful to specify whether he wishes to evaluate mean change for a group or to study relative changes between individuals within a group. The need for this distinction has been pointed out by Lord (1963), Webster (1963), and Tucker et al. (1966). If individual differences are the main interest, then the researcher must be concerned with the reliability of his observations of change (Webster, 1963).

Traditionally difference scores have been regarded as so unreliable that Gulliksen (1950, p. 354) urged that standardized test publishers should warn their users of this fact and actually report difference score reliability in their technical manuals. Lord (1958) urged that counselors should make very cautious interpretations when advising individuals on the basis of difference scores.
Concern over the reliability of difference scores led to the development of several different expressions for its estimation. One well-known expression for this reliability was given by Gulliksen (1950, p. 353):

$$r_{DD} = \frac{\bar{r} - r_{XY}}{1 - r_{XY}}. \tag{6}$$

The value for $\bar{r}$ is found by computing the mean of $r_{XX}$ and $r_{YY}$. Lord (1963) cautioned users of this formula to remember that it requires the assumption that $S^2_X = S^2_Y$.

The difference scores used in Gulliksen's formula were usually computed by subtracting an examinee's score on Test A from his score on Test B when A and B are composed of different test items. Change scores, however, are usually difference scores computed when the same test is administered to an individual on two separate occasions. For this reason Webster (1963) indicated that Formula (6) may be unsatisfactory for computing change score reliability. He noted that this formula derivation rests upon the assumption that errors of measurement are completely uncorrelated, but maintained that this assumption may be unrealistic when the same form of a test is administered twice to an individual. By substituting change scores into the familiar formula for Kuder-Richardson 20, he derived an expression for change score reliability which does not require this questionable assumption. Furthermore, it does not require that the test have equal variances at time 1 and time 2. This formula for change score reliability uses data at the item level and is written:

$$r_{DD} = \frac{k}{k-1}\left[1 - \frac{\bar{X} + \bar{Y} - 2\bar{f} - \sum_{i=1}^{k}(\bar{x}_i - \bar{y}_i)^2}{S^2_D}\right]. \tag{7}$$

In this expression, $\bar{f}$ is the mean number of items scored 1 on both occasions; $\bar{X}$ is the mean of scores at time 1; $\bar{Y}$, the mean of the scores at time 2; $\bar{x}_i$ is the mean score of all individuals in the group on item x; $\bar{y}_i$ is the mean for item y; $S^2_D$ is the variance of the total change scores; and k is the number of items.

Bereiter (1963) used the general expression of Cronbach's coefficient alpha and, by substituting change item scores for traditional item scores, he defined change score reliability as:

$$r_{DD} = 1 - \frac{\sum_i S^2_{d_i}}{\sum_i S^2_{d_i} + \sum_{i \neq j} C_{d_i d_j}}. \tag{8}$$

$S^2_{d_i}$ is the variance of the change item scores and $C_{d_i d_j}$ is their covariance. This expression can be shown to be equivalent to Webster's derivation (Formula 7) except for the absence of the factor $k/(k-1)$ in the derivation by Bereiter. This computational formula for change score reliability also uses data at the item level. In addition, it has the advantage of removing the restriction of dichotomously scored test items.

Based upon this formula Bereiter developed a plan for manipulating change score reliability at the item level. He suggested that item analysis techniques could be used to select items which had large change item covariances. When the change item covariances are large for a set of items, there is large variability between subjects on their changes in response to these items. Such variance in change scores results in a lower stability coefficient, $r_{XY}$, and consequently a higher estimate of $r_{DD}$. It should be noted that this relationship exists regardless of which formula is used to calculate $r_{DD}$.

In an empirical study using an attitude questionnaire for college students, Webster and Bereiter (1963) reported that they were able to effect large gains in change score reliability when they employed such an item selection technique. Bereiter, however, makes little mention of the actual index or decision rule used in this item selection.

Horst (1966, p. 387) indicated that most item analysis procedures for raising reliability fall into one of two main categories: correlational and counting procedures. (Counting procedures use response frequency data.) It is obvious that this categorization scheme can be extended to the realm of change item analysis as well. This classification will be used in the remaining discussion of item analysis procedures designed for selecting items to give reliable change measures.
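Returning to the relation between Formulas (7) and (8) noted earlier in this chapter, the step that connects them can be written out. The sketch below uses only the symbols already defined, plus $\alpha_D$ as a label (introduced here for illustration) for coefficient alpha computed on the change items.

```latex
\begin{align*}
S_D^2 &= \sum_i S^2_{d_i} + \sum_{i \neq j} C_{d_i d_j}
  && \text{(variance of a sum of change items)} \\
r_{DD}^{(8)} &= 1 - \frac{\sum_i S^2_{d_i}}{S_D^2}
  = \frac{\sum_{i \neq j} C_{d_i d_j}}{S_D^2}
  && \text{(Formula 8)} \\
\alpha_D &= \frac{k}{k-1}\left[1 - \frac{\sum_i S^2_{d_i}}{S_D^2}\right]
  = \frac{k}{k-1}\, r_{DD}^{(8)}
  && \text{(coefficient alpha on change items)}
\end{align*}
```

The two expressions therefore differ only by the multiplier $k/(k-1)$, which shrinks toward 1 as the number of items grows: for subsets of 15 and 90 items it is $15/14 \approx 1.071$ and $90/89 \approx 1.011$, respectively.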
In 1962, Gruber and Weitman studied changes in item responses to an achievement test. They wanted to measure students' retention of subject matter over time using a pretest-posttest design. Although their goal was not to improve reliability of change scores per se, they suggested that changes in response frequency to items might be useful as a basis for item selection in the measurement of change.

In their study the researchers employed two methods of item selection. The first method was based upon observing shifts in response frequency toward a specified, optimal level of difficulty from initial testing to retesting. In the second procedure items were selected on the basis of their pretest response level only. The researchers emphasized the necessity of knowing in advance the direction in which response change is likely to occur over time.

The results of this study indicated that it was possible to improve discrimination on the posttest by selecting items on the basis of response shifts. Selecting items on the basis of their pretest difficulty level did not significantly improve the discrimination between subjects on the posttest. However, the authors felt that this could have been due to the limitations of ceiling effect on their instrument and the small sample size rather than to ineffectiveness of the method itself. No data for the change score reliability were reported, although these estimates could have been easily computed. Thus it is not known if these item selection techniques could have improved the reliability estimate for changes in retention.

The first correlational method of change item analysis was derived by Saupe (1966). Based upon Gulliksen's formula for the correlation between a component element and a composite, Saupe's formula for the correlation of change item with total change score is:

$$r_{dD} = \frac{C_{xX} + C_{yY} - C_{xY} - C_{yX}}{\sqrt{S^2_x + S^2_y - 2C_{xy}}\,\sqrt{S^2_X + S^2_Y - 2C_{XY}}}, \tag{9}$$

where x and y denote item scores, X and Y are total scores, and C is covariance.
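Formula (9) can be verified against a directly computed Pearson correlation between the item change score d = x - y and the total change score D = X - Y. The following is a minimal sketch with hypothetical data and function names; population covariances are assumed, although the divisor cancels in the ratio.

```python
from math import sqrt


def cov(a, b):
    """Population covariance of two equal-length score lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)


def saupe_r_dD(x, y, X, Y):
    """Formula (9): correlation of item change (x - y) with total change (X - Y)."""
    numerator = cov(x, X) + cov(y, Y) - cov(x, Y) - cov(y, X)
    var_d = cov(x, x) + cov(y, y) - 2 * cov(x, y)
    var_D = cov(X, X) + cov(Y, Y) - 2 * cov(X, Y)
    return numerator / sqrt(var_d * var_D)


def pearson(a, b):
    """Pearson product-moment correlation computed from the raw scores."""
    return cov(a, b) / sqrt(cov(a, a) * cov(b, b))


# Hypothetical scores for five persons: item (x, y) and total (X, Y),
# with x and X taken at time 2 and y and Y at time 1.
x, y = [3, 2, 4, 1, 3], [2, 2, 1, 1, 4]
X, Y = [60, 55, 70, 40, 58], [50, 54, 45, 42, 66]

d = [xi - yi for xi, yi in zip(x, y)]
D = [Xi - Yi for Xi, Yi in zip(X, Y)]
r_formula = saupe_r_dD(x, y, X, Y)
r_direct = pearson(d, D)
```

The two values agree to rounding error, since the numerator of Formula (9) is the covariance of d with D and the two radicals are the standard deviations of d and D.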
Lord (1968, p. 331) urged that empirical studies be undertaken in this area and stressed the need for development of still other item analysis procedures for change items.

The primary reason that most traditional item analysis methods cannot be used with change scores lies in the nature of the change item itself. As defined by Bereiter (1963), the change item, $d_i = x_i - y_i$, can have at least three values, 0, 1, or -1. This rules out the possibility of using the biserial correlation which is frequently employed as an index for item selection. (The biserial correlation, point biserial correlation, tetrachoric and phi coefficient all require dichotomously scored items.)

Jaspen (1946) developed a formula for the triserial correlation. This was intended to serve as a computational formula for a correlation between two variables when both were assumed to have underlying normal, continuous distributions, but when one distribution had been artificially divided into three categories. This expression is a direct counterpart to the biserial correlation used for computing correlations between a continuous variable and a variable classified into two artificial categories. Jenkins (1956) presented a simplified version of Jaspen's formula for triserial r:

$$r_{tris} = \frac{M_h y_h + M_m (y_l - y_h) - M_l y_l}{\sigma\left[\dfrac{y_h^2}{p_h} + \dfrac{(y_l - y_h)^2}{p_m} + \dfrac{y_l^2}{p_l}\right]}, \tag{10}$$

where M = mean, y = curve ordinate, p = proportion of cases in a category, and $\sigma$ = s.d. of scores. The letters h, m, and l represent high, medium, and low categories.

If the item change scores of 1, 0, and -1 are used to designate the divisions of high, medium, and low, it is seen that $r_{tris}$ could be written as an index for change item analysis:

$$r_{tris} = \frac{M_1 y_1 + M_0 (y_{-1} - y_1) - M_{-1} y_{-1}}{\sigma\left[\dfrac{y_1^2}{p_1} + \dfrac{(y_{-1} - y_1)^2}{p_0} + \dfrac{y_{-1}^2}{p_{-1}}\right]}. \tag{11}$$

Triserial r, however, had never been used as an index for item selection, despite its availability.
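Formula (11) can be sketched in Python using only the standard library. The function name is hypothetical, and the sketch makes the usual reading of the formula explicit as an assumption: $y_1$ and $y_{-1}$ are taken to be the ordinates of the unit normal curve at the cut points separating the categories, located from the observed category proportions, and $\sigma$ is the population standard deviation of the total change scores.

```python
from statistics import NormalDist, fmean, pstdev


def triserial_r(item_change, total_change):
    """Formula (11) sketch: triserial r of item change (-1, 0, 1) with total change."""
    groups = {-1: [], 0: [], 1: []}
    for d, D in zip(item_change, total_change):
        groups[d].append(D)
    if any(len(g) == 0 for g in groups.values()):
        raise ValueError("all three change categories must be observed")
    n = len(total_change)
    p = {c: len(g) / n for c, g in groups.items()}   # category proportions
    m = {c: fmean(g) for c, g in groups.items()}     # mean total change per category
    unit_normal = NormalDist()
    # Ordinates of the unit normal curve at the two category boundaries.
    y_low = unit_normal.pdf(unit_normal.inv_cdf(p[-1]))          # between -1 and 0
    y_high = unit_normal.pdf(unit_normal.inv_cdf(p[-1] + p[0]))  # between 0 and 1
    numerator = m[1] * y_high + m[0] * (y_low - y_high) - m[-1] * y_low
    spread = (y_high ** 2 / p[1]
              + (y_low - y_high) ** 2 / p[0]
              + y_low ** 2 / p[-1])
    return numerator / (pstdev(total_change) * spread)
```

When the trichotomized variable really does arise from cutting a normal variable that is linearly related to the continuous one, this estimate recovers the underlying correlation. With real change items those assumptions are only approximate, so the index is best read as a ranking device for items rather than an exact correlation.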
Summary

To summarize this review of literature on item analysis methods for change items and change score reliability, several key points should be noted. First, in recent years there has been great interest in the problems of measuring change and considerable concern over the lack of reliability for change scores. This low reliability made it extremely difficult to predict change for an individual, or to make counseling or placement decisions based on change scores.

Several different formulae for change score reliability were discussed in this chapter. It seems best to conclude that when the same form of a test is used for both initial and final testing, then Bereiter's Formula (8) or Webster's Formula (7) for change score reliability is preferable to the traditional formula derived by Gulliksen (Formula 6).

Regardless of which formula is used for $r_{DD}$, several researchers have suggested that low change score reliability can perhaps be improved through item analysis procedures. Three techniques have been suggested in the literature. These are: selection of items on the basis of observed shifts in response; selection of items on the basis of pretest response frequency when the direction of expected change can be predicted; and use of a correlational index based upon correlation between change item score and total change score. The development of a formula for triserial correlation was also presented, and this statistic was suggested as a fourth possible index for item selection in the measurement of change.

Reports on studies comparing these various change item analysis methods are "conspicuous by their absence" in this review. It is readily apparent that empirical investigation of these methods is essential to determine if they can be successfully used to improve change score reliability. It was toward this end that the study of change item analysis techniques, described in Chapter III, was undertaken.
CHAPTER III

DESIGN OF THE STUDY

An empirical study was designed to compare the change score reliability for subsets of items selected by four different change item analysis procedures. The four procedures compared were Saupe's item-total score correlation, $r_{dD}$; triserial correlation between change item and change score; selection of items having high variance in change scores; and selection of items on the basis of pretest response frequency. In addition, a control method, selecting items randomly from the original item pool, was used.

The Sample

In the fall of 1958, the first-term freshman class at Michigan State University was tested on a variety of achievement, aptitude, attitude, and personality measures. All freshmen who met the following criteria were included in the population: (1) the student must have been a first-time freshman, not a past dropout or a transfer from another university; (2) the student must have been a native-born American.

In 1961, a sample was drawn from this original population. This group consisted of students who were still enrolled in the university at that time. These students, then juniors at MSU, were retested on the same measures. The test-retest data from 263 students in this sample were used for this item analysis experiment.

The Instrument

The instrument selected for use in this study was the Inventory of Beliefs, Form I. This attitude survey was developed by the Cooperative Study of Evaluation in General Education under the sponsorship of the American Council on Education Committee on Measurement and Evaluation. The scale was designed to measure an individual's tendency to subscribe to stereotypic beliefs (Lehmann and Dressel, 1963).

Items on this inventory were taken from an original pool of one thousand items, composed by a panel of counselors and evaluation officers from twenty colleges which participated in the Cooperative Study.
The 120 statements which were selected for the final scale were written in the form of "pseudo-rational cliches" (American Council on Education, 1953). Some sample items from this inventory are:

"No world organization should have the right to tell Americans what they can or cannot do."

"We would be better off if there were fewer psychoanalysts probing and delving into the human mind."

"When things seem black, a person should not complain, for it may be God's will."

"Most Negroes would become overbearing and disagreeable if not kept in their place."

There were four possible responses to each item: Strongly Agree, Agree, Disagree, and Strongly Disagree. Two separate scoring schemes were used in this study. The scoring instructions from the Instructor's Manual award the examinee one point for each Disagree or Strongly Disagree response. The second scoring scheme used in the item analysis study awarded one point for a response of Strongly Agree; two points for Agree; three points for Disagree; and four points for Strongly Disagree.

Lehmann and Dressel (1963, p. 27) characterized the higher scorer as "mature, flexible, adaptive, and democratic in his relationships with others; a low scorer is immature, rigid in outlook, compulsive, and authoritarian in his relationships with others."

The reliability coefficients reported for this scale in the MSU study ranged from .68 to .95, with a median value of .86 (Lehmann and Dressel, 1963).

The Inventory of Beliefs was considered appropriate for use in this change item analysis study for the following reasons:

1. The scale was designed expressly for the purpose of measuring the attainment of educational objectives. Thus change scores over time were expected to be fairly large and could be meaningfully interpreted.

2. The instrument was professionally developed. Considerable effort went into construction and item analysis.
Reliability and validity for this scale had been demonstrated (Dressel and Mayhew, 1952).

3. The internal consistency reliabilities reported were high, but test-retest reliability coefficients after a lapse of time were lower, thus indicating a fairly wide range of individual differences in attitude change.

Design

An item analysis, cross-validation design was employed in this study. The sample was randomly split into two groups. The item analysis group consisted of 132 students; the cross-validation group was composed of 131 students. Items from the 1958 and 1961 test administrations for these students were scored by both the zero-one scoring method and the one-to-four method described earlier. Item change scores were computed in accordance with Formula (4), and total change scores were computed for each student.

The data from the item analysis group, scored on the zero-one basis, were submitted to four different item analysis procedures and a control procedure of random selection. Data scored with the one-to-four system were submitted to three item analysis procedures and random selection. (It was necessary to omit the triserial correlation method because it was only appropriate for dichotomously scored items.) Subsets of 15, 30, 60, and 90 items were chosen under each procedure.

These item subsets were used for computing reliability estimates for change scores on the cross-validation group data. The actual computation formula for the change score reliability was obtained by substituting Bereiter's definition of change score reliability (Formula 8) into Webster's expression for the Kuder-Richardson 20 for change scores (Formula 7) to get a change score version of Cronbach's coefficient alpha (Cronbach, 1951):

$$ r_{DD} = \frac{k}{k-1}\left[ 1 - \frac{\sum_i S_{d_i}^2}{\sum_i S_{d_i}^2 + \sum_{i \neq j} C_{d_i d_j}} \right] \tag{12} $$

where k is the number of items in the subset.

Item Analysis Procedures

Method I was an item analysis procedure based on the variance of the change item scores.
After the change item scores, $d_i$, were computed, the mean change score $\bar{d}_i$ and the change score variance $S_{d_i}^2$ were found for each item. Items with the largest values for $S_{d_i}^2$ were selected. On this basis subsets of 15, 30, 60, and 90 items were chosen from the original set of 120 items.

Method II required that items for the subsets be chosen on the basis of pretest response frequency. With this method it was necessary to take into account the expected direction of the change. Because the Inventory of Beliefs had been developed to measure attainment of objectives of higher education, it seemed reasonable to predict that students' scores would increase over time. (Data from the Lehmann and Dressel study upheld this prediction.) Item means, $\bar{x}_i$, were computed for each item on the pretest (the measure taken in 1958, when the students were freshmen). Items with the lowest mean scores were selected into the 15, 30, 60, and 90 item subsets.

Method III was a correlational item analysis procedure for which the index of item selection was the expression derived by Saupe (Formula 9):

$$ r_{dD} = \frac{C_{xX} + C_{yY} - C_{xY} - C_{Xy}}{\sqrt{S_x^2 + S_y^2 - 2C_{xy}}\;\sqrt{S_X^2 + S_Y^2 - 2C_{XY}}} $$

Items which had the greatest correlations with total change score were selected into the test subsets.

For Method IV the triserial correlation coefficients between change item and total change score were computed according to Formula (11):

$$ r_{tris} = \frac{M_1 y_1 + M_0 (y_{-1} - y_1) - M_{-1} y_{-1}}{\sigma \left[ \dfrac{y_1^2}{p_1} + \dfrac{(y_{-1} - y_1)^2}{p_0} + \dfrac{y_{-1}^2}{p_{-1}} \right]} $$

Items with the highest positive values for $r_{tris}$ were selected into the test subsets. This method, of course, was only applied to the data that had been scored on a zero-one basis on the original tests.

The control method consisted of randomly selecting subsets of 15, 30, 60, and 90 items for comparison with those which had been chosen by the systematic item analysis procedures.
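The two selection rules that need no correlation (Methods I and II) and the change score version of coefficient alpha can be sketched in a few lines. Everything named here is hypothetical: the function names, the examinee-by-item matrices `pre` and `post`, and the simulated responses. The change item follows the text's convention, d = x - y, with x taken as the pretest score.

```python
import numpy as np

def select_by_change_variance(pre, post, k):
    """Method I: indices of the k items whose change scores vary most."""
    change = pre - post                      # change item scores, d = x - y
    variances = change.var(axis=0)
    return np.argsort(-variances, kind="stable")[:k]

def select_by_pretest_mean(pre, k):
    """Method II: indices of the k items with the lowest pretest means
    (appropriate when scores are expected to rise over time)."""
    return np.argsort(pre.mean(axis=0), kind="stable")[:k]

def change_score_alpha(pre, post, items):
    """Coefficient alpha computed on the change scores of an item subset."""
    change = (pre - post)[:, items]
    k = change.shape[1]
    item_var_sum = change.var(axis=0, ddof=1).sum()
    # Variance of the total change score equals the sum of the item
    # variances plus the sum of all inter-item covariances.
    total_var = change.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Hypothetical zero-one responses: 50 examinees, 10 items (not study data).
rng = np.random.default_rng(1)
pre = rng.integers(0, 2, size=(50, 10))
post = np.clip(pre + rng.integers(0, 2, size=(50, 10)), 0, 1)

chosen_by_variance = select_by_change_variance(pre, post, 3)
chosen_by_pretest = select_by_pretest_mean(pre, 3)
alpha = change_score_alpha(pre, post, chosen_by_variance)
```

In a cross-validation design like the one described, the selection functions would be run on the item analysis group and `change_score_alpha` on the responses of the other group.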
Testable Hypotheses

The specific hypotheses tested in this study were:

1. The mean change score reliability for items chosen by Method III (Saupe's correlation between change item and total change score) would be greater than the mean reliability for item subsets chosen by any other item analysis method or by the control method of random selection.

2. Mean change score reliability for subsets of items chosen by Method IV (triserial correlation) would be greater than the mean reliability for the subsets of items chosen by either the response frequency methods or by random selection.

3. Mean change score reliability for the subsets of items selected by Method I (using change item variance) would be greater than the mean reliability of item subsets chosen by pretest response frequency or by random selection.

4. Mean change score reliability for the subsets of items selected by Method II (using pretest response frequency) would be greater than the mean reliability of subsets of randomly selected items.

Statistical Analysis

Two procedures were used to compare the change score reliability coefficients computed on the cross-validation sample. Using the first method, all reliability estimates for each subset of items were computed on the whole cross-validation sample of 131 students. This was to provide the best overall descriptive comparison.

To test the statistical significance of the observed differences in the change score reliability coefficients, the cross-validation group was divided into smaller independent samples, one sample for each item analysis method. These samples were then randomly assigned to the item analysis procedures and change score reliabilities were computed. Fisher r-to-Z transformations of the reliability coefficients were used, and the values were analyzed with a two-way analysis of variance.
(One main effect was method of item analysis; the other was number of items in the subset.) Tukey's test for an honestly significant difference (Kirk, 1968, p. 88) was used to test the significance of the differences between the means of the reliability estimates.

Summary

Test-retest data were obtained from a sample of 263 Michigan State University students in their freshman and junior years on an attitude survey called the Inventory of Beliefs. (These data had been collected as part of a longitudinal study conducted by Lehmann and Dressel from 1958 to 1962.)

The data were scored by two different methods: a zero-one scoring method and a one-to-four scaling method. Item change scores were computed for all 120 items on the questionnaire. Data from half of the sample were subjected to four different item analysis procedures and a control procedure of random selection. The item analysis procedures used for items scored zero-one were: Saupe's correlation index, triserial correlation, selection for large change variance, and selection on the basis of pretest response frequency. All of these same procedures were used for the data scored on a one-to-four scale, except for triserial correlation. Subsets of 15, 30, 60, and 90 items were chosen by each method.

Change score reliabilities for these subsets of items were computed using change score data from the cross-validation group. A change score reliability version of coefficient alpha was used. A two-way analysis of variance and a Tukey post hoc comparison test were used to test for differences in change score reliability for the items chosen by different methods.
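The analysis summarized above, Fisher r-to-Z transformation followed by a two-way analysis of variance with one observation per cell, can be sketched as follows. The function name and the 3-by-3 table of reliabilities are hypothetical, and the sketch computes only the ordinary F ratios; it does not apply the conservative degrees of freedom or the nonadditivity test.

```python
import numpy as np

def two_way_anova_single_obs(table):
    """Two-way ANOVA with one observation per cell on Fisher-transformed
    reliabilities. Rows are item analysis methods, columns are subset sizes."""
    z = np.arctanh(np.asarray(table, dtype=float))   # Fisher r-to-Z
    n_rows, n_cols = z.shape
    grand = z.mean()

    ss_rows = n_cols * np.sum((z.mean(axis=1) - grand) ** 2)
    ss_cols = n_rows * np.sum((z.mean(axis=0) - grand) ** 2)
    ss_total = np.sum((z - grand) ** 2)
    ss_resid = ss_total - ss_rows - ss_cols          # error term (no replication)

    df_rows, df_cols = n_rows - 1, n_cols - 1
    df_resid = df_rows * df_cols
    ms_resid = ss_resid / df_resid
    return {
        "ss_rows": ss_rows,
        "ss_cols": ss_cols,
        "ss_resid": ss_resid,
        "df": (df_rows, df_cols, df_resid),
        "f_rows": (ss_rows / df_rows) / ms_resid,
        "f_cols": (ss_cols / df_cols) / ms_resid,
    }

# Hypothetical reliabilities: three methods by three subset sizes.
reliabilities = [
    [0.40, 0.55, 0.70],
    [0.50, 0.62, 0.78],
    [0.30, 0.45, 0.66],
]
result = two_way_anova_single_obs(reliabilities)
```

With a single observation per cell the residual carries any method-by-size interaction, which is why a separate check for nonadditivity is needed before the F ratios are interpreted.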
CHAPTER IV

RESULTS

Results for One-to-Four Scoring

When the items of the Inventory of Beliefs were scored on a one-to-four scale, three methods of change item analysis and a control method of random selection were employed to select subsets of 15, 30, 60, and 90 items. The three methods of change item analysis were: selection on pretest response frequency, selection on item change score variance, and Saupe's correlation between change item score and total change score. Detailed results of the item analyses are presented in the Appendix.

After the subsets of items had been selected, using data from the 132 students in the item analysis group, the change score reliability for each item subset was computed using the item responses of the 131 students in the cross-validation group. These reliability coefficients are presented in Table 4.1.

From the results presented in Table 4.1, it is apparent that Saupe's method of change item analysis consistently resulted in more reliable subsets of items than did either of the other two item analysis methods or the control method of random selection. There was little difference between the reliability coefficients of item subsets chosen by the two response frequency methods (Method I and Method II); however, both of these methods resulted in higher reliability of change scores than did the control method for subsets of 15, 30, 60, and 90 items.

TABLE 4.1.--Change score reliability coefficients computed for the total cross-validation sample using the one-to-four scoring system.

                                      Number of Items
Item Analysis Method               15     30     60     90
Method I (Change Variance)        .50    .61    .75    .83
Method II (Pretest Frequency)     .50    .65    .78    .83
Method III (Saupe's r_dD)         .63    .70    .80    .85
Method IV (Random)                .30    .49    .70    .80

Another point that should be noted from the data presented in Table 4.
1 is that the differences between reliability coefficients were greater when fewer items were selected from the original pool. At the 90-item level the reliability values ranged only from .85 for Method III (Saupe's) to .80 for the control. At the 15-item level, however, the range was from .63 for Saupe's method to .30 for the control.

To test the statistical significance of the differences between change score reliability estimates obtained for item subsets chosen by the different methods, the cross-validation sample was divided into four random subsamples with 32 students in each group. Each of these samples was then randomly assigned to a different item analysis method. The reliability coefficients for 15, 30, 60, and 90 items chosen by a method were then computed using the data from the small group which had been assigned to it. Thus reliability estimates obtained under different item analysis methods were calculated for independent samples to meet the assumptions of the analysis of variance model. These change score reliability coefficients are reported in Table 4.2.

TABLE 4.2.--Change score reliability coefficients computed for independent cross-validation samples using the one-to-four scoring system.

                                                 Number of Items
Item Analysis Method                          15     30     60     90
Method I (Change Variance), Sample 1         .48    .61    .75    .84
Method II (Pretest Frequency), Sample 2      .56    .67    .76    .80
Method III (Saupe's r_dD), Sample 3          .76    .80    .85    .89
Method IV (Random), Sample 4                 .36    .42    .64    .76

Fisher r-to-Z transformations of the values in Table 4.2 were used as the dependent variables in a two-way analysis of variance (fixed effects model) with one observation per cell (Winer, 1962, p. 217). In this analysis, item analysis method was one independent factor with four levels; number of items was the second factor with four repeated measures on each sample.
Because there was only one replication per cell, a Tukey one-degree-of-freedom test for nonadditivity (Winer, 1962, p. 218) was conducted to test for the confounding effects of an interaction in the error term prior to running the two-way ANOVA. No significant interaction effect was detected at the alpha level of .05.

TABLE 4.3.--Two-way analysis of variance for the effects of item analysis method and number of items on change score reliability (with the one-to-four scoring system).

Source of Variance        Sums of Squares    d.f.    M.S.    F Ratio
Item Analysis Method           .622            3     .207    51.75**
Number of Items                .742            3     .247    61.75**
Residual                       .039            9     .004
Total                         1.403           15

**Significant at alpha = .01.

Results of the analysis of variance are presented in Table 4.3. The main effect for number of items was significant at the alpha level of .01, using a conservative F test with 1 and 3 degrees of freedom. The main effect for item analysis method was also significant at the alpha level of .01, using an F test with 3 and 3 degrees of freedom (Greenhouse and Geisser, 1959).

Testing Hypotheses for One-to-Four Scoring

The hypotheses tested were:

1. The mean change score reliability for item subsets chosen by Method III (Saupe's $r_{dD}$) would be greater than the mean reliabilities for subsets of items chosen by any other item analysis method or by the control method of random selection.

2.
Mean change score reliability for subsets of items chosen by Method I, using change item variance, would be greater than the mean reliability of item subsets chosen by pretest response frequency or by random selection.

3. Mean change score reliability for the subsets of items selected by Method II, using pretest response frequency, would be greater than the mean reliability of subsets of randomly selected items.

To test these hypotheses, post hoc comparisons were made to determine the significance of the differences between the mean reliability values obtained under the different item analysis methods. A multiple comparison test for making a series of pairwise comparisons, developed by Tukey, was employed (Kirk, 1968, p. 88). An HSD (honestly significant difference) value was computed in accordance with the formula:

$$ HSD = q_{\alpha,\gamma} \sqrt{\frac{MS_{error}}{n}} \tag{13} $$

where n is the number of levels or treatments, q is a value obtained from the tabled distribution of the studentized range statistic, and $\gamma$ is the number of degrees of freedom associated with the error term.

Differences between the mean reliabilities for item subsets selected by the various methods of item analysis are presented in Table 4.4.
(Reliability estimates were converted to Fisher r-to-Z transformations for testing.)

As Table 4.4 shows, the first hypothesis is supported. Change score reliability for subsets of items chosen by Saupe's method is significantly greater than that of subsets of items selected by any other method.

The first part of the second hypothesis is not supported. The reliability for sets of change items selected for their variance is not better than reliability for items chosen on the basis of pretest response frequency; it is, however, significantly greater than the reliability of the randomly selected subsets of items.

TABLE 4.4.--Differences between reliability estimates for items chosen by different item analysis methods (one-to-four scoring).

                       Change       Pretest      Saupe
                       Variance     Frequency    r_dD
Change Variance                       .018        .328*
Pretest Frequency                                 .310*
Saupe r_dD
Random                  .226*         .244*       .554**

*Significant at alpha = .05, HSD = .218.
**Significant at alpha = .01, HSD = .389.

The third hypothesis is also supported by the data. Items can be chosen on the basis of pretest response frequency which have higher change score reliability than an equal number of items randomly chosen.
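The pairwise comparison logic can be illustrated with a short sketch of Formula (13). The function names and all input values below are hypothetical; in practice q would be read from a table of the studentized range statistic for the chosen alpha level and error degrees of freedom, and n is used as defined for Formula (13).

```python
import math

def tukey_hsd(q, ms_error, n):
    """Honestly significant difference: q * sqrt(MS_error / n)."""
    return q * math.sqrt(ms_error / n)

def significant_pairs(means, hsd):
    """Return the pairs of labels whose absolute mean difference exceeds the HSD."""
    labels = list(means)
    return [
        (a, b)
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
        if abs(means[a] - means[b]) > hsd
    ]

# Hypothetical inputs (not values from the study).
hsd = tukey_hsd(q=4.0, ms_error=0.01, n=4)
means = {"A": 0.90, "B": 0.85, "C": 0.60}   # mean Fisher Z per method
pairs = significant_pairs(means, hsd)
```

Because a single critical difference is applied to every pair, the table of differences can be read directly against the HSD value, which is how the starred entries are identified.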
Results for Zero-One Scoring

When the items on the attitude survey were scored on a zero-one basis, it was possible to introduce a fifth method of item selection (triserial correlation) in addition to the three item analysis methods used for one-to-four scoring and random selection. The change score reliabilities for the 15, 30, 60, and 90 item subsets were computed using the responses of the entire cross-validation sample. These change score reliability estimates are presented in Table 4.5.

The differences between the methods of item analysis were much less pronounced under this scoring system. In general, however, all four methods of change item analysis consistently resulted in higher estimates of change score reliability than did the technique of random selection. The greatest differences, again, were observed when fewer items were selected from the original pool.

TABLE 4.5.--Change score reliability coefficients computed for the total cross-validation sample using the zero-one scoring method.

                                      Number of Items
Item Analysis Method               15     30     60     90
Method I (Change Variance)        .52    .56    .68    .72
Method II (Pretest Frequency)     .36    .52    .67    .72
Method III (Saupe's r_dD)         .33    .49    .68    .74
Method IV (Triserial r)           .37    .56    .68    .75
Method V (Random)                 .21    .48    .57    .67
To test the statistical significance of the differences between the change score reliability estimates of items selected by the various item analysis methods, the cross-validation group was divided into five independent samples with 26 students per sample. Each sample was then randomly assigned to be used for calculating the reliabilities for 15, 30, 60, and 90 items chosen by a particular item analysis method. The change score reliability coefficients obtained on these independent samples are reported in Table 4.6.

TABLE 4.6.--Change score reliability coefficients computed for independent cross-validation samples using the zero-one scoring system.

                                                 Number of Items
Item Analysis Method                          15     30     60     90
Method I (Change Variance), Sample 1         .50    .67    .72    .76
Method II (Pretest Frequency), Sample 2      .48    .60    .68    .75
Method III (Saupe's r_dD), Sample 3          .32    .44    .65    .74
Method IV (Triserial r), Sample 4            .45    .62    .71    .75
Method V (Random), Sample 5                  .11    .35    .60    .74
A two-way analysis of variance was performed using Fisher r-to-Z transformations of the reliability coefficients in Table 4.6. Prior to running the ANOVA, a Tukey one-degree-of-freedom test for nonadditivity was conducted to detect the significance of an interaction effect. No significant interaction effect was found at the alpha level of .05. Results of the two-way ANOVA are presented in Table 4.7.

TABLE 4.7.--Two-way analysis of variance for the effects of item analysis method and number of items on change score reliability (with zero-one scoring).

Source of Variance        Sums of Squares    d.f.    M.S.    F Ratio
Item Analysis Method           .219            4     .055     7.857*
Number of Items                .914            3     .305    43.570**
Residual                       .085           12     .007
Total                         1.218           19

*Significant at alpha = .05.
**Significant at alpha = .01.

Using the conservative F-test with 4 and 4 degrees of freedom, the main effect of item analysis method was significant at the alpha level of .05. The effect of number of items was significant at the alpha level of .01, using a conservative F-test with 1 and 4 degrees of freedom.

Testing Hypotheses for Zero-One Scoring

When the items were scored on a zero-one system there were four hypotheses of interest.

1. The mean change score reliability for items chosen by Method III (Saupe's correlation) would be greater than the mean reliability of item subsets chosen by any other item analysis method or by random selection.

2. Mean change score reliability for subsets of items chosen by Method IV (triserial correlation) would be greater than the mean reliability for the subsets of items chosen by the response frequency methods or by random selection.

3. Mean change score reliability for the subsets of items selected by Method I (using change variance) would be greater than the mean reliability of item subsets chosen by frequency of pretest responses or by random selection.
4. Mean change score reliability for the subsets of items selected by Method II (using pretest response frequency) would be greater than the mean reliability of subsets of randomly selected items.

A post hoc comparison test for differences between means was employed to test these hypotheses. Tukey's test for an honestly significant difference was used for making the pairwise comparisons between means. Results of these comparisons are reported in Table 4.8.

Three methods of item analysis were significantly better than random selection in producing reliable change scales. These were: selecting on change item variance, selecting on pretest response frequency, and triserial correlation. Saupe's correlational method did not produce results that were significantly better than random selection.

Only one significant difference was found between the item analysis methods themselves. Selection of items on the basis of change variance was found to yield higher mean change reliability than selection on the basis of Saupe's $r_{dD}$.

TABLE 4.8.--Differences between mean change score reliabilities for items chosen by different methods. (Scores are Fisher r-to-Z transforms.)

                       Change       Pretest      Saupe      Triserial
                       Variance     Frequency    r_dD       r
Change Variance
Pretest Frequency       .057
Saupe's r_dD            .178         .121
Triserial r             .041        -.016        -.137
Random                  .285*        .228*        .125       .244*

*Significant at alpha = .05, HSD = .201.
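As a rough worked check on the comparisons with random selection, the sketch below takes the reliabilities as read from Table 4.6, averages their Fisher Z transforms for each method, and compares each method's advantage over random selection with the HSD of .201 reported for Table 4.8. The variable names are hypothetical, and the inputs are the two-decimal table entries, so the differences agree with Table 4.8 only approximately.

```python
import numpy as np

# Change score reliabilities as read from Table 4.6
# (subsets of 15, 30, 60, and 90 items).
table_4_6 = {
    "Change Variance":   [0.50, 0.67, 0.72, 0.76],
    "Pretest Frequency": [0.48, 0.60, 0.68, 0.75],
    "Saupe r_dD":        [0.32, 0.44, 0.65, 0.74],
    "Triserial r":       [0.45, 0.62, 0.71, 0.75],
    "Random":            [0.11, 0.35, 0.60, 0.74],
}
HSD_05 = 0.201  # critical difference reported with Table 4.8

# Mean Fisher Z per method, then each method's gain over random selection.
mean_z = {m: float(np.arctanh(np.array(r)).mean()) for m, r in table_4_6.items()}
gain_over_random = {
    m: mean_z[m] - mean_z["Random"] for m in table_4_6 if m != "Random"
}
beats_random = {m: gain > HSD_05 for m, gain in gain_over_random.items()}
```

Under this recomputation the change variance, pretest frequency, and triserial methods exceed the critical difference and Saupe's index does not, which is the same pattern as the bottom row of Table 4.8.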
Thus the first hypothesis is not supported. Saupe's method of change item analysis is not superior to other item analysis methods, nor is it better than random selection, in choosing items for reliable change scales.

The second hypothesis is partially supported. Triserial correlation is better than random selection, but is not superior to response frequency methods for selecting change items.

The third hypothesis is also only partially supported: using change item variance as an index for item selection is better than random selection, but this method is not significantly better than selecting items on the basis of pretest response frequency.

The fourth hypothesis is upheld. Items can be chosen on the basis of pretest response frequency which have significantly higher change score reliabilities than items which are randomly chosen.

Summary

Results of the two-way analysis of variance and Tukey post hoc comparisons showed that for the one-to-four item scoring system:

1. Saupe's $r_{dD}$ was superior to random selection and to both selection for change score variance and selection on pretest response frequency.
2. Both pretest response frequency and the change variance methods were better than random selection for choosing reliable change item subscales.

3. Selection on change item variance was no better than selection on pretest response frequency in providing reliable change scales.

When the items were scored dichotomously, the results showed that:

1. The methods of triserial correlation, selection for change variance, and selection on pretest response frequency were all superior to random selection of items for change scales.

2. There were no significant differences between these successful methods of item analysis.

3. Saupe's rdD was not significantly better than random selection of items for measuring change.

CHAPTER V

SUMMARY AND CONCLUSIONS

Summary

In recent years researchers have become increasingly interested in the problems of measuring change. Low change score reliability has presented a particularly challenging problem to researchers in this area. Bereiter (1963) suggested that item analysis techniques could be applied to change items in an attempt to improve change score reliability.
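As background to the reliability problem mentioned above, the classical test theory expression for the reliability of a difference score shows why change scores tend to be unreliable when pretest and posttest are highly correlated. The expression is not stated in this chapter; the sketch below assumes uncorrelated errors of measurement, and the function name and numerical values are hypothetical.

```python
def difference_score_reliability(sd_pre, sd_post, rel_pre, rel_post, r_pre_post):
    """Classical reliability of D = post - pre, assuming uncorrelated errors.

    sd_pre, sd_post   : observed standard deviations of the two testings
    rel_pre, rel_post : reliabilities of the two testings
    r_pre_post        : observed pretest-posttest correlation
    """
    covariance = r_pre_post * sd_pre * sd_post
    # True-score variance of the difference.
    true_variance = sd_pre**2 * rel_pre + sd_post**2 * rel_post - 2 * covariance
    # Observed variance of the difference.
    observed_variance = sd_pre**2 + sd_post**2 - 2 * covariance
    if observed_variance <= 0:
        raise ValueError("change scores have no observed variance")
    return true_variance / observed_variance


if __name__ == "__main__":
    # Hypothetical values: two tests with reliability .80 that correlate .60.
    print(round(difference_score_reliability(1.0, 1.0, 0.80, 0.80, 0.60), 2))
```

With these hypothetical inputs the difference score is markedly less reliable than either test taken alone, and the value falls further as the pretest-posttest correlation approaches the test reliabilities.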
A review of the literature revealed that several techniques for change item analysis were available; however, there was a dearth of empirical research to demonstrate the effectiveness of these procedures or to compare their ability to increase change score reliability. The four methods of item analysis suitable for change items were: selection on the basis of change item score variance; selection on the basis of pretest response frequency; selection on Saupe's correlation between change item score and total score; and selection on triserial correlation. (The latter method was restricted to the case where items were dichotomously scored on each occasion.)

An empirical study was undertaken to determine whether these methods of change item analysis could lead to the selection of more reliable subsets of items than could be obtained by random selection. Comparisons between the various methods were also made.

The sample used for item analysis and cross-validation was a group of 263 students at Michigan State University who had been tested on the Inventory of Beliefs as freshmen in 1958, and who were retested on this attitude survey in 1961. Half of this sample were assigned to an initial item analysis group. On the basis of their responses the four item analysis procedures were carried out and subsets of 15, 30, 60, and 90 items were selected by each procedure from the original pool of 120 items. In addition, a control procedure of random selection was also used to choose item subsets.

The items selected by the item analysis procedures were then scored for the cross-validation group. Change score reliabilities were calculated from these responses in two ways. First, all reliability estimates were computed for the entire group of 131 students. Secondly, the cross-validation group was divided into smaller independent samples and reliabilities for item subsets chosen
by different methods were computed on independent samples. Hypotheses were tested by using a two-way analysis of variance with post hoc comparisons.

The results of the analysis showed that when the items were scored on a one-to-four scale, the three methods of item analysis used resulted in significantly higher change score reliability than did random selection. Saupe's rdD was the most successful in producing high change score reliability. Selection on the basis of pretest frequency and change score variance were equally effective.

When the items were scored on a zero-one basis, three methods of item analysis resulted in greater change score reliability than did random selection: selection on change variance, selection on pretest response frequency, and triserial correlation. One method (Saupe's rdD) proved to be no better than random selection. No significant differences were found between the three methods which were successful in improving change score reliability; they were equally effective.

Conclusions

From this comparative study of change item analysis techniques, several conclusions can be drawn.

1. It is possible to produce more reliable instruments for measuring individual change if items are selected through
the systematic item analysis procedures suggested in this study.

2. When a wide range of item responses is permitted (such as when items are scored on a one-to-four or one-to-five scale), the methods recommended for use are Saupe's rdD, selection on the basis of large change variance, and selection on the basis of pretest response frequency.

3. When the range of item responses is restricted to a dichotomy, the recommended methods are selection on the basis of large change variance, selection on pretest response frequency, and triserial correlation.

Discussion

It should be noted that the differences between the item analysis methods and between item analysis methods and random selection were more dramatic when a smaller number of items was chosen. This probably indicates that change item analysis techniques are most useful when a small proportion of items is chosen from a larger original pool. This appears, however, to contradict the fact that no significant interaction was found between number of items selected and the method of selection. In view of this it is likely that a Type II error occurred in the Tukey one-degree-of-freedom test for interaction.
Even if this were the case, the results of the analysis of variance can be accepted with confidence, since the presence of an undetected interaction would have resulted in an overestimate of error variance and, hence, a more conservative statistical test for the main effects of number of items and method of selection.

Another point that should not be overlooked is that, statistically speaking, most of the items on this scale functioned effectively. Fewer than ten items were found which had negative triserial r or rdD values. It is not unreasonable to speculate that if there had been a higher percentage of poor items on the test, the item analysis techniques might have worked even better, and/or differences between the techniques might have been more apparent.

Theoretically, it is not surprising that differences between change item analysis techniques seemed to be greater when the one-to-four scoring scheme was used. When there was a greater possible range of response, there was a greater possible range for the variances of the change scores and, consequently, a greater possible range for change-item covariances. Thus a better distinction could be made between "good" and "bad" items. Under the dichotomous scoring system, the items tended to appear more similar with regard to their change variances and intercorrelations.
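The point about the possible range of change score variances can be illustrated with simple arithmetic. The sketch below assumes a change item score is the posttest item score minus the pretest item score and uses the fact that a bounded variable has its largest variance when the group splits evenly between the two extremes; the function names are hypothetical.

```python
from itertools import product


def change_score_values(min_score, max_score):
    """All possible change item scores (posttest minus pretest)."""
    scores = range(min_score, max_score + 1)
    return sorted({post - pre for pre, post in product(scores, repeat=2)})


def max_change_variance(min_score, max_score):
    """Largest possible variance of a change item score.

    Reached when half the group changes by the largest negative amount
    and half by the largest positive amount.
    """
    values = change_score_values(min_score, max_score)
    half_range = (values[-1] - values[0]) / 2
    return half_range**2


if __name__ == "__main__":
    for label, low, high in [("one-to-four", 1, 4), ("zero-one", 0, 1)]:
        print(label, change_score_values(low, high), max_change_variance(low, high))
```

On the one-to-four scale a change item can take seven values, from -3 to +3, whereas a dichotomous item allows only -1, 0, and +1, so the ceiling on change item variance is much lower under dichotomous scoring.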
From a practitioner's viewpoint, several of the findings of this study can be applied to the area of constructing instruments to measure change. First, it appears that change item analysis can be a profitable approach to solving the change score reliability problem. Certainly researchers should consider using these techniques when constructing new instruments to measure change or when forced to shorten an already existing scale. Second, the use of a multiple-response format for items seems to allow a more sensitive observation of change and makes the selection of a method of change item analysis an important consideration. A third point, having great significance for the longitudinal researcher, is that considerable time and expense might be saved by selecting items on the basis of pretest response alone. It should be remembered, however, that this can only be done when the direction of change can be predicted in advance.

Implications for Future Research

This study has been somewhat of a "pioneer exploration" into the area of change item analysis. It has revealed, nonetheless, that empirical research on techniques for selecting items to measure
change can be a profitable venture yielding useful information for test construction.

Only four possible methods of change item analysis were compared in this study. As other methods are developed, they should be systematically compared with these techniques. Two possible new methods which could be considered are factor analysis and multiserial correlation. Change item scores could be subjected to factor analysis and items chosen on the basis of their factor loadings, just as regular items are often selected. This requires, however, a much larger sample size than was employed in this study. Lange (1969) found that factor loadings for 40 items were unstable with a sample size less than 400. Another method which could prove useful would be the general multiserial correlation developed by Jaspen (1946). The triserial r employed in this study was a simplified version of Jaspen's multiserial correlation formula. The general formula could be expanded to render correlation coefficients for total change score and change item responses scored on a one-to-four or a one-to-five scale.

Also, before the findings of this study can be generally accepted, replication is needed using other populations and other instruments.

An actual study of the content of the change items themselves was not within the scope of this investigation.
It remains a very important, but, as yet, unexplored area. Cox (1965) warned test constructors to remember that item selection on statistical criteria alone might change the nature of the test by eliminating items designed to measure a specific objective. He proposed a method to use in conjunction with statistical item analysis to insure that items were maintained in the test to cover all vital objectives of the evaluation. Such selection of items to measure objectives could be practiced equally well with change items now that feasible, statistical item selection techniques are available.

A final implication of this study goes beyond the area of constructing instruments to the broader area of measuring change. This study has demonstrated that researchers need no longer fear to undertake a study of individual change because of the insurmountable problem of low change score reliability. Through item analysis, instruments can be developed which will provide reliable measures of individual change.

BIBLIOGRAPHY

American Council on Education Committee on Measurement and Evaluation. Instructor's manual for the inventory of beliefs. Washington, D.C., 1953.

Bereiter, Carl M. Some persisting dilemmas in the measurement of change. Chapter 1 in Harris, Chester W. (Ed.) Problems in measuring change. Madison, Wisconsin: University of Wisconsin Press, 1963.

Cox, Richard C. Item selection techniques and evaluation of instructional objectives. Journal of Educational Measurement, 1965, 2, 181-185.

Cronbach, Lee J. Coefficient alpha and the internal structure of tests. Psychometrika, 1951, 16, 297-334.

Dressel, Paul L., and Mayhew, Lewis B. General education explorations in evaluation. Washington, D.C.: American Council on Education, 1954.

Greenhouse, S. W., and Geisser, S. On methods in the analysis of profile data. Psychometrika, 1959, 24, 92-112.

Gruber, H. E., and Weitman, M. Item analysis and the measurement of change.
Journal of Educational Research, 1962, 6, 287-289.

Gulliksen, Harold. Theory of mental tests. New York: Wiley, 1950.

Horst, Paul. Multivariate models for evaluating change. Chapter 6 in Harris, Chester W. (Ed.) Problems in measuring change. Madison, Wisconsin: University of Wisconsin Press, 1963.

Horst, Paul. Psychological measurement and prediction. Belmont, California: Wadsworth, 1966.

Jaspen, Nathan. Serial correlation. Psychometrika, 1946, 11, 23-30.

Jenkins, William L. Triserial r--a neglected statistic. Journal of Applied Psychology, 1956, 40, 63-64.

Kirk, Roger E. Experimental design procedures for the behavioral sciences. Belmont, California: Wadsworth, 1968.

Lange, Allan L. An empirical study of sampling error in factor analysis. Unpublished doctoral dissertation, Michigan State University, 1969.

Lehmann, Irvin J., and Dressel, Paul L. Changes in critical thinking, attitudes, and values associated with college attendance. Final Report of Cooperative Research Project No. 1646. East Lansing: Michigan State University, 1963.

Lord, Frederick M. The utilization of unreliable difference scores. Journal of Educational Psychology, 1958, 49, 150-152.

Lord, Frederick M. Elementary models for measuring change. Chapter 2 in Harris, Chester W. (Ed.) Problems in measuring change.
Madison, Wisconsin: University of Wisconsin Press, 1963.

Lord, Frederick M., and Novick, Melvin R. Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

Magnusson, David. Test theory. Reading, Mass.: Addison-Wesley, 1967.

Saupe, Joe L. Technical considerations in measurement. Appendix in Dressel, Paul L. (Ed.) Evaluation in higher education. Boston, Mass.: Houghton Mifflin, 1961.

Saupe, Joe L. Selecting items to measure change. Journal of Educational Measurement, 1966, 3, 223-228.

Shoemaker, David M. Note on the attenuating effect of zero-variance items on KR-20. Journal of Educational Measurement, 1970, 6, 255-256.

Tucker, Ledyard; Damarin, Fred; and Messick, Samuel. A base-free measure of change. Psychometrika, 1966, 31, 457-473.

Webster, Harold, and Bereiter, Carl. The reliability of changes measured by mental test scores. Chapter 3 in Harris, Chester W. (Ed.) Problems in measuring change. Madison, Wisconsin: University of Wisconsin Press, 1963.

Winer, B. J. Statistical principles in experimental design. New York: McGraw-Hill, 1962.
APPENDIX

The 120 items on the Inventory of Beliefs, Form I, used in this study are listed here. Following the listing of items, Tables A.1 and A.2 are presented to indicate the subscales in which each item appeared. Percentages of overlap of items between the scales selected by different item analysis methods are reported in Tables A.3 through A.10.

1. If you want a thing done right, you have to do it yourself.
2. There are times when a father, as head of the family, must tell the other family members what they can and cannot do.
3. Lowering tariffs to admit more foreign goods into this country lowers our standard of living.
4. Literature should not question the basic moral concepts of society.
5. Reviewers and critics of art, music and literature decide what they like and then force their tastes on the public.
6. Why study the past, when there are so many problems of the present to be solved.
7. Business men and manufacturers are more important to society than artists or musicians.
8. There is little chance for a person to advance in business or industry unless he knows the right people.
9. Man has an inherent guide to right and wrong--his conscience.
10. The main thing about good music is lovely melody.
11. It is only natural and right for each person to think that his family is better than any other.
12. All objective data gathered by unbiased persons indicate that the world and universe are without order.
13. Any man can find a job if he really wants to work.
14. We are finding out today that liberals really are soft-headed, gullible, and potentially dangerous.
15. A man can learn as well by striking out on his own as he can by following the advice of others.
16. The predictions of economists about the future of business are no better than guesses.
17. Being a successful wife and mother is more a matter of instinct than of training.
18. A person often has to get mad in order to push others into action.
19. There is only one real standard in judging art works--each to his own taste.
20. Business enterprise, free from government interference, has given us our high standard of living.
21. Nobody can make a million dollars without hurting other people.
22. Anything we do for a good cause is justified.
23. Public resistance to modern art proves that there is something wrong with it.
24. Sending letters and telegrams to congressmen is mostly a waste of time.
25. Many social problems would be solved if we did not have so many immoral and inferior people.
26. Art which does not tell a human story is empty.
27. You can't do business on friendship: profits are profits; and good intentions are not evidence in a law court.
28. A person has troubles of his own; he can't afford to worry about other people.
29. Books and movies should start dealing with entertaining or uplifting themes instead of the present unpleasant, immoral, or tragic ones.
30. Children should be made to obey since you have to control them firmly during their formative years.
31. The minds of many youth are being poisoned by bad books.
32. Speak softly, but carry a big stick.
33. Ministers in churches should not preach about economic and political problems.
34. Each man is on his own in life and must determine his own destiny.
35. New machines should be taxed to support the workers they displace.
36. The successful merchant can't allow sentiment to affect his business decisions.
37. Ministers who preach socialistic ideas are a disgrace to the church.
38. Labor unions don't appreciate all the advantages which business and industries have given them.
39. It's only natural that a person should take advantage of every opportunity to promote his own welfare.
40. We should impose a strong censorship on the morality of books and movies.
41. The poor will always be with us.
42. A person who is incapable of real anger must also be lacking in moral conviction.
43. If we allow more immigrants into this country, we will lower our standard of culture.
44. People who live in the slums have no sense of respectability.
45. We acquire the highest form of freedom when our wishes conform to the will of society.
46. Modern paintings look like something dreamed up in a horrible nightmare.
47. Voting determines whether or not a country is democratic.
48. The government is more interested in winning elections than in the welfare of the people.
49. Feeble-minded people should be sterilized.
50. In our society, a person's first duty is to protect from harm himself and those dear to him.
51. Those who can, do; those who can't, teach.
52. The best government is one which governs least.
53. History shows that every great nation was destroyed when its people became soft and its morals lax.
54. Philosophers on the whole act as if they were superior to ordinary people.
55. A woman who is a wife and mother should not try to work outside the home.
56. We would be better off if people would talk less and work more.
57. In some elections there is not much point in voting because the outcome is fairly certain.
58. The old masters were the only artists who really knew how to draw and paint.
59. Most intellectuals would be lost if they had to make a living in the realistic world of business.
60. You cannot lead a truly happy life without strong moral and religious convictions.
61. If we didn't have strict immigration laws, our country would be flooded with foreigners.
62. When things seem black, a person should not complain, for it may be God's will.
63. Miracles have always taken place whenever the need for them has been great enough.
64. Science is infringing upon religion when it attempts to delve into the origin of life itself.
65. A person has to stand up for his rights or people will take advantage of him.
66. A lot of teachers, these days, have radical ideas which need to be carefully watched.
67. Now that America is the leading country in the world, it's only natural that other countries should try to be like us.
68. Most Negroes would become overbearing and disagreeable if not kept in their place.
69. Foreign films emphasize sex more than American films do.
70. Our rising divorce rate is a sign that we should return to the values which our grandparents held.
71. Army training will be good for most modern youth because of the strict discipline they will get.
72. When operas are sung in this country they ought to be translated into English.
73. People who say they're religious but don't go to church are just hypocrites.
74. What the country needs, more than laws or politics, is a few fearless and devoted leaders in whom the people can have faith.
75. Pride in craftsmanship and in doing an honest day's work is a rare thing these days.
76. The United States may not have had much experience in international dealings but it is the only nation to which the world can turn for leadership.
77. In practical situations, theory is of very little help.
78. No task is too great or too difficult when we know that God is on our side.
79. A sexual pervert is an insult to humanity and should be punished severely.
80. A lot of science is just using big words to describe things which many people already know through common sense.
81. Manual labor and unskilled jobs seem to fit the Negro mentality and ability better than more skilled or responsible work.
82. A person gets what's coming to him in this life if he doesn't believe in God.
83. Public officials may try to be honest but they are caught in a web of influence which tends to corrupt them.
84. Science makes progress only when it attempts to solve urgent practical problems.
85. Most things in life are governed by forces over which we have no control.
86. Young people today are in general more immoral and irresponsible than young people of previous generations.
87. Americans may tend to be materialistic, but at least they aren't cynical and decadent like most Europeans.
88. The many different kinds of children in school these days force teachers to make a lot of rules and regulations so that things will run smoothly.
89. Jews will marry out of their own religious group whenever they have the chance.
90. The worst danger to real Americanism during the last 50 years has come from foreign ideas and agitators.
91. Europeans criticise the United States for its materialism but such criticism is only to cover up their realization that American culture is far superior to their own.
92. The scientist that really counts is the one who turns theories into practical use.
93. No one can really feel safe when scientists continue to explore whatever they wish without any social or moral restraint.
94. Nudist colonies are a threat to the moral life of a nation.
95. One trouble with Jewish businessmen is that they stick together and prevent other people from having a fair chance in competition.
96. No world organization should have the right to tell Americans what they can or cannot do.
97. There is a source of knowledge that is not dependent upon observation.
98. Despite the material advantages of today, family life now is not as wholesome as it used to be.
99. The United States doesn't have to depend on the rest of the world in order to be strong and self-sufficient.
100. Foreigners usually have peculiar and annoying habits.
101. Parents know as much about how to teach children as public school teachers.
102. The best assurance of peace is for the United States to have the strongest army, navy, air force, and the most atom bombs.
103. Some day machinery will do nearly all of man's work, and we can live in leisure.
104. There are too many people in this world who do nothing but think about the opposite sex.
105. Modern people are superficial and tend to lack the finer qualities of manhood and womanhood.
106. Members of religious sects who refuse to salute the flag should be punished for their lack of patriotism.
107. Political parties are run by insiders who are not concerned with the public welfare.
108. As young people grow up they ought to get over their radical ideas.
109. Negroes have their rights, but it is best to keep them in their own districts and schools and to prevent too much contact with whites.
110. The twentieth century has not had leaders with the vision and capacity of the founders of this country.
111. There are a lot of things in this world that will never be explained by science.
112. Sexual relations between brother and sister are contrary to natural law.
113. There may be a few exceptions, but in general Jews are pretty much alike.
114. The world will get so bad that some of these times God will destroy it.
115. Children should learn to respect and obey their teachers.
116. Other countries don't appreciate as much as they should all the help that America has given them.
117. We would be better off if there were fewer psychoanalysts probing and delving into the human mind.
118. American free enterprise is the greatest bulwark of democracy.
119. If a person is honest, works hard, and trusts in God, he will reap material as well as spiritual rewards.
120. One will learn more in the school of hard knocks than he ever can from a textbook.

TABLE A.1.--Listing of subscales in which each change item first appeared after item analysis with one-to-four scoring system.
                      Item Analysis Method*

Item      I              II             III            IV
Number    Change         Pretest        Saupe r        Random
          Variance

  1       --             60             90             60
  2       --             15             --             15
  3       15             [illegible]    90             90
  4       30             90             --             30
  5       15             90             30             --
  6       30             --             90             90
  7       --             --             60             60
  8       --             --             --             30
  9       30             30             --             15
 10       90             60             60             --
 11       60             60             60             90
 12       --             --             --             30
 13       90             15             --             60
 14       90             --             30             90
 15       60             60             60             60
 16       --             90             --             30
 17       60             90             --             15
 18       60             90             --             60

*The numbers in the table indicate the scale length when the item first appeared. If an item is included in a scale of 15 items, it is obviously included in all scales using the same procedure which are of greater length.

TABLE A.1.--Continued.

                      Item Analysis Method

Item      I              II             III            IV
Number    Change         Pretest        Saupe r        Random
          Variance

 19       30             15             30             90
 20       30             60             30             --
 21       60             90             90             --
 22       60             90             --             90
 23       --             --             90             60
 24       60             --             90             30
 25       15             60             15             --
 26       60             90             --             60
 27       15             60             60             90
 28       --             --             90             90
 29       30             30             60             --
 30       60             15             90             90
 31       30             30             60             60
 32       30             60             --             30
 33       15             90             --             15
 34       15             15             --             60
 35       90             --             30             --
 36       60             90             90             90
 37       90             90             30             --
 38       15             15             --             90
 39       --             30             60             60
 40       30             60             90             30
 41       --             15             --             15