AN ORDERING THEORETIC ANALYSIS OF THE SATO CAUTION INDICES IN A MALAYSIAN CONTEXT

By

Ivan Douglas Filmer, Jr.

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology and Special Education

1992


ABSTRACT

AN ORDERING THEORETIC ANALYSIS OF THE SATO CAUTION INDICES IN A MALAYSIAN CONTEXT

By

Ivan Douglas Filmer, Jr.

The Sato caution indices are derived by investigating the observable patterns of students' responses on a test together with summary statistics. The procedure involves constructing a students-by-items matrix of the binary responses in which students are arranged from highest to lowest scoring and items are arranged in order of increasing difficulty. The indices, derived by a formula introduced by Sato, range from zero to values above 1 and indicate the extent to which items are aberrant.

In this study, a 40-item objective test was administered to 354 fifth form students in six Malaysian schools. The item and student caution indices were calculated first using the students-by-items matrix of the Sato model and then again after ordering the items with the probabilistic model of the ordering theory (Krus, 1975). A principal components factor analysis and an agglomerative hierarchical cluster analysis were conducted on the items to aid the ordering theoretic analysis.

The results of the study showed that in the Sato model, the ordering or arrangement of the items did not affect the calculation of the item caution indices but did affect the calculation of the student caution indices. Similarly, the ordering of the students affected the calculation of the item caution indices. The sample size of items and students had an effect on the magnitudes of the item and student caution indices derived. The arrangement of the items according to the ordering theoretic analysis correlated almost identically with the arrangement of items in the Sato model (r = .999), and identical item caution indices were produced. The item caution indices derived from the different group characteristics of item format, school location, students' gender, students' SES, and teachers' working experience were all not significantly different. There was also no significant interaction effect between the student caution indices derived from students of different SES and different school locations.


Dedicated to my wife, Voon Mooi, and my children, Andrew and Andrea.


ACKNOWLEDGEMENTS

This dissertation has been completed with the help of a great many people. First of all, I wish to thank the Malaysian Ministry of Education for awarding me a scholarship to aid me in my doctoral studies, the Assistant Director of the Malaysian Educational Planning and Research Division, Dr. Hanafi Mohamed Kamal, and Tuan Haji Jumali Kassan, the State Education Director of Selangor Darul Ehsan. Special thanks are extended to Puan Nik Faizah Nik Mustapha, the former Assistant Director of the Examinations Syndicate, for her genuine concerns;
Mr. Lim Chee Tong, the former Assistant Director of the Vocational and Technical Schools Division, without whose help this study would have been more difficult; and Puan Hajah Badiah Abdul Manan, the former Director of the Examinations Syndicate, for granting permission to obtain certain data from the Syndicate. I am also indebted to Datin Hajah Rapiah Tun Abdul Aziz, En. Abdul Rafor Ibrahim and Mr. A. Sivanesan, all senior officers of the Examinations Syndicate. I wish to express my appreciation also to my friend, Mr. Leslie Fredericks, the Head of the Examinations Unit at the Teachers' Training Division, for helping me at some of the crucial stages of my study. Without his help, this study would have taken a much longer time to complete.

I wish to express my gratitude to the members of my Dissertation Committee. I wish to thank Dr. William Mehrens, who served as the dissertation chairman of my committee, for his concern and guidance in the completion of this study. His friendship and willingness to offer suggestions are most appreciated. I also wish to thank Dr. Betsy Becker for her statistical expertise and suggestions in the data analysis, Dr. Norman Bell for his constructive suggestions as a member of the committee, and Dr. Frederick Ignatovich for his assistance during the initial stages of my study.

I especially appreciate the support and cooperation of all the pupils, teachers and school administrators who participated in this study. I will always remember their contributions of time and effort.

Finally and most of all, I wish to thank my wife, Voon Mooi, for her love, patience, understanding and support. I appreciate the many sacrifices made by her and my children, Andrew and Andrea, to make it all possible.


TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES

CHAPTER

I. STATEMENT OF THE PROBLEM
     Introduction
     Problem Statement
     The Purpose of the Study
     Significance of the Study
     Research Questions
     Overview

II. REVIEW OF LITERATURE
     Introduction
     The Sato Caution Index
     Assumptions of the S-P Model
     S-P Model Interpretations
     Precision of the Caution Index
     Advantages of the S-P Model
     Limitations of the S-P Model
     Controversies about the S-P Model
     The Ordering Theory
          Ordering Analysis and Dimensionality
          Reliability and Test Validity of the Ordering Procedure
          Factor Analysis and Ordering Analysis
     Summary

III. RESEARCH DESIGN AND PROCEDURE
     Introduction
     Development of the Test Instrument
     Selection of the Sample
     Procedures of Test Administration
     Data Analysis
     Summary

IV. ANALYSIS AND INTERPRETATION OF THE DATA
     Introduction
     Characteristics of the Sample
     Analysis of the Test Data
     The Ordering Theoretic Analysis
     Arranging the Items According to the Sato Model
     Answers to the Research Questions
          Research Question 1
          Research Question 2
          Research Question 3
     Summary

V. SUMMARY, CONCLUSIONS, IMPLICATIONS, AND RECOMMENDATIONS
     Summary of the Purpose and Procedures of the Study
     Discussion and Conclusions
          The Test Data
          The Sato Model
          The Ordering Theoretic Analysis
          Group Characteristics
               Item Format
               School Location
               Students' Gender
               Students' Socio-Economic Status
               Teachers' Working Experience
               Interaction between School Setting and Students' SES
     Implications
     Recommendations

APPENDICES
A. THE TABLE OF SPECIFICATIONS OF THE TEST
B. THE TEST INSTRUMENT
C. THE TEACHER'S QUESTIONNAIRE
D. DESCRIPTIVE STATISTICS OF THE ITEMS CHOSEN FOR THE TEST INSTRUMENT
E. DENDROGRAM OF THE AGGLOMERATIVE HIERARCHICAL CLUSTER ANALYSIS OF THE TEST ITEMS USING THE COMPLETE LINKAGE METHOD
F. THE ITEM CAUTION INDICES OF THE 40 ITEMS OF THE TEST INSTRUMENT AS DERIVED UNDER VARIOUS SAMPLES

BIBLIOGRAPHY


LIST OF FIGURES

2.1 A summary of the student characteristics as reflected by their achievement levels and their caution indices
2.2 A summary of the item characteristics as reflected by the item difficulty and the item caution index
2.3 Response patterns characteristic of prerequisite, logical equivalence, and logical independence relationships in percentages (reproduced from Wise, 1986, Figure 1, p.444)
Histogram frequency of 40 items on the test
Histogram frequency of the first 25 items on the test
Histogram frequency of the last 15 items on the test
An ordered students-by-items matrix (S-P table) for 15 students and 10 problems
A plot of the standardized difference in item caution indices (D) with its corresponding item number
Student SES-by-school location matrix of student caution index means


LIST OF TABLES

Information on the Subjects Used in the Study
National Examinations Statistics for the Biology Paper for 1986 to 1990
Distribution of Students by School Location, SES, and Gender
Distribution of Teachers by Teaching Experience, School Location, and Gender
Results of a Reliability Analysis on the Items in the Test Instrument
T-test of the P-values of the Composite of the National Test and the Test Instrument
A Distribution of the Test Items According to Prerequisite Requirements
Mean Student Caution Indices According to Sample Size and Number of Students at or above .50
Item Caution Indices of Six Discrepant Items
One-Way Analysis of Variance on the Differences in the Mean Item Caution Indices Derived from Teachers With Different Teaching Experience
Two-Way Analysis of Variance on the Effect of Students' SES and the School Location


CHAPTER I
STATEMENT OF THE PROBLEM

Introduction

It has been common practice in education to use a single total score on a test to report the academic achievement or ability of a student. However, Harnisch and Linn (1981) have pointed out that there are as many as 184,756 possible item response patterns that yield a score of 10 on a 20-item test. Therefore, a student's total test score, which is merely the number of items correct on a test, can often be misleading. Blixt and Dinero (1985) echo the view of many educational authorities when they point out, "There is no guarantee that any one total score on a test will give the same information about each of the examinees who take the test" (p. 239). As such, in order to assess student performance and student errors, teachers or administrators need more than a single total test score (Harnisch, 1983).

Presently, researchers have turned to examining more closely the item response patterns of students. The type of information obtained from such analyses is above and beyond the kind provided by the traditional method of scoring a test, which, as Birenbaum and Shaw (1985) explain, is the consideration only of the total number of correct responses. Researchers like Birenbaum and Tatsuoka (1982), and Tatsuoka and Tatsuoka (1983), have shown that the analysis of students' response patterns on a test would enable a teacher to prescribe individualized remediation to correct some of those students' misconceptions. Others, like Brown and Burton (1978) and Tatsuoka and Baillie (1982), have developed computer programs, such as "BUGGY" and "SIGNBUG", for diagnosing students' misconceptions of learning from tests.

However, the appropriate interpretation of any test depends on knowing the dimensions underlying the items and the correspondence between the items and those dimensions (Green, 1983). Studies have also shown that tests that were previously calibrated as unidimensional were subsequently found to be multidimensional (Reckase, 1985; Bock, Gibbons, & Muraki, 1986; Zimowski & Bock, 1987). If a test is multidimensional, it would be even more inaccurate to say that two students receiving the same score on the test possess the same ability or have attained the same level of academic achievement. This problem has stimulated research into different approaches to diagnosing atypical item response patterns on tests.

Studies have shown that a variety of factors contribute to atypical item response patterns. Prominent among these factors are differences in students' background experiences, students' exposure to different subject matter or school-to-school variability in content coverage and emphases, test anxiety, students' guessing, carelessness, cheating on the test, and students' attendance patterns.
As many as 20 different types of item response indices have been formulated to gauge the extent to which an individual's response pattern on a test is unusual. Harnisch and Linn (1981) have conveniently categorized these indices into two major groups. The first group, which they label "appropriateness" indices, is based on item response theory (IRT). The second is based directly on the observable pattern of right and wrong answers, and summary statistics. The most popular index in this latter group is the Sato caution index.

Problem Statement

The problem addressed in this study concerns the calculation of the Sato caution indices. These indices are derived on the assumption that the items on a test may be linearly arranged in terms of item difficulty. A students-by-items matrix is first constructed wherein students are arranged from highest scoring to lowest, and items are arranged from easiest to most difficult. As most tests are generally found to be multidimensional, this simple ordering may not be adequate. As such, the values of the caution indices derived in this way may not be accurate. Furthermore, when arranging the matrix in this manner, the ordering of marginal totals that have the same magnitude is resolved arbitrarily. That is to say, there is no organized or conceptual method of arranging the students who have the same score, or the items that have the same p-value. According to McArthur (1987), these arbitrary allocations contribute to the instability of the S (Student) and P (Problem) curves in the students-by-items matrix. Invariably, this instability affects the accuracy of the caution indices.

Harnisch (1983) has stated that a student caution index is group dependent because it indicates whether an individual student's pattern of responses is atypical relative to the responses of the whole group of students who took the test. No research work has been reported on the effects of group characteristics on the caution indices.

The Purpose of the Study

The purpose of this study is to address two main issues concerning the Sato caution index. The first issue is the effect of the dimensionality of the test on the calculation of the caution indices. The second issue is the type of group characteristics affecting the magnitudes of the item and student caution indices.

Concerning the first issue, the purpose is to apply the ordering theoretic analysis of Bart and Krus (1973) to the students' responses to a test instrument to determine the hierarchical pattern of items on the test. The test items will then be arranged according to the hierarchical pattern. The caution indices will then be calculated and compared with the caution indices obtained by using the Sato model, where the items are arranged from easiest to most difficult. It is the intention of this study to show the use of the ordering theory to resolve the arbitrary assignment of the marginals of the items with the same p-values. Regarding the type of group characteristics, this study will examine the characteristics of students' gender, students' socio-economic status, school location, and teachers' teaching experience.

Significance of the Study

The S-P model (Student-Problem model) is a conceptual method for identifying atypical student item responses and test items. Its intuitive appeal to educators lies in the fact that it can be quite easily interpreted. However, one of its important assumptions is the linearity of the items on the test.
This assumption is plausible if the test is unidimensional, but studies have shown that many tests are multidimensional. The use of the ordering theoretic analysis of Bart and Krus (1973) to reorder the test items provides a means of constructing logical item hierarchies. This arrangement should validate the Sato model. The other problem affecting the S-P model lies in the instability of the S and P curves due to the arbitrary assignment of marginal totals that are of the same numerical value. The ordering theoretic analysis would also resolve this arbitrary assignment of the marginal totals of the items, and serve to reduce the instability of the S and P curves. The use of the ordering theoretic analysis produces a conceptual diagram of the hierarchical ordering of the items on the test. This would enable the educator to map the conceptual understanding of her students and thus identify their misconceptions in learning. The analysis also provides the educator with a means to identify the prerequisites of the various topics being taught, and aids in planning alternative sequences of learning experiences to cater to the needs of the students.

Knowing the impact which certain group characteristics, like students' gender, school location, students' socio-economic status, and teachers' working experience, have on the caution indices would enable teachers and other concerned parties to correctly interpret the caution indices. This, in turn, would enable teachers to obtain more returns on the time and energy invested in a classroom test.

Research has shown that there exists a considerable gap between testing and instruction (Floden, Porter, Schmidt, & Freeman, 1980; Leinhardt & Seewald, 1981; Schmidt, 1983; Linn, 1983, 1990). It is hoped that this study will contribute towards narrowing this gap. There have also been various studies directed at integrating testing and instruction. Baker and Herman (1983) discussed the importance of "task structure" in integrating testing and instruction. They described "task structure" as a model of the skills that are expected of the learner. Birenbaum and Shaw (1985) provided an example of the use of a task specification chart (TSC) that integrated the content facets and the procedural steps of a specified task. They suggested that the TSC be used as a tool for designing a test and for interpreting its results.

Based on the cognitive theories of Piaget, Bruner, and Gagné, which hold that there exists a hierarchical structure in learning, there is a need to devise reliable ways in which these hierarchies may be identified and interpreted. This study, using the ordering theoretic analysis, provides a way to do this effectively. Because teachers spend as much as 12 per cent of class time on testing (Dorr-Bremme & Herman, 1986; cited in Switzer & Connell, 1990), it would be convenient and economical to employ class tests to determine these hierarchies. When these hierarchies are identified, it may be possible for teachers to identify which topics should be taught before others. In this way a closer link may be brought between testing and instruction. As Linn (1990) has aptly pointed out, "Improving the quality of classroom assessments can have a positive influence on the quality of learning" (p.425).

Research Questions

The students' responses on the test instrument were analyzed in relation to the following research questions:
1. Is there a difference between the item caution indices derived using the items ordered by the ordering theoretic analysis and the Sato item caution indices?

2. Are the derivations of the item caution indices affected by any of the following group characteristics: format of the test items, school location, students' gender, students' socio-economic status, and teachers' teaching experience?

3. Is there an interaction in the student caution indices between the students' socio-economic status and the school location?

Overview

It is generally accepted that a single test score on an achievement test cannot accurately measure a student's ability. Many studies have investigated students' atypical item response patterns in order to identify their misconceptions, and thereby to help bring about a closer link between testing and instruction. Several item response indices have been formulated to aid in identifying students' atypical response patterns. Among the more popular indices is the Sato caution index. The basic assumption made in using the Sato caution index is that the test being analyzed is unidimensional. As unidimensional achievement tests have subsequently been found to be multidimensional, the Sato caution index may not be accurately interpreted for all tests. Furthermore, the students-by-items matrix generated in order to calculate the caution indices employs the arbitrary assignment of marginal totals that have the same numerical value.

The purpose of this study is to use the ordering theoretic analysis of Bart and Krus (1973) to construct logical item hierarchies. This hierarchical arrangement of the items is then used in the students-by-items matrix. In this way the arbitrary assignment of those marginal totals that have the same values would be resolved. The effects of students' gender and socio-economic status, school location, and teachers' working experience on the values of the caution indices will also be examined. In doing so, it is hoped that this study will contribute towards bringing a closer link between testing and instruction.

CHAPTER II
REVIEW OF LITERATURE

Introduction

Hoge and Coladarci (1989) have stated what seems to be a widespread finding, particularly among school psychologists, educational researchers, and other professionals, that teachers are generally poor judges of the attributes of their students. This is because their perceptions are often subject to bias and error (Clark & Peterson, 1986). The increasing population of school-going children and its increasing heterogeneity are two factors that have made the public more aware of assessment in schools (Fuchs & Fuchs, 1986). This public concern has prompted related litigation (e.g., Debra P. vs. Turlington, 1981) and laws (e.g., the PL 94-142 mandates) (both cited in Mehrens & Lehmann, 1987) to bring about a closer link between assessment and instruction.

Presently, commercially prepared standardized tests provide the most efficient and economical method of assessing a large number of students. Egan and Archer (1985) explain that, "It is commonly argued that commercial tests provide teachers with valuable information about the abilities and deficiencies of their students, from which it follows that teachers who rate their students without such information will often be in error" (p. 25). This view is also shared by Mehrens and Lehmann (1987): "... users of tests will make better decisions with appropriate data than without such data" (p. 7).
They go on to elaborate that "The data provided from such tests should help teacher, counselor, administrator, student, parent, and all those concerned with the teaching-learning process make the soundest educational decisions possible" (p. 10). But Bejar (1984) and other researchers have argued that standardized test results frequently have little or no impact on instruction, because the test results offer little or no help in designing instruction that is optimal for an individual student. This is mainly because the standardized tests used may not be customized toward the local curricular needs. Mehrens and Lehmann (1991) make this point too when they state, "Both standardized and teacher-made tests serve a common function: the assessment of the pupil's knowledge and skills at a particular time. It is usually agreed that the teacher-made achievement tests will assess specific classroom objectives more satisfactorily than standardized achievement tests" (p. 349). They add that standardized achievement-test scores may be used to supplement the empirical data obtained from teacher-made test scores to arrive at better educational decisions.

In addition to this, a single global summary score of a student's performance seldom reflects the same response pattern as that of another student with the same total score on the test. In other words, two students with the same score on a test may not have the same proficiency in the same content areas of the test. Harnisch (1983) pointed out that there are as many as 252 different item response patterns that yield the same number-correct score of five on a 10-item test (p.191). But he went on to say that even though 252 distinct item response patterns can be identified, it is obviously not feasible to provide different interpretations for each unique response pattern.

Interest in this area of student response patterns on a test has led to the development of powerful techniques for examining item response patterns. At least 20 different item response indices have been developed for identifying atypical response patterns. Harnisch and Linn (1981) have categorized these indices into two major groups. The first group of "appropriateness" indices is based on item response theory (IRT), while the second group is based directly on the observable pattern of right and wrong answers together with summary statistics.

The first group of item response indices, based on IRT, was described by Levine and Rubin (1979), with various modifications initially suggested by Drasgow (1978). Later, Wright (1979) described another example of an IRT-based index, the chi-square test for person fit that is sometimes used with applications of the Rasch model. Tomic (1987) stated that at that time there were nine such different indices based on IRT. However, she classified them somewhat differently. She grouped the appropriateness measures as those which use the maximum likelihood function to estimate the item and the student's ability. Her second group of indices were those which made use of standardized residuals to calculate the weighted or unweighted total fit mean square.
Harnisch and Linn's second group of indices includes the Personal Biserial Correlation (r_b) (Donlon & Fischer, 1968); the van der Flier Index (U_i) (van der Flier, 1977); the Sato Caution Index (C_i) (Sato, 1975); the Dependability Indices (Φ and Φ(λ)) (Kane & Brennan, 1980); the Agreement (A_i) and Disagreement (D_i) Indices (Brennan, 1980); the Personal Point-Biserial Index (r_i) (Brennan, 1980; cited in Harnisch & Linn, 1981); the Modified Caution Index (C_i*) (Harnisch & Linn, 1981); the Norm-Conformity Index (NCI_i) (Tatsuoka & Tatsuoka, 1982); the Individual Consistency Index (Tatsuoka & Tatsuoka, 1982, 1983); and lastly the Person Average R (PAR) (Ayabe & Heim, 1988, cited in Shishido, Ayabe, & Heim, 1988). Harnisch and Linn also termed this group of indices group dependent indices, because they indicate whether an individual student's pattern of responses is atypical relative to the whole group of students who took the test.

Tomic (1987) went on to describe another category of indices as mathematical extensions which connect Sato's indices to the IRT model. Such extensions have been shown by Tatsuoka and Linn (1983), who extended the concepts of Sato's Student-Problem (S-P) curve theory to take advantage of the results of item response theory. They showed analogous relationships between S-curves and the test response curves (TRC), and between P-curves and the group response curves (GRC), in logistic models. Tatsuoka (1984) has also described the applications of such extended caution indices. Recently, Harnisch and Jenkins (1990) developed a computer program entitled the S.P.P. (Student Problem Package) for the analysis of atypical student responses and atypical item functioning based on the Modified Caution Index.

Despite the research done on the development of these group dependent indices, their practical applications have been limited. The only apparent exception is the Sato Caution Index, which is widely used in Japan (Fujita, Satoh, & Nagaoka, 1977; Sato & Kurata, 1977; Tatsuoka, 1984; McArthur, 1987).

The Sato Caution Index

In 1963, Takahiro Sato, an engineer at the Nippon Electric Company (NEC) in Japan, developed an instrument known as the Response Analyzer (RA) (Tatsuoka, 1978). This instrument was used to provide a teacher in the classroom with a means to determine the mean class performance and the mean response time of her pupils on a test of multiple-choice questions. Interest in analyzing class performance data with computer applications led Sato and Professor Hiroichi Fujita of Keio University to the development of a new non-parametric method of data analysis. This eventually led to the building of the "S-P Chart" (Student-Problem Chart). Sato (1990) claims that "For the analysis of performance data, the S-P chart is preferable to the traditional test-score theory with its reliance on the normal (Gaussian) distribution of errors of measure or the more recent latent trait theory" (p.135).

The S-P (Student-Problem) technique of analyzing patterns of student responses on a test involves the construction of a students-by-items matrix where the columns consist of the test items arranged in order of increasing difficulty, and the rows consist of students arranged from those who scored highest to lowest on the test.
Sato's Caution Index, C(S_i), for student i is then calculated as follows:

    C(S_i) = 1 - \frac{\sum_{j=1}^{n} y_{ij} Y_j - X_i \mu'}{\sum_{j=1}^{X_i} Y_j - X_i \mu'}

where

    y_{ij} = the i-th student's score on the j-th item, coded 1 for correct and 0 for incorrect,
    X_i = the i-th student's total score on the test,
    Y_j = the number of students getting the j-th item correct,
    \mu' = the average item score on the test, and
    n = the number of items on the test,

with the items arranged in order of increasing difficulty. Similarly, the Sato Caution Index, C(P_j), for the j-th item can be calculated by using the equation below:

    C(P_j) = 1 - \frac{\sum_{i=1}^{N} y_{ij} X_i - Y_j \mu}{\sum_{i=1}^{Y_j} X_i - Y_j \mu}

where

    \mu = the average of all the students' test scores, and
    N = the number of students,

with the students arranged in descending order of total score. Sato (1980) has also given the equations for both caution indices in terms of covariances.

In interpreting the caution indices, Sato (1980) used a value of .5 on the caution indices for classifying students into any one of six different categories. This is summarized in Figure 2.1, reproduced from Sato (1980), Figure 8, p.155. Similarly, using the same value of .5 on the item caution indices, Sato classified items into four categories (see Figure 2.2, reproduced from Sato (1980), Figure 9, p.157).

Figure 2.1  A summary of the student characteristics as reflected by their achievement levels and their caution indices

Achievement level | Caution index of .5 or below | Caution index above .5
High | Is doing fine. | Is making careless mistakes.
Medium | Needs to study a little harder. | Is making careless mistakes, and needs to study a little harder.
Low | Needs to study much harder. | Has sporadic study habits and/or insufficient readiness for the material covered in the test.

Figure 2.2  A summary of the item characteristics as reflected by the item difficulty and the item caution index

Item difficulty | Caution index of .5 or below | Caution index above .5
Easy | A fair item, but it may contain clues to the correct answer. It helps discriminate the low achievers from the rest of the students. | Probably needs revision. It is missed by a few high scorers but answered correctly by some low scorers. Perhaps there is a poor option included in the possible responses.
Hard | A good item for discriminating the high achievers. | Poor item. It may be mis-keyed or contain ambiguous terms. The item is heterogeneous with respect to the other items; it may be measuring a different content than the other items.

Because the Sato caution indices range from zero to values above 1, Harnisch and Linn (1981) developed a Modified Caution Index, C_i*, to produce an index with a lower bound of zero and an upper bound of 1. The advantage of this Modified Caution Index is that it eliminates the upper extreme scores that are obtained when using the Sato Caution Index. It also serves as a basis for relative comparisons between indices. Harnisch (1983) adopted the lower value of .3 to categorize students and items with the Modified Caution Indices.
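The caution index formulas above translate directly into code. The following is a minimal sketch in Python with NumPy, added here as an illustration and not software used with the S-P chart; the function name is hypothetical, a complete dichotomously scored matrix is assumed, and ties among the marginal totals are broken by the sort's default order, which is exactly the arbitrary resolution discussed in the assumptions below.

    import numpy as np

    def sato_caution_indices(X):
        """Student and item caution indices, C(S_i) and C(P_j), for a
        binary students-by-items matrix X (rows = students, columns =
        items), following the two formulas above.  Ties in the
        marginal totals are broken arbitrarily by argsort."""
        X = np.asarray(X)
        N, n = X.shape
        # Order students from highest to lowest score and items from
        # easiest to hardest, as in the S-P chart.
        X = X[np.argsort(-X.sum(axis=1))][:, np.argsort(-X.sum(axis=0))]
        totals = X.sum(axis=1)        # X_i, the student total scores
        counts = X.sum(axis=0)        # Y_j, number correct on each item
        mu_p = counts.mean()          # mu', the average item score
        mu = totals.mean()            # mu, the average student score

        C_S = np.zeros(N)
        for i in range(N):
            num = X[i] @ counts - totals[i] * mu_p
            den = counts[:totals[i]].sum() - totals[i] * mu_p
            C_S[i] = 1.0 - num / den if den else 0.0

        C_P = np.zeros(n)
        for j in range(n):
            num = X[:, j] @ totals - counts[j] * mu
            den = totals[:counts[j]].sum() - counts[j] * mu
            C_P[j] = 1.0 - num / den if den else 0.0
        return C_S, C_P

For a response pattern that follows the group's difficulty ordering perfectly, the numerator equals the denominator and the index is zero; the index grows as the pattern departs from that ordering.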
To measure the degree of discrepancy of the Student and Problem curves to each other, or to the Guttman scale, Sato developed a measure which he termed the Disparity coefficient, D*. It is calculated as follows:

    D^* = \frac{A(N, n, \bar{p})}{A_B(N, n, \bar{p})}

where A(N, n, \bar{p}) is the area between the S-curve and the P-curve in the given S-P chart for a group of N students who took the n-problem test and obtained an average problem passing rate \bar{p}, and A_B(N, n, \bar{p}) is the area between the two curves as modeled by cumulative binomial distributions with parameters N, n and \bar{p}, respectively.

Based on the experience of a large number of S-P charts, Sato developed a rule of thumb that a disparity coefficient value for an achievement test is usually around .45 to .50. A value of this size is "about right" for an ability test involving several distinct abilities (or factors), while a value exceeding .60 is a danger signal (Fujita, Satoh, & Nagaoka, 1977). In the latter case, he explains, it may signify that the set of items is excessively heterogeneous or that the group of students comprises two or more subgroups with varying degrees of exposure to the material being tested.

Sato has also developed a modification of the Caution Index in order to examine patterns of responses to clusters or subtest scores in comparison with an "ideal" pattern of scores on the individual subtests, namely the perfect Guttman pattern (McArthur, 1987).

Assumptions of the S-P Model

All items must be scored dichotomously, and students must answer all questions on the test. Any missing values, where students have omitted or not attempted questions on the test, need to be meaningfully scored, usually with a zero.

The S-P model can be applied to as few as two students and two problems. This is to say that the model can work for a 2 x 2 students-by-items data matrix, a 2 x J matrix, or an I x 2 matrix, where J represents the columns of items and I represents the rows of students. There is theoretically no upper limit on the number of students or items to which this model may be applied. The only possible limit imposed would be that of available computer memory space.

When the rows of the matrix are ordered by the total scores of the students, two or more students can share the same total score. As the positions occupied by the students must be unique, these marginal total ties must be resolved arbitrarily. Similarly, there may be ties in the item difficulty values. The resolution of this latter problem depends on the test builder's experience and knowledge of the content material, and is usually less arbitrary than in the case of the students. These two steps to resolve the tied marginal scores result in some instability in the S and P curves.

The calculation of the caution index depends on the linear interpretation of steps between the marginal totals. This necessitates treating all the elements in the matrix equally, and it does not consider the influence of students' guessing on the test.

S-P Model Interpretations

A large value of the Sato Caution Index would indicate that an atypical response pattern is present. Harnisch and Linn (1981) suggest that some of the reasons for this may be guessing, carelessness, high anxiety, an unusual instructional history or other experiential background, a localized misunderstanding that influences responses to a subset of items, or copying a neighbor's answers to certain questions. As such, they add, a large value of the caution index would raise doubts about the validity of the usual interpretations of the total score for an individual.

Harnisch's (1983) interpretations of the caution indices are referred to as Modified Caution Signals. The latter are very similar to Sato's interpretations, except that Harnisch places 0.3 as the cut-off criterion value for an aberrant item. Harnisch also refers to each student's classification in terms of test performance (high or low) and the Modified Caution Index, MCI (high or low).
His four classifications for students are as follows:

Signal A = high test performance (greater than 50% of items correctly answered) and low MCI (less than or equal to 0.3);
Signal B = high test performance (greater than 50% of items correctly answered) and high MCI (greater than 0.3);
Signal C = low test performance (less than or equal to 50% of items correctly answered) and low MCI (less than or equal to 0.3); and
Signal D = low test performance (less than or equal to 50% of items correctly answered) and high MCI (greater than 0.3).

Similarly, his four Modified Caution Signals for items are as follows:

Signal W = difficult item (50% or fewer students answered correctly) and low MCI (less than or equal to .3);
Signal X = difficult item (50% or fewer students answered correctly) and high MCI (greater than .3);
Signal Y = easy item (greater than 50% of students answered correctly) and low MCI (less than or equal to .3); and
Signal Z = easy item (greater than 50% of students answered correctly) and high MCI (greater than .3).

Precision of the Sato Caution Index

All item responses are taken to be equally meaningful. As such, the S-P analysis gives an indication of how good any single response really is. Sato (1980) suggested that a caution index value above 0.5 would indicate the existence of an anomaly in the response pattern. Tatsuoka (1984) has suggested an index of 0.8 instead. However, Harnisch (1983) has suggested the division point of 0.3 on his Modified Caution Indices (MCIs) as indicative of atypical response patterns. He qualifies this by saying that not enough is known about the distributions of the MCIs to say that 0.3 is always a reasonable cutting point. Perhaps it would be more reasonable to leave that decision to the subject teacher, as he or she may be more familiar with each student's academic ability as well as with the quality of the test questions built.

Advantages of the S-P Model

An advantage of using this model is that it makes few assumptions. The interpretations of the model do not require a strong theoretical background. As such, most teachers, school administrators, and parents would not have much difficulty understanding the interpretation of the model. The Sato caution indices are also reported to be less demanding to calculate than other item response indices like Cliff's c_i1 and c_i2 indices, Mokken's H index, Tatsuoka and Tatsuoka's Norm Conformity Index (NCI) and van der Flier's U_i index (Harnisch & Linn, 1981). Harnisch and Linn (1981) have also shown that the caution index compares well to all of the indices previously mentioned.

According to McArthur (1987), the S-P technique is mostly used in Japanese schools. An appropriate microcomputer (marketed only in Japan) has been configured exclusively for the purposes of the S-P method. This microcomputer has enabled classroom teachers to use the technique interactively (McArthur, 1987). In the U.S., there have been some efforts by Harnisch and Romy (1985) and, more recently, by Harnisch and Jenkins (1990) to apply the S-P model in a computer program. This program is presently marketed as the Student Problem Package (S.P.P.) (version 2.2) for IBM compatible computers.
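The Modified Caution Signal scheme described earlier in this section reduces to a two-way lookup on performance and MCI. A minimal sketch follows; the function name and calling convention are illustrative, and this is not code from the S.P.P. program.

    def harnisch_signal(prop_correct, mci, for_student=True):
        """Classify one student (signals A-D) or one item (signals W-Z)
        under Harnisch's (1983) Modified Caution Signal scheme.
        prop_correct is the proportion of items the student answered
        correctly, or the proportion of students answering the item
        correctly; mci is the Modified Caution Index."""
        high = prop_correct > 0.5      # high performance, or easy item
        flagged = mci > 0.3            # atypical by the MCI cut-off
        if for_student:
            return {(True, False): "A", (True, True): "B",
                    (False, False): "C", (False, True): "D"}[(high, flagged)]
        return {(False, False): "W", (False, True): "X",
                (True, False): "Y", (True, True): "Z"}[(high, flagged)]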
However, this is a limitation also found in many traditional psychometric analyses. Another limitation is that the development of the S-P technique was not based on any strong psychometric or educational theory. As such, this does not allow one to draw strong inferences from the S-P model about the way in which students are performing or the manner in which the items on the test are functioning. As McArthur (1987) aptly points out, "... in developing a diagnostic interpretation of a student’s score pattern, the teacher or researcher must make a conscious effort to balance the evidence in light of uncertainty about what constitutes critical or significant departure from the expected" (p.90). Another concern is the absence of established criteria for determining the significance of the caution indices calculated. As Harnisch (1983) points out, the statistical properties and standard errors of these indices are not well understood. Little is also known about the stability of the indices when students or items having the same marginal totals are arbitrarily fixed. 24 Qentrovetsies about the 8-2 model The S-P technique does not account for students guessing on the test. Guessing will affect the pattern of pupil responses over the items and this pattern will in turn affect the derivation of the item caution indices. As such, the interpretations of the caution indices may be potentially misleading or inappropriate. 7 The ordering of items according to their difficulty levels assumes linearity and unidimensionality of the data. Thus, data that are nonlinear or multidimensional will not be appropriately analyzed by the S-P method. The Ordering Theory Otdet Analysis and Dimensionality Conceptually and intuitively, a linear ordering among a set of items represents the most parsimonious scaling of the items (Airasian & Bart, 1973). This is true if the test is unidimensional. Guttman (1950) was the first to attempt a linear ordering of the items of a test in his Scalogram Analysis. Birenbaum and Tatsuoka (1982), in their article on the dimensionality of achievement test data, stated that studies regarding the dimensionality of achievement test data inmdifferent subject.matter areas have indicated.that there is always more than one major factor underlying any test data in the achievement domain. Tatsuoka & Birenbaum (1979, 1981) and Birenbaum (1981) have found this result particularly for problem-solving tests and even when measuring achievement in 25 a specific topic. Kingsbury and Weiss (1979) (cited in Birenbaum & Tatsuoka, 1982) have further shown, by factor analysis, that the dimensionality of a test can change depending on when the test was given. In their study, they found that the variance accounted for by the first factor at the time of the pretest was much less than at the peak of instruction. Thus, in a test on two groups of students exposed to different curricular emphases, different dimensionalities for the same test may exist. Reckase (1979) also raised such concerns about the multidimensionality of test data when applying the assumption of unidimensionality in latent-trait models. Airasian (1971) and Airasian & Bart (1973) have shown that orderings developed from logical and statistical analyses indicate that non-linear orderings among tasks are the rule rather than the exception. This finding is supported by research in areas of cognitive development (Bart & Airasian, 1974; Airasian, Bart, & Greaney, 1975; ) and curriculum development (Resnick, 1976; Gagne’, 1985). 
Furthermore, using the Guttman scales, which measured only the linear hierarchy of tasks, made it difficult to obtain reproducible scales when there were more than six or seven tasks (Airasian & Bart, 1975). The ordering theory developed by Airasian and Bart from tree theory, is based on scalogram analysis. It extends the analysis from linear to include non-linear hierarchical networks of tasks. Airasian and Bart (1975) define ordering 26 theory as "... a deterministic measurement model which uses ‘task response patterns to identify both linear and non-linear qualitative, prerequisite relations, among tasks and behaviors" (p. 166). Its primary purpose is either to test.the hypothesized hierarchies among items or to determine the hierarchies among the items (Bart & Krus, 1973). The primary concern of the ordering analysis model is the prerequisite relationship between tasks. In the case of achievement tests, this is mapped by the test items. To apply the ordering theory to achievement tests, the items must be dichotomously scored and that the examinees must respond to all the items on the test. In the ordering analysis, one is interested in using the observed order or dominance relations between persons and items within the same set (Wise, 1981). Essentially, this means that.if a person passes item i, she is said to dominate that item. Likewise, if she is able to pass item i but fails item j, then, for her, item j dominates item i. Krus (1974) describes assymetry, transitivity, and connectedness as the essential properties of an order relation. He adds that these are the properties that provide for an inference of dimensionality of the data matrices. In terms of dominance or ordering, Wise (1981) describes asymmetry being present when elements i and j cannot dominate each other. Connectedness is explained as the existence of a relationship between two elements i and j within an order, 'which is to say that either i dominates j or j dominates i. 27 Transitivity, on the other hand is interpreted to mean that for any of the elements i, j, and k, that are in an order, if i dominates j, and j dominates k, then i must dominate k. It is this property of transitivity that permits the determination of item-item and person-person dominance in order analysis. Thus, considering the four response patterns for two items i and j; (0,0), (1,0), (0,1) and (1,1), the item i is a prerequisite to item j to the extent that the response pattern (0,1) occurs infrequently. In terms of dominance or ordering, the response pattern (0,1) is termed disconfirmatory, and the response patterns (0,0), (1,0) and (1,1) are termed confirmatory (Bart & Krus, 1973). The ordering analysis will provide information concerning the following types of relationships: 1. Empirical Prerequisite - A task i is determined to be empirically prerequisite to»another task j if a score 0 for task i does not co-occur with a score of 1 for task j. 2. Empirically Equivalence - Two tasks, i and j, are considered to be empirically equivalent if the scores on task i are identical to scores on task j for all response patterns. 3. Empirical Independence - A task i is empirically independent of another task j if the score for task i is unrelated to the score of task j (Bart & Airasian, 1974). Steven Wise (1986) simplifies the characteristics of these relationships in a diagram. (See Figure 2.3) However, he 28 prefers to use the term ’logical’ instead of ’empirical’ in the last two relationships shown below. 
Prerequisite Relation
                    Task j
                    0     1
    Task i    0    30     0
              1    50    20

Logical Equivalence Relation
                    Task j
                    0     1
    Task i    0    40     0
              1     0    60

Logical Independence Relation
                    Task j
                    0     1
    Task i    0    10    20
              1    40    30

Figure 2.3  Response patterns characteristic of prerequisite, logical equivalence, and logical independence relationships, in percentages (reproduced from Wise, 1986, Figure 1, p.444).

In identifying the presence of any one of these three relationships among tasks, which in tests are represented by items, there is the problem of measurement error. As Wise (1986) points out, this is "Because the tasks will not be perfectly reliable measures of the model's components ..." (p.443). Therefore, the perfect response patterns of zeros in the disconfirmatory response (0,1) cells in Figure 2.3 will rarely occur in practice. Often there will instead be a small number of persons showing a disconfirmatory response, which may be the result of random measurement errors of the items.

The most common solution to this problem is to accept a certain tolerance level for the percentage of disconfirmatory responses allowed. The value of the tolerance level should be based on the researcher's judgement of the amount of measurement error present in the data. Most often the tolerance level is set between 5% and 12% (Piazza & Wise, 1988; Wise, 1986). If the percentage of disconfirmatory responses is lower than the tolerance level chosen, then a prerequisite relationship is said to exist. Similarly, a logical equivalence relationship is said to exist when the percentages of the response patterns (0,1) and (1,0) are both lower than the tolerance level chosen.

Krus (1977) also developed a probabilistic order-analytic model based on the deterministic model. The probabilistic model generates "order loadings" for the items on each dimension. For a given dimension, the order loadings reflect the relative order position of each item (Wise, 1981).

Wise (1981) has suggested a modified order-analysis procedure (ORDO) which is also based on the deterministic model of Krus and Bart (1973). It serves to eliminate problems associated with item dominance and proximity by considering the partial order model of dimensionality. Essentially, it is equivalent to performing a factor analysis followed by the ordering analysis.
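The pairwise test just described, comparing disconfirmatory proportions against a tolerance level, can be sketched in a few lines of Python. This follows the deterministic model only; the function name and default tolerance are illustrative assumptions, and the final branch simply labels every non-hierarchical pair as independent, a simplification of the definitions given above.

    import numpy as np

    def classify_pair(x_i, x_j, tolerance=0.05):
        """Classify the relation between two dichotomously scored items
        using the response-pattern scheme of Figure 2.3.  x_i and x_j
        are 0/1 response vectors over the same examinees."""
        x_i, x_j = np.asarray(x_i), np.asarray(x_j)
        p01 = np.mean((x_i == 0) & (x_j == 1))  # fail i, pass j: disconfirmatory for i -> j
        p10 = np.mean((x_i == 1) & (x_j == 0))  # pass i, fail j: disconfirmatory for j -> i
        if p01 <= tolerance and p10 <= tolerance:
            return "logical equivalence"
        if p01 <= tolerance:
            return "i is prerequisite to j"
        if p10 <= tolerance:
            return "j is prerequisite to i"
        return "logical independence"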
For traditional forms of validity, the correlational procedures are designed to measure the degree of linear relationship between linearly ordered variables. As such, predictive validity may not be‘ quite applicable for non-linearly ordered sets of test items. However, content and construct validity would be applicable to both linearly and 31 non-linearly ordered sets of items. Eeete; AhaLysis and other Ahelyeie In order to use factor analysis, the data must be interval in nature, whereas order analysis only requires that the data be at least ordinal. This implies that order analysis would be more appropriate than factor analysis for evaluating the dimensionality of ordinal data. However, studies have shown differing results when order analysis and factor analysis were both used to evaluate the dimensionality of the same sets of dichotomous items. Krus and Weiss (1976), Krus (1977), and Bart (1978) found that the item orders obtained using the probabilistic order-analytic model corresponded only slightly with the factors obtained from a factor analysis. Similarly, Krus and Bart (1974), Reynolds (1981) and Wise (1981) found that there was little congruence between factors and item chains obtained from several deterministic order-analytic models. However, empirical studies using both factor and order analysis by Krus & Tellegen (1975) and Krus, Weidman, & Bland (1975) (both cited in Krus & Weiss, 1976) have frequently found the results of both methods in general agreement. These conflicting results can best be explained by Krus and Weiss (1976) who state that the degree of congruence of factor and order analytic solutions appear to be jointly determined by the character of the data analyzed and by the values assigned to the tolerance level of the order analysis. 32 By comparing the two methods in two classic experiments, the Thurstone’s "box problem" and Armstrong & Soelberg’s experiment, they found that if the data structures are highly organized, or non-random, order analysis at any tolerance level and factor analysis would frequently converge. As the data became more random, they had.to lower the tolerance level in order to find convergence with the factors. Krus and Krus (1980) showed that a correlation matrix for two dichotomous items Ihas two jprincipal components representing proximal and hierarchical relationships between the two items. The proximal information is given by the proportion of (0,0) or (1,1) response patterns while the hierarchical or dominance information is given by the difference between the numbers of (0,1) and (1,0) response patterns. Thus, both proximity and dominance information is used in a factor analysis of correlation coefficients. However, in order analysis, only dominance information is analyzed. Wise (1983) (cited in Wise & Tatsuoka, 1986) showed that there were two major problems when using traditional order analysis procedures. Firstly, items that were similar in difficulty levels and measuring the same factor, for example, two parallel items, commonly do not show a dominance relation and are deemed to belong to different dimensions. This was also a finding of Wise (1981). Secondly, two items that have substantially disparate difficulty levels will tend to show consistent dominance relations, whether or not the two items measure the same factor. Hence, items measuring the same 33 factor may not also show a dominance relationship because they have similar difficulty levels. 
Similarly, items measuring different factors may show a dominance relationship only because of the difference of difficulty levels between the items. The solution to these two problems may be found in the item proximity information. It follows that items that are measuring the same factor and are similar in difficulty level should also show high proximity. Alternatively, the items that measure different factors should show low proximity (Wise & Tatsuoka, 1986). Thus, Wise & Tatsuoka (1986) recommend that a factor analysis be performed first on the data, to identify which factors each item measures, followed by successive order analyses on the groups of items measuring the various factors. figmeEY The Sato caution index is relatively easy to compute and interpret. It involves constructing a S-P matrix where items are arranged from easy to difficult and students are arranged from highest scoring to lowest. This linear arrangement of test items in the S-P matrix implies that the test is unidimensional. However, many achievement tests are multidimensional and hence the interpretation of the caution indices of the students as well as the problems may not be accurate when considered from a unidimensional perspective. The ordering theoretic analysis permits the test items to be arranged in a hierarchical or non-linear manner. This requires 34 the construction of an item-student by item-student dominance matrix. However, as the ordering theoretic analysis only identifies dominance relationships, it is suggested that a principal components factor analysis be performed first to identify which items measure the factors. Following this, the ordering theoretic analysis may be conducted to identify the hierarchical order among items. CHAPTER III RESEARCH DESIGN AND PROCEDURE Ihttoductioh This study is comprised of five phases. The first phase was to obtain permission to conduct the study from the relevant authorities in Malaysia and at Michigan State University. Approval for this study was first obtained from the University Committee on Research Involving Human Subjects at Michigan State University. Permission was then sought from the Educational Planning and Research Division of the Malaysian Ministry of Education to conduct this study in six Malaysian schools. The researcher’s approved proposal was submitted to both authorities for this purpose. In the second phase, the test instrument used for gathering the data of this study was developed after the Malaysian Examinations Syndicate denied the researcher access to the student answer scripts for the 1991 national examination. The third phase was concerned with the selection of the sample of the study. In the fourth phase, the test instrument was administered, and in the last phase the data gathered were statistically analyzed. 35 36 ve ment 0 the es Instr me t A multiple choice objective test paper was constructed and administered to a sample of 354 fifth form students (equivalent to'U.S. 11th graders) from.six.schools. Forty test items were selected from five previous national examinations (1986-1990) for the subject of Biology. The test format conformed to that of the national examinations and the test items were selected in accordance to the table of specifications used by the Malaysian Examinations Syndicate. The latter examinations board is responsible for three of the four national examinations conducted annually in Malaysian schools. The test.was made up of two sections. Section I comprised 25 items and Section II comprised 15 items. 
Each section had a different multiple choice format. Both sections had five options for each item, but in Section II each option was composed of a combination of multiple answers. In Section II, for example, the student chose option A if she believed that the first three statements after the item were true; alternatively, if the student chose option B, then she believed that the first and the third statements were true, and similar combinations were provided for options C, D and E. This format corresponded to what is known as the Item K format. Instructions were given at the top of each page, and the students were familiar with this format as it conformed to the format of the national examination for that subject. The researcher was careful to include items that invoked some higher order thinking skills from the thirteen categories that were listed for the subject in the syllabus. The table of specifications for the test is found in Appendix A, and the test instrument is found in Appendix B.

Selection of the Sample

Theoretically, the Sato caution index may be applied to cases of at least two students and two test items. For the ordering theoretic analysis, studies have been done with as few as 15 subjects (Bart & Krus, 1973) and as many as 1000 subjects (Bart, 1978). As for tasks, as few as five Inhelder Piagetian formal operations tasks have been ordered (Bart, 1978; Bart, Frey, & Baxter, 1979), and as many as 30 animals have been ordered in a study of the hierarchy among attitudes toward animals (Bart, 1972). However, researchers have noted the increasing difficulty of ordering the items as the number of items increases.

The six Malaysian schools chosen for this study comprised three urban and three rural schools. The classification of these schools in terms of location conformed to that used by the authorities of the Malaysian Examinations Syndicate, who were responsible for conducting pretests for the national examination. The schools chosen for the study had at least 50 candidates enrolled for the 1991 national examination in the subject of Biology. This information was obtained from the 1991 student candidature enrollment list of the Examinations Syndicate. In spite of this, when the test instrument was eventually administered, one rural school in the study had only 36 students due to absenteeism. All six schools were classified by the Examinations Syndicate as having had students who were average to above average in academic performance, based on the students' performance in the last five years' national examinations. A summary of the number of students from each school who participated in this study is shown in Table 3.1.

Table 3.1
Information on the Subjects Used in the Study

School      School      Number of    Total number
location    code        students     of students
Urban       School 1    79
            School 2    54           192
            School 3    59
Rural       School 4    59
            School 5    67           162
            School 6    36

Procedures of Test Administration

After selecting the schools for the study, permission was sought from the State Education Director of Selangor Darul Ehsan to conduct the Biology test and to collect demographic information on the students and Biology teachers in the six schools. On obtaining his approval, the permission of the six principals of the schools was sought, and arrangements were made to administer the Biology test to two classes of students in each school. The administration of the test to the six schools was conducted over a two-week period.
All of the schools were in the midst of preparing for their school trial examinations, and most of them had one topic or less left to cover in the syllabus. This showed that all the students were at approximately the same level of preparation for the subject. This is important, as comparisons of the student and item caution indices will be made between students of different school location, gender and SES. The data should not be biased by unequal student readiness for the test.

The researcher personally administered the test to all 12 classes of students. The purpose of the test was made clear to the students, and they were instructed to answer every item on the test. The students were told that the test was a means of collecting test data by the researcher for his doctoral dissertation. Each student was given an answer sheet on which to shade their answers. This answer sheet required them to write their name, their sex and class, and the name of their school. The students were closely proctored to ensure that there was no cheating on the test.

The students were given a maximum time of 75 minutes for the test. This time limit conformed to the time allocated for the Biology test in the national examinations. Almost all the students had no difficulty completing the test in this allocated time. A wall clock was placed in clear view to help the students keep track of the time. A sheet of blank paper was also given to each student for any rough calculations that they wished to make. The students were told to treat the test seriously and to answer all questions. Some students expressed the wish that the results be made known to them as soon as possible. They were reassured that the results of this test had no bearing on their forthcoming performance in the national examinations. The data collected were meant strictly for research purposes. After the administration of the test, copies of the test instrument were given to the Biology teachers, and the students were told that, if they so wished, they could discuss the questions with their teachers.

Each of the Biology teachers who taught the students in the study was given a questionnaire to fill out. When they returned the questionnaire, they were asked if any of the questions on the questionnaire needed clarification. A copy of the questionnaire is given in Appendix C. Demographic information on the students was also obtained from the school authorities. This information pertained to the individual students' school attendance and their parents' occupations. The researcher was only able to obtain the students' school attendance for the school year of 1991, as the previous year's records had been sent for auditing.

Data Analysis

The researcher was able to obtain some descriptive statistics from the previous national examinations regarding the items chosen for the test (see Table 3.2).

Table 3.2
National Examinations Statistics for the Biology Paper for 1986 to 1990

Year   N        Items   Mean     SD      KR20   SEM     Mam      MRwh
1986   44,726   40      27.185   6.445   .831   2.650   10.990   .359
1987   49,555   40      25.770   6.980   .842   2.774   11.442   .372
1988   50,744   40      25.094   6.546   .826   2.731   11.585   .356
1989   30,687   40      23.153   6.544   .816   2.807   12.100   .349
1990   43,258   40      24.917   6.765   .842   2.689   11.610   .371

The keys to the items used in the test were obtained from the Malaysian Examinations Syndicate. The student answer sheets were first hand scored by the researcher, and the
The students’ responses and the demographic information of the students and their teachers were then coded into an ASCII (American Standard Code for Information Interchange) file. qurintout of this file was obtained and each of the entries were then checked for mis-entry with each of the students’ answer sheets. Following this, the file was then transferred into the SPSS program for the:data analysis. The students’ responses on the test were then recoded to a dichotomous score where "1" was given for a correct answer and "0" was given for a wrong answer. All missing responses were coded "9". The subtotals of the students’ responses for Section Itand Section II were then obtained together with the students’ total score on the test. These totals were then compared with the totals obtained from hand scoring each of the students’ answer sheets and found to be identical in all respects. In analyzing the data, a preliminary item analysis was carried out and the p-values of the test items were obtained. These p-values were then compared to the p-values obtained the 42 same items on the national exams. A t-test was conducted for the two sets of p-values. As the test instrument was a composite of questions taken from five previous national examinations, the t-test between the two sets of p-values was to ascertain that the items on the test instrument were functioning in a similar way to that on the national tests. This is mainly because all test items in national examinations are not secure items. Histograms for the students’ scores were also plotted to examine their distributions. A reliability analysis was conducted to report the Cronbach alpha coefficients for the whole test as well as for the two sections of the test. A S-P tablewwaS'then constructed with items arranged from easiest to most difficult, and students from highest scoring to lowest scoring. Prior to the calculation of the caution indices, the statistical commands of SPSS used for the analyses were tried out on the raw test scores found in a paper by Sato (1985). The caution indices reproduced for his study of 30 students and 31 problems were identical to the values he had obtained in his study. The Item and Student caution indices of the test were then calculated. The item caution indices were recalculated for various samples of the students in accordance to the type of group characteristics required by the research questions. A principal components factor analysis was performed on the data before conducting the ordering theoretic analysis. This was to facilitate the researcher to order the items on 43 the test in a hierarchical fashion. A hierarchical cluster analysis was also performed on the test items. Statistical tests were then conducted on the results obtained in the analysis. Summety The research design of this study was aimed at addressing two main issues concerning the caution index. The first issue is the effect of the dimensionality of the test on the calculation of the caution indices and the second issue is the type of the group characteristics affecting the magnitudes of the item and student caution indices. Six Malaysian schools were selected for this purpose, three from an urban setting and the other three from a rural setting. A Biology test was constructed and administered to a total of 354 fifth form students from the six schools. Demographic information was also obtained regarding the students and the teachers who taught them the subject. 
Summary

The research design of this study was aimed at addressing two main issues concerning the caution index. The first issue is the effect of the dimensionality of the test on the calculation of the caution indices, and the second is the type of group characteristics affecting the magnitudes of the item and student caution indices. Six Malaysian schools were selected for this purpose, three from an urban setting and three from a rural setting. A Biology test was constructed and administered to a total of 354 fifth form students from the six schools. Demographic information was also obtained regarding the students and the teachers who taught them the subject. All this information was coded into an ASCII file, cleaned, and subjected to a data analysis. The data analysis was designed to obtain the item and student caution indices using the Sato model and the ordering theoretic analysis. A factor analysis and a cluster analysis were conducted to aid the latter analysis. Various other statistical tests were also performed on the results of the test data.

CHAPTER IV
ANALYSIS AND INTERPRETATION OF THE DATA

Introduction

This chapter presents the way in which the data analyses were conducted. A general description of the characteristics of the sample will first be presented. This will be followed by an account of the factor analyses and the cluster analysis on the test data. The ordering theoretic analysis procedures and the construction of the Sato model will then be described. The manner in which the item caution indices were derived will be explained, and its implications for the ordering of the items discussed. Finally, the results for the three research questions of this study will be reported together with their interpretations.

Characteristics of the Sample

A total of 354 students participated in this study. They were students in the fifth form (equivalent to 11th graders in the U.S.). There were slightly more male than female students, and more students from the urban than the rural setting. All of the students had at least 72% school attendance up to the time of the test administration for the academic school year of 1991. The students' socio-economic status (SES) was coded into two categories, low and middle/high. Students whose parents held professional jobs were classified as middle/high SES, and those without professional jobs were placed in the low SES category. Table 4.1 shows the distribution of students by school location, SES, and gender.

Table 4.1
Distribution of Students by School Location, SES, and Gender

School       Low SES           Middle-High SES     Total
location     Male    Female    Male    Female
Urban        42      20        76      53          191
Rural        58      53        22      29          162
Total        100     73        98      82          353*

* One female student from the urban setting did not report her parent's occupation and was excluded from the table.

A total of eight Biology teachers filled out the questionnaires of the study, and their teaching experience ranged from half a year to 20.5 years. There was only one male teacher. With the exception of the teacher with only half a year's teaching experience, the other seven teachers had taught the same class the previous year. All six schools used the same textbook and had followed the same sequence of topics for classroom instruction as outlined in the textbook. Table 4.2 shows the distribution of the teachers by teaching experience, school location, and gender.

Table 4.2
Distribution of Teachers by Teaching Experience, School Location, and Gender

Teaching experience    Urban             Rural             Total
in years               Male    Female    Male    Female
0 - 5                  0       1         0       1         2
6 - 10                 0       0         0       1         1
11 - 20                0       2         0       1         3
above 20               0       1         1       0         2
Total                  0       4         1       3         8

Analysis of the Test Data

A reliability analysis was conducted on the test items, and the results are shown in Table 4.3. In the analysis of variance, the F statistic for the variation between items was significant (F=52.695, p<.001). This indicated that the items have significantly different means. This finding was confirmed by the large Hotelling's T-squared statistic (T²=2817.7197), which is a test for the equality of means. Its F statistic (F=64.4717, p<.001) was significant, indicating that the hypothesis that the items have equal means in the population can be rejected.
The hypothesis that the items are additive cannot be rejected, as the F statistic for nonadditivity was not significant. This was also shown by the Tukey test statistic, which had a value close to 1. The 40-item test was reasonably reliable, with Cronbach's alpha at .7509.

Table 4.3
Results of a Reliability Analysis on the Items in the Test Instrument

Source of           Sum of       DF      Mean      F        Prob.
variation           squares              square
Between persons      268.9624     353     .7619
Within persons      3003.0500   13806     .2175
  Between items      390.0633      39   10.0016   52.695    .000
  Residual          2612.9867   13767     .1898
    Nonadditivity       .0145       1     .0145     .076    .782
    Balance         2612.9722   13766     .1898
TOTAL               3272.0124   14159     .2311

Tukey estimate of power to which observations must be raised to achieve additivity = 1.0282
Hotelling's T-squared = 2817.7197   F = 64.4717   Probability = .0000
Degrees of freedom: Numerator = 39   Denominator = 315
Cronbach's alpha = .7509   Standardized item alpha = .7565

As one of the research questions was to compare the effects of the two item formats used in the test, the Cronbach alphas for the first 25 items (first format) and the last 15 items (second format) were determined. The Cronbach's alpha coefficients were found to be .6592 and .5207 respectively. When corrected for length, the alpha coefficients were .7558 and .7434 respectively. This showed that both item formats had about the same reliability. The Cronbach's alpha for all 40 items of this test instrument was lower than any of the KR20s of the five national examinations from which the test items were taken. The KR20s of the national examinations ranged from .816 to .842 (see Table 3.2).

The point biserial and the biserial correlation coefficients of the test items were also computed. This was to ascertain whether the items were all performing in the same manner in the test instrument. It must be remembered that the test items were not secure questions, and students would have had access to them. Thus it was important to see if there were any large carry-over effects of some items over others. These coefficients were correlated with the point biserial coefficients of the same test items derived from the national examinations. The correlations were found to be .794 (p<.001) and .744 (p<.001) respectively. These reasonably high correlations indicated that the items of the test were functioning in a manner similar to when they were used in the national examinations.
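The "corrected for length" values above are consistent with the standard Spearman-Brown projection of each section's alpha to the full 40-item length. A quick check (a minimal sketch in Python, which is an assumption of this illustration; the study's analyses were run in SPSS):

    def spearman_brown(alpha, k_new, k_old):
        """Project coefficient alpha for a k_old-item test to k_new items."""
        m = k_new / k_old
        return m * alpha / (1 + (m - 1) * alpha)

    print(round(spearman_brown(0.6592, 40, 25), 4))  # 0.7558 for Section I
    print(round(spearman_brown(0.5207, 40, 15), 4))  # 0.7434 for Section II

Both projected values agree with the corrected coefficients reported above to four decimal places.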
To determine if the sample chosen was representative of a normal population, a histogram of the students' scores on the test was plotted (see Figure 4.1). The plot showed that the distribution of the test scores was reasonably normal. For purposes of comparison, the points of a normal curve based on all valid values of the scores were superimposed on the histogram.

[Figure 4.1: Histogram of the students' total scores on the 40-item test (score midpoints 7 to 39 against number of students, 0 to 50).]

Two other histograms were plotted for the students' scores on the first 25 items and the last 15 items of the test. They were also found to be quite normally distributed (see Figures 4.2 and 4.3).

[Figure 4.2: Histogram of the students' scores on the first 25 items of the test.]

[Figure 4.3: Histogram of the students' scores on the last 15 items of the test.]

A t-test was also performed on the p-values obtained from the test instrument and the p-values of the same items obtained from the national examinations. The results showed that the hypothesis that the means of the two groups were the same could not be rejected (t=-.06, df=78, p=.954).

Three principal components factor analyses were conducted on the test: for all 40 items, for the first 25 items, and for the last 15 items. The analyses showed that the factors failed to converge when the varimax rotation procedure was used. The factor analyses were repeated using the less powerful quartimax procedure. This procedure is considered less powerful because high as well as moderate factors are included in the rotation, but it was adopted to make the results more interpretable. The rotation redistributes the explained variance across the individual factors, thereby permitting the factors to be more easily differentiated from each other. This allowed the group of items under each factor to be identified. The subject matter of the items in each factor was then examined for any common curricular emphasis.

The factor analysis for the 40 items of the test extracted 16 factors. The analyses for the first 25 and the last 15 items extracted 10 and 6 factors respectively. The results of the factor analysis with 40 items showed that Bartlett's test of sphericity was large (1635.3703). This indicated that it was unlikely that the population correlation matrix was an identity matrix. The Kaiser-Meyer-Olkin measure of sampling adequacy, at .72062, was reasonably large. These two statistics justified performing a factor analysis on the data. The 16 factors extracted by the factor analysis accounted for 58.6% of the variance, which left 41.4% of the total variance unexplained. There were also low interitem correlations among the items that were grouped into factors in the rotated factor matrix. This finding is similar to those of Wise (1981), Reynolds (1981), and Krus and Bart (1974), who found little congruence between factors and the item chains from several deterministic order-analytic models. This is also probably why the initial varimax rotation procedure was unsuccessful.

Due to the low interitem correlations, the 16 factors extracted from the factor analysis were not very meaningful. An examination of the various topics represented by the items that were grouped into the factors did not make any strong conceptual sense. Similar findings were obtained for the other two factor analyses. The factor analyses were performed primarily to facilitate the interpretation of the results of the ordering analysis. However, the factors extracted were not able to do this.
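As a point of reference for the extraction step described above, the eigenvalues of the inter-item correlation matrix determine how many components are retained under the eigenvalue-greater-than-one rule and how much variance they account for. This is a minimal sketch assuming Python with NumPy (the study used SPSS, and the function name is the writer's own); it illustrates only the extraction, not the quartimax rotation.

    import numpy as np

    def pca_summary(scores):
        """Eigenvalues of the inter-item correlation matrix (phi
        coefficients, since the items are binary), the number of
        components retained under the eigenvalue > 1 rule, and the
        proportion of variance those components account for."""
        R = np.corrcoef(np.asarray(scores, dtype=float), rowvar=False)
        eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
        retained = int((eigvals > 1).sum())
        return eigvals, retained, eigvals[:retained].sum() / eigvals.sum()

The same eigenvalue sequence, plotted against component number, is what the scree plot referred to below displays.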
The reason for this may be explained by Krus and Weiss (1976), who state that the degree of congruence of factor and order analytic solutions appears to be jointly determined by the character of the data analyzed and by the values assigned to the tolerance level of the order analysis. In this case, the data may not be highly organized in a hierarchical manner. Other reasons for the failure of the factor analysis to be more informative were the moderate intercorrelations between the items and the fact that binary data were used in the analysis. In addition, the sample size used was too small. According to Nunnally (1978), there need to be at least 10 subjects per item for the factor solutions to be stable; in this study there were 40 items but only 354 students. Here, the scree plot, which plots the total variance associated with each factor, showed that there was essentially one factor. Again it may be deduced that the data may not be highly organized in a hierarchical fashion.

An agglomerative hierarchical cluster analysis using the complete linkage method was performed on the data. Six clusters could be identified from the dendrogram produced. It was interesting to note that the ordering of the items produced by the complete linkage method was very similar to the order of items in the Sato model. A Spearman rank correlation between the order of items produced by the cluster analysis and the Sato model produced r_s = .903. The dendrogram of the cluster analysis is found in Appendix E.

The Ordering Theoretic Analysis

The results of the ordering theoretic analysis were first interpreted using a tolerance level of 10 per cent. At this tolerance level, the researcher was unable to accurately construct a conceptual map using all 40 items of the test. A 10 per cent tolerance level meant that a prerequisite relationship between a pair of items was considered to exist as long as the disconfirmatory responses for that pair did not exceed 10 per cent of the total possible responses. However, using the same tolerance level with only the first 25 items of the test reduced the difficulty of constructing the conceptual map, and the task became much easier still with the last 15 items. This confirmed the findings of other studies that reported the increasing difficulty of constructing a conceptual map as the number of items to be ordered becomes large.

Further analysis also showed that lowering the tolerance level made it less difficult to construct the conceptual map. However, somewhat different conceptual maps were produced depending on the choice of the tolerance level. For example, using a tolerance level of 10% on the last 15 items of the test produced a conceptual map with three hierarchical levels, while lowering the tolerance level to 5% produced a conceptual map with six hierarchical levels. Where in the former the lowest item in the conceptual map, item 34, was a prerequisite to ten items, in the latter it was a prerequisite to only one item.
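A minimal sketch of this deterministic screen (Python with NumPy is assumed, and the function name is the writer's own) makes the role of the tolerance level explicit:

    import numpy as np

    def prerequisite_pairs(scores, tolerance=0.10):
        """Deterministic ordering-theoretic screen: item i is taken to be
        a prerequisite of item j when the disconfirmatory (i wrong,
        j right) response patterns do not exceed `tolerance` of all
        responses."""
        scores = np.asarray(scores)
        n_students, n_items = scores.shape
        pairs = []
        for i in range(n_items):
            for j in range(n_items):
                if i != j:
                    violations = np.sum((scores[:, i] == 0) & (scores[:, j] == 1))
                    if violations / n_students <= tolerance:
                        pairs.append((i, j))  # i is prerequisite to j
        return pairs

Lowering the tolerance admits fewer prerequisite relations, which is why the 5% map above is sparser than the 10% map.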
Realizing the inconsistencies produced by this deterministic method of analysis, the researcher decided to adopt Krus's (1975) solution of constructing logical item hierarchies in one dimension. In his method, Krus recommended the use of the probabilistic model 2. He explained that deterministic models of medium-sized data matrices "... frequently yield structures which are too complex and difficult to interpret" (p. 56, cited in Krus, Bart and Airasian, 1975).

The probabilistic model uses McNemar's z criterion score for nonindependent proportions instead of a tolerance level. McNemar's z criterion score is given by the formula

z_{ij} = \frac{d_{ij} - c_{ij}}{\sqrt{d_{ij} + c_{ij}}}

where
d_{ij} = the number of confirmatory (1,0) response patterns between items i and j
c_{ij} = the number of disconfirmatory (0,1) response patterns between items i and j

The z criterion scores were calculated for all possible pairs of items in the test. Usually, z criterion values are selected according to the accepted rules of significance testing, where 1.96 is the critical value at the 5% level for two-tailed tests. For a particular pair of items, positive values of 1.96 and above indicated that item i served as a prerequisite to item j; this meant that item j was above item i in the conceptual map. Conversely, if the z criterion value was -1.96 or below, item j was a prerequisite to item i and was placed below item i in the conceptual map. The number of positive and negative significant values was then recorded. These results were then tabulated in descending order of the number of items that a particular item is a prerequisite to (see Table 4.5). However, there were instances when items had the same number of prerequisites. Ties in the position of the test items were resolved by going back to the particular items in question and selecting the item with the next z value closest to the absolute value of 1.96. The item with the closest negative value would be placed highest in the order of items in the table; conversely, the item with the closest positive value would be placed lower in the order. This gave the sequence of ordering of the test items shown in Table 4.5. This arrangement of the items gave the sequence in which the topics represented by the items in the test were ordered by prerequisite conditions.

Table 4.5
A Distribution of the Test Items According to Prerequisite Requirements

Order of    Test item    No. of items prerequisite    No. of items this item
test item   number       to this item                 is a prerequisite to
                         (z of 1.96 and above)        (z of -1.96 and below)
1           4            0                            35
2           12           0                            34
3           34           0                            34
4           18           0                            34
5           5            0                            33
6           10           1                            33
7           3            4                            27
8           8            6                            25
9           1            6                            24
10          6            6                            24
11          32           6                            24
12          9            6                            23
13          11           6                            23
14          16           7                            23
15          19           7                            23
16          25           7                            23
17          31           11                           17
18          36           16                           14
19          35           15                           13
20          37           15                           12
21          29           15                           12
22          13           15                           12
23          20           16                           12
24          39           15                           10
25          28           17                           8
26          17           17                           8
27          40           18                           8
28          21           19                           6
29          15           23                           6
30          22           23                           6
31          2            24                           6
32          26           24                           6
33          14           27                           5
34          7            27                           4
35          33           32                           4
36          24           33                           4
37          38           36                           1
38          27           36                           1
39          30           36                           0
40          23           38                           0

Ordering the Items According to the Sato Model

To begin the analysis using the Sato model, a students-by-items matrix was constructed in which the 40 items were ordered from easiest to most difficult and the students from highest scoring to lowest scoring. To verify that the SPSS/PC+ commands used for this analysis were correct, the commands were first applied to the test data found in an article by Sato (1985). The caution indices computed for both the students and the items were found to be identical to those obtained by Sato in that article. The same commands were then applied to the study's data to compute the caution indices.
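For reference, the z-criterion screening that produced Table 4.5 can be sketched as follows (Python with NumPy is assumed; the function names are the writer's own, and the zero-denominator case, where two items never disagree, is simply returned as 0):

    import numpy as np

    def mcnemar_z(scores, i, j):
        """McNemar z for nonindependent proportions between items i and j:
        z = (d - c) / sqrt(d + c), with d the confirmatory (1,0) count
        and c the disconfirmatory (0,1) count."""
        d = int(np.sum((scores[:, i] == 1) & (scores[:, j] == 0)))
        c = int(np.sum((scores[:, i] == 0) & (scores[:, j] == 1)))
        return (d - c) / np.sqrt(d + c) if d + c else 0.0

    def prerequisite_counts(scores, crit=1.96):
        """For each item: how many items are prerequisite to it (z >= crit
        in its column) and how many items it is a prerequisite to
        (z >= crit in its row), as tabulated in Table 4.5."""
        n = scores.shape[1]
        z = np.array([[mcnemar_z(scores, i, j) for j in range(n)]
                      for i in range(n)])
        return (z >= crit).sum(axis=0), (z >= crit).sum(axis=1)

Since z is antisymmetric in i and j, counting row entries of 1.96 and above is equivalent to counting column entries of -1.96 and below, which matches the two count columns of Table 4.5.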
Answers to the Research Questions

Three research questions were formulated in this study. In the following pages, each research question will be stated and the results of the analysis pertaining to that question will be reported.

Research Question 1

Is there a difference between the item caution indices derived using the items ordered by the ordering theoretic analysis, and the Sato item caution indices?

The analysis showed that the items arranged according to the Sato model and according to the ordering theory model produced identical values for the item caution indices. This implied that the ordering of the items had no effect on the item caution indices. On closer investigation, it was noticed that Sato's formula for the computation of the item caution indices does not consider the arrangement of the items in the students-by-items matrix. The item caution indices were calculated using the simplified formula below (Sato, 1980):

C(P_j) = 1 - \frac{\sum_{i=1}^{N} x_{ij} X_i - Y_j \mu}{\sum_{i=1}^{Y_j} X_i - Y_j \mu}

where
C(P_j) = the caution index for the j-th item
x_{ij} = the i-th student's score on the j-th item, coded 1 for correct and 0 for incorrect
X_i = the i-th student's total score on the test
Y_j = the number of students getting the j-th item correct
\mu = the average of the students' test scores
N = the number of students

To show why the calculation of the item caution index is not influenced by the arrangement of the items in the S-P table, consider the derivation of the item caution index for item P4 in Figure 4.4.

Student    P1  P2  P3  P4  P5  P6  P7  P8  P9  P10   Score
1           1   1   1   1   1   1   1   1   1   1     10
2           1   1   1   1   1   1   1   1   1   0      9
3           1   1   1   1   1   0   1   1   0   1      8
4           1   1   1   1   1   1   1   0   0   0      7
5           1   1   1   1   1   0   1   1   0   0      6
6           1   1   0   0   1   0   1   0   1   0      6
7           1   1   1   1   0   1   0   0   0   0      5
8           1   1   1   0   1   0   0   0   1   0      5
9           1   1   0   1   0   0   1   0   0   1      5
10          1   0   0   1   0   1   0   1   1   0      5
11          0   1   1   1   0   1   0   0   0   0      4
12          1   0   0   0   0   1   0   1   0   1      4
13          1   0   1   0   1   0   0   0   0   0      3
14          0   0   1   0   0   1   0   0   0   0      2
15          0   1   0   0   0   0   0   0   0   0      1
Correct    12  11  10   9   8   8   7   6   5   4

Figure 4.4  An ordered students-by-items matrix (S-P table) for 15 students and 10 problems.

The first term in the numerator on the right side of the equation is

\sum_{i=1}^{15} x_{i4} X_i = 10+9+8+7+6+0+5+0+5+5+4+0+0+0+0 = 59

It follows that this term in the equation is not influenced by the arrangement of the students in the S-P table. For the second term in the numerator,

Y_4 = \sum_{i=1}^{15} x_{i4} = 9

which is the number of students getting item P4 correct, and

\mu = \frac{1}{N}\sum_{i=1}^{15} X_i = \frac{1}{15}(10+9+8+7+6+6+5+5+5+5+4+4+3+2+1) = \frac{80}{15}

This term gives the average of the 15 students' total scores. The first term of the denominator is

\sum_{i=1}^{9} X_i = 10+9+8+7+6+6+5+5+5 = 61

This term gives the cumulative sum of the students' scores from student #1 to student #9 (nine being the number of students getting item P4 correct). Combining the terms to obtain the item caution index,

C(P_4) = 1 - \frac{59 - 9(80/15)}{61 - 9(80/15)} = 1 - \frac{11}{13} = .154

Therefore, the arrangement of the items in the S-P table does not influence the derivation of the item caution indices. Similarly, it can be shown that the derivation of the student caution indices is not influenced by the arrangement of the students in the S-P table. Using the same notation, the student caution index can be derived by the following modified formula (Sato, 1980):

C(S_i) = 1 - \frac{\sum_{j=1}^{n} x_{ij} Y_j - X_i \mu'}{\sum_{j=1}^{X_i} Y_j - X_i \mu'}

where
C(S_i) = the caution index for the i-th student
\mu' = the average number of students getting the problems correct
n = the number of items

The arrangement or ordering of the students in the matrix would affect only the calculation of the item caution indices.

APPENDIX B
THE TEST INSTRUMENT

... → dominant plants → competition
Competition → colonization → succession → dominant plants
Dominant plants → colonization → succession → competition
Competition → succession → dominant plants → colonization

18. How does the body react when a large quantity of water is drunk all at once by a person?
A. He excretes a large quantity of dilute urine.
B. He excretes a
small amount of dilute urine.
C. He excretes a normal quantity of dilute urine.
D. He excretes a normal quantity of concentrated urine.
E. He excretes a small quantity of concentrated urine.

19. Samples of blood from separate arteries were analyzed for their oxygen content. The results are shown in Table 1.

Artery    Oxygen content (cm³/1000 cm³ blood)
...       10.6
...       18.0
...       18.2
...       18.5
...       19.0

Table 1

[The remainder of question 19 and question 20 are illegible in the source.]

[Figure 3: stages of natural reproduction — sperm and ovum form a zygote, which develops into an embryo and then the organism; processes I, II, III and IV are marked.]

21. Figure 3 shows the various stages of natural reproduction. What are the processes that occur at stages I, II, III and IV?

     Stage I    Stage II        Stage III        Stage IV
A    Meiosis    Mitosis         Meiosis          Fertilization
B    Meiosis    Meiosis         Fertilization    Mitosis
C    Mitosis    Fertilization   Meiosis          Meiosis
D    Mitosis    Meiosis         Fertilization    Mitosis
E    Mitosis    Mitosis         Fertilization    Meiosis

22. A patient with diabetes (Diabetes mellitus) is usually treated with injections of insulin because insulin
A. stimulates the production of antibodies
B. stimulates glucose to change to glycogen
C. increases the oxidation of glucose in the intestines
D. stimulates the absorption of glucose in the intestines
E. reduces carbohydrate metabolism

[Figure 4: a coleoptile lit from one side, with the illuminated portion and the dark portion labelled and the direction of light shown.]

23. Figure 4 shows the reaction of a coleoptile towards the stimulus of light. Which of the following statements explains why the coleoptile bends?
A. Light stimulates greater production of auxin in the coleoptile.
B. More auxin accumulates in the illuminated portion of the coleoptile.
C. Auxin is degraded in the dark portion of the coleoptile.
D. The cells on the illuminated portion of the coleoptile stop growing.
E. The cells on the dark portion of the coleoptile elongate faster.

24. Which of the following shows an epiphytic relationship?
A. Mucor growing on a piece of bread
B. A mushroom growing on a wooden branch
C. Bacteria living in the root nodules of a legume
D. Moss growing on the bark of a tree
E. Mistletoe growing on the branch of a tree

25. Orchard growers know that leaf bugs destroy many of their trees. Which of the following is the most suitable biological control method to eliminate this pest?
A. Release ladybird bugs in the fruit orchards
B. Cut off the branches that are infected
C. Spray insecticide on the fruit trees
D. Spray fungicide on the fruit trees
E. Release snakes on the fruit trees

Section II

Directions: For each question below, one or more of the statements are correct. Determine whether each of the statements is true or false. Then choose
A. if I, II and III only are correct
B. if I and III only are correct
C. if II and IV only are correct
D. if IV only is correct
E. if all of I, II, III and IV are correct

Directions summarized:

A             B         C         D       E
I, II, III    I, III    II, IV    IV      I, II, III, IV
only          only      only      only    (all four)

26. Which of the following can lower a person's body temperature when the surrounding conditions are hot?
I   The relaxation of the retractor muscle of the hair
II  Vasodilation of the skin's blood vessels
III An increase in the respiration rate
IV  An increase in the metabolic rate

27. The outer surface of an animal's respiratory organ is thin in order to
I   enable active transport of gases to occur
II  increase the surface area for gaseous exchange
III reduce the formation of carbon dioxide
IV  facilitate the movement of gases through this layer

28.
The features of the small intestine that enable the absorption of digested food substances include
I   a large surface area
II  a damp surface
III possessing thin-walled villi
IV  possessing layers of muscle

29. A number of food tests were carried out on a sample of food. The observations of the tests are shown in Table 2.

Test                                 Observation
Mixed with iodine solution           Yellow solution
Mixed with DCPIP solution            Blue color disappears
Heated with Millon's solution        Brick-red precipitate
Heated with Benedict's solution      Blue solution

Table 2

The observations show that the food sample contains
I   protein
II  reducing sugar
III vitamin C
IV  starch

30. Which of the following features is (are) true for both hormones and enzymes?
I   Performs specific reactions
II  Controlled by temperature
III Required in small quantities for reactions
IV  Produced by glands with ducts

31. The rate of photosynthesis is controlled by
I   the total number of leaves
II  the total amount of light received by the leaves
III the size of the stoma
IV  the amount of oxygen in the air

32. The capacity of a sample of soil to retain water depends on
I   the amount of humus in the soil
II  the size of the soil particles
III the amount of air spaces in the soil
IV  the quantity of clay in the soil

33. The importance of transpiration in plants is to
I   assist the movement of water
II  reduce the temperature of the plant
III assist in the absorption of mineral salts
IV  prevent the wilting of leaves

34. Among the following features, which assist in the wind dispersal of fruits?
I   A mesocarp layer that is hollow or fibrous
II  Wing-like extensions from the growth of the pericarp
III An endocarp that is succulent and sweet
IV  A tuft of hair from the remains of the pistil

35. Which of the following statements is (are) true of vaccines?
I   One example of a vaccine is BCG
II  Vaccines are made from pathogens that have lost their virulence
III Artificial active immunization is attained through vaccines
IV  When a vaccine is injected, it will kill the pathogen

36. Which of the following occurs during succession in the area of an abandoned tin mine?
I   The chemical features of the soil change
II  The species of plants change
III The species of animals change
IV  The organic substances in the soil change

37. When a person donates blood, the doctor removes the donor's blood from a vein instead of an artery because
I   the blood pressure in the vein is lower
II  veins are found closer to the skin's surface
III the wall of the vein is thinner
IV  the flow of blood in the vein is slower

[Figure 5: average length of the roots (mm) of species X and species Y, planted separately and planted together in X:Y ratios of 1:1, 3:1 and 1:3.]

38. Figure 5 shows the length of the roots of two species of plants, X and Y, when they were planted separately and when they were planted together in different ratios, under the same conditions. What conclusion(s) can be drawn from the results that were obtained?
I   The growth of the roots of species X is influenced by the number of species Y plants present.
II  The growth of the roots of species Y is reduced when the number of species X plants increases.
III The growth of the roots of species X is reduced because of competition with the roots of species Y.
IV  The growth of the roots of species Y increases when planted together with species X.

39. When a person sees a red flower from a distance, what changes occur in his eyes?
I   The cone cells are stimulated.
II  The focal length of the eye lenses increases.
III The ciliary muscles relax.
IV  The eye lenses become thicker.

40. X is a green plant that carries out photosynthesis. Y and Z are two types of plant that live on the outer surface of X. Y decomposes the remains of the bark of X, and this supplies mineral salts to Z. Z is able to photosynthesize. Y and Z are always found living together and receive mutual benefit from one another. Which of the following statements is (are) true regarding the living relationships and nutritional habits of the plants?
I   X is an autotroph
II  Y is a saprophyte
III Y and Z live symbiotically
IV  Z is an epiphyte

APPENDIX C
THE TEACHER'S QUESTIONNAIRE

Please fill in the appropriate information or circle the appropriate answer. Thank you for your cooperation.

1. Name of school: ......................................
2. Sex: M / F
3. Academic qualifications: ............... Year: ..........
4. Professional qualifications: ........... Year: ..........
5. Number of years teaching Biology at SPM level: .........
6. What Biology classes were taught by you in 1991/1992? ......
7. Were the classes taught in 1992 a follow-up from 1991? Yes / No
8. What is the name of the textbook used for instruction? ......
9. Did you use any supplementary texts? Yes / No
   If "Yes", what were they? ..............................
10. Did you teach the class(es) of 1991/92 the topics in the same sequence adopted by the textbook used? Yes / No
    If "No", what was the sequence of topics adopted?
    In 1991: ..............................................
    In 1992: ..............................................
11. Did you complete the syllabus? Yes / No
    If "No", what topics did you leave out? ................
APPENDIX D
DESCRIPTIVE STATISTICS OF THE ITEMS CHOSEN FOR THE TEST INSTRUMENT

Item  Key  p     MCS     pbis   Year   #*   Topic**
1     A    .685  13.970  .357   1986   1    1
2     B    .635  13.244  .080   1988   1    1
3     C    .721  13.897  .360   1986   3    2
4     B    .844  13.512  .298   1989   3    2
5     A    .845  13.457  .266   1987   5    3
6     C    .595  14.635  .400   1988   4    4
7     C    .447  14.937  .435   1990   11   4
8     D    .735  14.096  .456   1987   9    5
9     B    .727  13.952  .388   1988   9    5
10    B    .774  13.986  .455   1988   6    6
11    A    .660  14.518  .528   1986   11   6
12    C    .857  13.461  .282   1987   17   7
13    B    .594  13.693  .209   1990   12   7
14    B    .411  14.353  .282   1989   14   8
15    C    .619  13.630  .200   1990   14   8
16    D    .810  13.699  .359   1990   17   9
17    A    .704  13.879  .338   1989   13   9
18    A    .763  13.675  .301   1986   18   10
19    E    .675  14.305  .470   1989   18   10
20    B    .659  14.422  .493   1987   21   11
21    B    .669  14.283  .455   1989   20   11
22    B    .701  14.011  .386   1987   24   12
23    E    .439  15.223  .491   1990   22   12
24    D    .518  14.837  .476   1988   23   13
25    A    .887  13.422  .294   1989   25   13
26    A    .501  14.556  .390   1988   26   1
27    D    .325  14.872  .325   1988   28   2
28    A    .588  13.782  .233   1988   29   3
29    B    .712  14.303  .511   1990   29   3&9
30    A    .268  13.728  .109   1990   39   3
31    A    .642  14.050  .351   1988   31   4
32    E    .773  13.378  .173   1986   31   5
33    A    .431  15.012  .438   1989   33   6
34    C    .872  13.504  .329   1986   34   7
35    A    .647  14.448  .490   1990   34   8
36    E    .609  14.116  .348   1987   34   9
37    E    .388  14.559  .309   1989   37   10
38    E    .388  14.559  .309   1989   36   11
39    A    .663  14.097  .384   1989   38   12
40    E    .631  14.146  .374   1989   39   13

* This refers to the question number as it appeared in the examination of that year.
** The topic number refers to the topics outlined in the table of specifications (see Appendix A).

APPENDIX E
DENDROGRAM OF THE AGGLOMERATIVE HIERARCHICAL CLUSTER ANALYSIS OF THE TEST ITEMS USING THE COMPLETE LINKAGE METHOD

[Dendrogram: rescaled distance cluster combine (0 to 25) for the 40 test items; the figure itself is not legible in the source.]

APPENDIX F
THE ITEM CAUTION INDICES OF THE 40 ITEMS OF THE TEST INSTRUMENT AS DERIVED UNDER VARIOUS SAMPLES

Item     Whole     School setting     Student sex       Student SES
number   sample    Urban    Rural     Male     Female   Low      M/H
1        .74       .77      .76       .85      .62      .77      .74
2       1.00       .91     1.03      1.05      .99     1.01      .98
3        .55       .57      .55       .59      .50      .50      .58
4        .47       .47      .52       .67      .41      .50      .45
5        .59       .53      .68       .41      .75      .66      .52
6        .37       .52      .32       .47      .30      .29      .50
7        .44       .46      .50       .46      .42      .39      .50
8        .32       .26      .41       .28      .37      .30      .39
9        .45       .44      .41       .42      .49      .39      .50
10       .45       .57      .35       .39      .49      .31      .67
11       .45       .38      .55       .48      .39      .36      .58
12       .57       .55      .65       .67      .52      .57      .61
13       .74       .67      .77       .80      .67      .81      .66
14       .54       .65      .46       .62      .60      .49      .59
15       .88       .69      .99       .79      .94      .99      .73
16       .48       .52      .48       .47      .48      .40      .57
17       .66       .61      .64       .63      .68      .60      .69
18       .56       .68      .55       .73      .36      .48      .64
19       .41       .53      .36       .39      .40      .39      .47
20       .59       .57      .61       .57      .61      .57      .63
21       .66       .67      .65       .61      .68      .59      .76
22       .69       .66      .77       .62      .76      .76      .64
23       .65       .70      .57       .76      .52      .48      .83
24       .50       .51      .48       .61      .36      .52      .47
25       .69       .69      .45       .79      .63      .74      .60
26       .60       .64      .64       .77      .46      .63      .59
27       .65       .67      .58       .72      .57      .55      .76
28       .79       .79      .84       .69      .90      .80      .78
29       .48       .56      .41       .46      .49      .38      .59
30       .91       .82      .99       .87      .94      .98      .81
31       .53       .59      .47       .58      .49      .48      .55
32       .68       .65      .71       .83      .53      .61      .71
33       .49       .51      .52       .47      .54      .52      .47
34       .52       .55      .55       .47      .56      .48      .57
35       .41       .49      .33       .45      .38      .42      .38
36       .65       .76      .53       .63      .66      .59      .77
37       .62       .74      .55       .65      .60      .52      .70
38       .71       .66      .79       .61      .88      .70      .74
39       .60       .61      .53       .58      .64      .63      .57
40       .54       .50      .57       .53      .50      .61      .45
n        354       192      162       198      156      173      180

APPENDIX F (CONT'D)
THE ITEM CAUTION INDICES OF THE 40 ITEMS OF THE TEST INSTRUMENT AS DERIVED UNDER VARIOUS SAMPLES

Item     Whole     Teachers' experience             Test
number   sample    0-5 yrs   6-10 yrs   >10 yrs    format*
1        .74       .62       .69        .79        .66
2       1.00      1.11      1.03        .95       1.01
3        .55       .48       .46        .64        .51
4        .47       .32       .43        .52        .51
5        .59       .83       .66        .57        .65
6        .37       .54       .40        .34        .41
7        .44       .54       .51        .36        .42
8        .32       .19       .29        .34        .33
9        .45       .45       .44        .46        .45
10       .45       .77       .65        .28        .40
11       .45       .37       .41        .48        .41
12       .57       .65       .60        .55        .53
13       .74       .42       .61        .87        .68
14       .54       .59       .50        .58        .49
15       .88       .70       .90        .86        .80
16       .48       .58       .53        .44        .45
17       .66       .51       .59        .72        .58
18       .56       .63       .60        .52        .59
19       .41       .49       .42        .40        .40
20       .59       .74       .63        .54        .55
21       .66       .69       .70        .63        .64
22       .69       .56       .61        .75        .62
23       .65       .49       .57        .74        .60
24       .50       .54       .42        .58        .49
25       .69       .63       .64        .73        .69
26       .60       .72       .68        .53        .53
27       .65       .71       .60        .71        .59
28       .79       .84       .73        .84        .68
29       .48       .52       .49        .46        .43
30       .91       .70       .73       1.15        .78
31       .53       .61       .48        .57        .47
32       .68       .69       .72        .64        .56
33       .49       .56       .48        .51        .42
34       .52       .63       .56        .46        .44
35       .41       .45       .40        .42        .35
36       .65       .66       .61        .69        .51
37       .62       .59       .49        .75        .52
38       .71       .76       .73        .69        .64
39       .60       .46       .45        .76        .54
40       .54       .53       .54        .53        .47
n        354       198       156        173        354

* Combined values for the first (Section I) and second (Section II) test formats.

BIBLIOGRAPHY

Airasian, P. W. (1971). A method for validating sequential instructional hierarchies. Educational Technology, 11(12), 54-56.

Airasian, P. W., & Bart, W. M. (1973). Ordering theory: A new and useful measurement model. Educational Technology, 13(5), 56-60.

Airasian, P. W., & Bart, W. M. (1975). Validating a priori instructional hierarchies. Journal of Educational Measurement, 12(3), 163-173.

Airasian, P. W., Bart, W. M., & Greaney, B. J. (1975). The analysis of a propositional logic game by ordering theory. Child Study Journal, 5(1), 13-24.

Airasian, P. W., & Madaus, G. F. (1983). Linking testing and instruction: Policy issues. Journal of Educational Measurement, 20(2), 103-118.

Baker, E. L., & Herman, J. L. (1983). Task structure design: Beyond linkage. Journal of Educational Measurement, 20, 149-164.

Bart, W. M. (1974). Test validity and reliability from an ordering-theoretic framework. Educational Technology, 14(1), 62-63.

Bart, W. M. (1978). An empirical inquiry into the relationship between test factor structure and test hierarchical structure. Applied Psychological Measurement, 2(3), 331-335.

Bart, W. M., & Airasian, P. W. (1974). Determination of the ordering among seven Piagetian tasks by an ordering-theoretic method. Journal of Educational Psychology, 66, 277-284.

Bart, W. M., Frey, S., & Baxter, J. (1979). Generalizability of the ordering among five formal reasoning tasks by an ordering-theoretic method. Child Study Journal, 9, 251-259.

Bart, W. M., & Mertens, D. M. (1979). The hierarchical structure of formal operational tasks. Applied Psychological Measurement, 3(3), 343-350.

Bart, W. M., & Krus, D. J. (1973). An ordering-theoretic method to determine hierarchies among items. Educational and Psychological Measurement, 33, 291-300.

Bejar, I. (1984). Educational diagnostic assessment. Journal of Educational Measurement, 21(2), 175-189.

Birenbaum, M., & Shaw, D. J. (1985). Task specification chart: A key to a better understanding of test results. Journal of Educational Measurement, 22(3), 219-230.

Birenbaum, M., & Tatsuoka, K. K. (1982). On dimensionality of achievement test data. Journal of Educational Measurement, 19(4), 259-266.

Blixt, S. L., & Dinero, T. E. (1985). An initial look at the validity of diagnoses based on Sato's caution index. Educational and Psychological Measurement, 45, 293-299.

Bock, R.
D., Gibbons, R., & Muraki, E. (1988). Full-information item factor analysis. Applied Psychological Measurement, 12, 261-280.

Brown, J. S., & Burton, R. R. (1978). Diagnostic models for procedural bugs in basic mathematical skills. Cognitive Science, 2, 155-192.

Clark, C. M., & Peterson, P. L. (1986). Teachers' thought processes. In M. C. Wittrock (Ed.), Handbook of research on teaching (3rd ed., pp. 225-296). New York: Macmillan.

Donlon, T. F., & Fischer, F. E. (1968). An index of an individual's agreement with group-determined item difficulties. Educational and Psychological Measurement, 28, 105-113.

Drasgow, F. (1978). ... appropriateness of aptitude test scores. Unpublished doctoral dissertation, University of Illinois, Urbana-Champaign.

Druva, C. A. (1985, April). A comparison of ... procedures. Paper presented at the American Educational Research Association National Convention, Chicago, Illinois. (ERIC Document Reproduction Service No. ED 275 764)

Ebel, R. L. (1972). Essentials of educational measurement (3rd ed.). New Jersey: Prentice-Hall.

Egan, O., & Archer, P. (1985). The accuracy of teachers' ratings of ability: A regression model. American Educational Research Journal, 22, 25-34.

Floden, R. E., Porter, A. C., Schmidt, W. H., & Freeman, D. J. (1980). Don't they all measure the same thing: Consequences of standardized test selection. In E. L. Baker & E. S. Quellmalz (Eds.), Educational testing and evaluation (pp. 109-120). Beverly Hills, CA: SAGE Publications.

Fuchs, L. S., & Fuchs, D. (1986). Linking assessment to instructional intervention: An overview. School Psychology Review, 15(3), 318-323.

Fujita, H., Satoh, T., & Nagaoka, K. (1977). Graphical analysis of test scores using an S-P table. Educational Technology Research, 1, 21-31.

Gagné, R. M. (1985). The conditions of learning and theory of instruction (4th ed.). New York: Holt, Rinehart & Winston.

Green, S. B. (1983). Identifiability of spurious factors using linear factor analysis with binary items. Applied Psychological Measurement, 7(2), 139-147.

Guttman, L. A. (1950). A basis for scalogram analysis. In S. A. Stouffer et al. (Eds.), Studies in social psychology in World War II: Vol. 4. Measurement and prediction (pp. 60-90). Princeton, NJ: Princeton University Press.

Harnisch, D. L. (1983). Item response patterns: Applications for educational practice. Journal of Educational Measurement, 20(2), 191-206.

Harnisch, D. L., & Linn, R. L. (1981). An analysis of item response patterns: Questionable test data and dissimilar curriculum practices. Journal of Educational Measurement, 18(3), 133-146.

Harnisch, D. L., & Romy, N. (1985). User's guide for the Student Problem Package (SPP) on the IBM-PC. University of Illinois at Urbana-Champaign, Office of Educational Testing, Research and Service, Champaign, Illinois.

Hoge, R. D., & Coladarci, T. (1989). Teacher-based judgments of academic achievement: A review of literature. Review of Educational Research, 59(3), 297-313.

Jaeger, R. M. (1988). Use and effect of caution indices in detecting aberrant patterns of standard-setting judgments. Applied Measurement in Education, 1(1), 17-31.

Kane, M. T., & Brennan, R. L. (1980). Agreement coefficients as indices of dependability for domain-referenced tests. Applied Psychological Measurement, 4(1), 105-126.

Krus, D. J. (1974). A computer program for deterministic and probabilistic models of order analysis. Educational and Psychological Measurement, 34, 677-683.

Krus, D. J. (1975). Order analysis of binary data matrices. Los Angeles: Theta Press.

Krus, D. J. (1977).
Order analysis: An inferential model of dimensional analysis and scaling. Educational and Psychological Measurement, 37, 587-601.

Krus, D. J. (1978). Logical basis of dimensionality. Applied Psychological Measurement, 2(3), 321-329.

Krus, D. J., & Bart, W. M. (1974). An ordering theoretic method of multidimensional scaling of items. Educational and Psychological Measurement, 34, 525-535.

Krus, D. J., Bart, W. M., & Airasian, P. W. (1975). Ordering theory and methods. Los Angeles: Theta Press.

Krus, D. J., & Krus, P. H. (1980). Dimensionality of hierarchical and proximal data structures. Applied Psychological Measurement, 4(3), 313-321.

Krus, D. J., & Weiss, D. J. (1976). Empirical comparison of factor and order analysis on prestructured and random data. Multivariate Behavioral Research, 11, 95-104.

Leinhardt, G., & Seewald, A. M. (1981). Student-level observation of beginning reading. Journal of Educational Measurement, 18(3), 171-178.

Levine, M. V., & Rubin, D. B. (1979). Measuring the appropriateness of multiple choice test scores. Journal of Educational Statistics, 4, 269-290.

Linn, R. L. (1983). Testing and instruction: Links and distinctions. Journal of Educational Measurement, 20, 180-189.

Linn, R. L. (1990). Essentials of student assessment: From accountability to instructional aid. Teachers College Record, 91(3), 422-436.

McArthur, D. L. (Ed.). (1987). Alternative approaches to the assessment of achievement. Boston: Kluwer Academic Publishers.

Mehrens, W. A., & Lehmann, I. J. (1984). Measurement and evaluation in education and psychology (3rd ed.). New York: Holt, Rinehart, and Winston.

Mehrens, W. A., & Lehmann, I. J. (1987). Using standardized tests in education (4th ed.). New York: Longman.

Mehrens, W. A., & Lehmann, I. J. (1991). Measurement and evaluation in education and psychology (4th ed.). New York: Holt, Rinehart, and Winston.

Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.

Piazza, N. J., & Wise, S. L. (1988). An order-theoretic analysis of Jellinek's disease model of alcoholism. The International Journal of the Addictions, 23(4), 387-397.

Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4(3), 207-230.

Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9, 401-412.

Resnick, L. B. (1976). Task analysis in instructional design: Some cases from mathematics. In D. Klahr (Ed.), Cognition and instruction. Hillsdale, NJ: Lawrence Erlbaum Associates.

Reynolds, T. J. (1981). ERGO: A new approach to multidimensional item analysis. Educational and Psychological Measurement, 41, 643-659.

Sato, T. (1980). The S-P chart and the caution index. C&C Systems Research Laboratories, Nippon Electric Co., Ltd.

Sato, T. (1990). The S-P chart analysis. In D. L. Harnisch & M. L. Connell (Translation Eds.), An introduction to educational information technology (pp. 159-175). Japan: NEC Technical College.

Sato, T., & Kurata, M. (1977). Basic S-P score table characteristics. NEC Research and Development, 11, 64-71.

Schmidt, W. H. (1983). Content biases in achievement tests. Journal of Educational Measurement, 20, 165-177.

Shishido, J. A., Ayabe, H. I., & Heim, M. (1986). Relationship between caution indices and student demographic data in a Japanese language placement examination situation. In Kim Chul-hwan & Lee Wha-kuk (Eds.), Proceedings of the '88 PRAHE Seoul Conference. (ERIC Document Reproduction Service No.
ED 311 753)

Switzer, D. M., & Connell, M. L. (1990). Practical applications of student response analysis. Educational Measurement: Issues and Practice, 9(2), 15-18.

Tatsuoka, K. K. (1984). Caution indices based on item response theory. Psychometrika, 49(1), 95-110.

Tatsuoka, M. M. (1978). Recent psychometric developments in Japan: Engineers grapple with educational measurement problems. Paper presented at the ONR Contractors Meeting on Individualized Measurement, Columbia, MO.

Tatsuoka, K. K., & Baillie, R. (1982). SIGNBUG: An error diagnostic computer program for signed-number arithmetic on the PLATO system. Urbana-Champaign, Illinois: University of Illinois Computer-based Education Research Laboratory.

Tatsuoka, K. K., & Linn, R. L. (1983). Indices for detecting unusual patterns: Links between two general approaches and potential applications. Applied Psychological Measurement, 7(1), 81-96.

Tatsuoka, K. K., & Tatsuoka, M. M. (1982). Detection of aberrant response patterns and their effect on dimensionality. Journal of Educational Statistics, 7(3), 215-231.

Tatsuoka, K. K., & Tatsuoka, M. M. (1983). Spotting erroneous rules of operation by the individual consistency index. Journal of Educational Measurement, 20(3), 221-230.

Tomsic, M. L. (1987, April). The effect of poor fitting ... on the distributions of ... caution indices. Paper presented at the annual meeting of the AERA, Washington, DC.

van der Flier, H. (1982). Deviant response patterns and comparability of test scores. Journal of Cross-Cultural Psychology, 13(3), 267-298.

Wise, S. L. (1981). A modified order-analysis procedure for determining unidimensional item sets. Unpublished doctoral dissertation, University of Illinois, Urbana-Champaign.

Wise, S. L. (1983). Comparisons of order analysis and factor analysis in assessing the dimensionality of binary data. Applied Psychological Measurement, 7(3), 311-321.

Wise, S. L. (1986). The use of ordering theory in the measurement of student development. Journal of College Student Personnel, 27(5), 442-447.

Wise, S. L., & Tatsuoka, M. M. (1986). Assessing the dimensionality of dichotomous data using modified order analysis. Educational and Psychological Measurement, 46, 295-301.

Wright, B. D. (1979). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 15, 97-116.

Zimowski, M. F., & Bock, R. D. (1987). Full-information item factor analysis of test items from the ASVAB CAT pool. Chicago: National Opinion Research Center.