This is to certify that the thesis entitled

A Study of the Qualifications of the Teacher as an Evaluator in Bangkok, Thailand

presented by

Tuanjai Sethtasakko

has been accepted towards fulfillment of the requirements for the Ph.D. degree in Education: Measurement, Evaluation & Research Design

Major professor

Date October 14, 1980

A STUDY OF THE QUALIFICATIONS OF THE TEACHER AS AN EVALUATOR IN BANGKOK, THAILAND

By

Tuanjai Sethtasakko

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, and Educational Psychology

1980

ABSTRACT

A STUDY OF THE QUALIFICATIONS OF THE TEACHER AS AN EVALUATOR IN BANGKOK, THAILAND

By

Tuanjai Sethtasakko

This study was aimed at providing data concerning the quality of the teacher as an evaluator to the administrators and educators in Thailand.
It was the purpose of the study to find out which groups of teachers actually need in-service training and in which areas of measurement the need is the greatest. This study also yields some follow-up information on the effects of the previous in-service programs in measurement and the effects of a measurement course offered by the teacher-training institutions.

The population of interest of this study was the public elementary and secondary school teachers in Bangkok who are under the Ministry of Education. The instrument used in the study consisted of a questionnaire concerning the teachers' opinions on national testing and their perceived needs in measurement, and a true-false test measuring basic knowledge in measurement corresponding to the four subject matters that the teachers should know. There were eight questionnaire items and fifty-two test items, thirteen items for each subscale. The instrument was sent to 540 Thai teachers who were randomly selected from 12 strata. The stratification was based on three variables: level of school (elementary school or secondary school); level of teacher education (teaching certificate holders or bachelor degree holders); and teaching experience (less than or equal to three years, four to ten years, or more than ten years).

The design of the study was a 2 x 2 x 3 factorial design with four measures. The four areas of basic knowledge in measurement and evaluation were: planning a classroom test; item writing; item analysis; and test score statistics and marking system. The design was crossed and balanced with 30 observations per cell. The multivariate repeated measures analysis was employed to test the fifteen null hypotheses.

It was found that the subject matter by teacher education interaction was significant. The analyses showed that both certificate teachers and degree teachers got their lowest scores on the Planning a Classroom Test subscale.
The degree teachers got their highest scores on the Test Score Statistics and Marking System subscale, but the certificate teachers got their highest scores on the Item Analysis subscale. The degree teachers got higher scores than the certificate teachers on all subscales.

A significant level of school by teacher education interaction was also found. The total score means indicated that the degree teachers in secondary schools got higher scores than the degree teachers in elementary schools, but the mean of certificate teachers in secondary schools was slightly lower than the mean of certificate teachers in elementary schools. There was no significant difference between certificate elementary school teachers and certificate secondary school teachers, but a significant difference between degree elementary school teachers and degree secondary school teachers was found at the .05 level.

Further analyses were done to compare measurement needs among various groups of teachers, to observe the relationship between perceived needs and measurement needs, and to compare sample means with criterion scores. It was found that the teachers who took a college measurement course had less need for instruction on measurement than those who did not take a course. There were no significant differences in measurement needs between teachers who attended the training program and those who did not attend the program, or between teachers who favored and did not favor the national testing program. No relationship between perceived needs and measurement needs was found. There were significant differences between sample means and criterion scores both on the total mean and on the subscale means. The results showed that the teachers in Bangkok had measurement needs in all four subject matter areas.
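As a quick consistency check on the design described in the abstract, the crossed 2 x 2 x 3 layout can be enumerated directly. The following is only a minimal sketch: the constant and function names are hypothetical labels chosen for illustration, and the level names are taken from the abstract's wording.

```python
from itertools import product

# Factor levels as named in the abstract (labels are illustrative).
SCHOOL_LEVELS = ("elementary", "secondary")
TEACHER_EDUCATION = ("certificate", "degree")
EXPERIENCE = ("<=3 years", "4-10 years", ">10 years")

# The four repeated measures (subscales) of the true-false test.
SUBSCALES = (
    "planning a classroom test",
    "item writing",
    "item analysis",
    "test score statistics and marking system",
)

ITEMS_PER_SUBSCALE = 13
OBSERVATIONS_PER_CELL = 30


def design_cells():
    """Return every cell (stratum) of the crossed 2 x 2 x 3 design."""
    return list(product(SCHOOL_LEVELS, TEACHER_EDUCATION, EXPERIENCE))


cells = design_cells()
total_observations = len(cells) * OBSERVATIONS_PER_CELL
total_items = len(SUBSCALES) * ITEMS_PER_SUBSCALE

print(len(cells))          # number of cells in the design
print(total_observations)  # observations in the balanced analysis
print(total_items)         # items on the true-false test
```

Under these figures the design has 12 cells, matching the 12 strata, and 4 subscales of 13 items give the fifty-two test items reported. With 30 observations per cell, the balanced analysis would rest on 360 of the 540 teachers to whom the instrument was sent.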
ACKNOWLEDGEMENTS

I wish to express my sincere appreciation to those persons who have assisted so greatly in conducting this dissertation and doctoral study.

First, to the dissertation director and chairman of the doctoral committee, Dr. W. A. Mehrens, whose interest, assistance, and guidance were essential to the development and completion of the study.

Second, to the other members of the doctoral committee who made important contributions: to Drs. R. L. Ebel and S. Cherney for their help in the development and improvement of the study; to Dr. V. Scheifley for her help in analyzing the data and her moral support during the past few years.

Third, to Mr. Boonlue Tong-Yoo and Ms. Orathai Tong-Yoo for their help in distributing and collecting the questionnaire.

To the P.E.O. International Peace Scholarship Fund and the Sage Foundation Fund for some financial aid in the preparation of this dissertation.

Finally, to my parents, Mr. and Mrs. Kamol Sethtasakko, and my husband, Dr. Saichol Ketsa, for their support, encouragement and patience throughout the doctoral study.

TABLE OF CONTENTS

CHAPTER

I The Problem
    Introduction
    Need for the Study
    Purpose of the Study
    Statement of the Problem
    Limitation of the Study
    Definition of Terms
    Overview

II Review of the Literature
    The Role of Teacher-Made Tests
    The Role of the Teacher in Testing and Evaluation
    The Qualifications of the Teacher as an Evaluator
    Improving the Competence of Teachers in Measurement and Evaluation
    Summary

III Procedures
    Population
    Sampling Procedure
    Instrument
    Data Collection
    Design
    Analysis
    Summary

IV Analyses and Results
    Descriptive Data
    Repeated Measures Analysis on Measurement Scores
    Additional Analyses
    Summary

V Summary and Conclusions
    Summary
    Conclusions and Implications
    Recommendations for Further Study

APPENDIX

A The Instrument (First Edition)
B The Instrument (Final Edition)

BIBLIOGRAPHY

LIST OF TABLES

TABLE

3.1 Total Number of Schools in Bangkok Classified by Location, and Level of School
3.2 Total Number of Teachers in 33 Elementary Schools and 92 Secondary Schools in Bangkok Classified by Level of School and Level of Teacher Education
3.3 Number of Elementary and Secondary Schools in Each Region Used in the Study
3.4 Total Number of Teachers in 30 Elementary Schools and 30 Secondary Schools in Bangkok Classified by Level of School, Level of Teacher Education, and Years of Teaching Experience
3.5 Total Number of Returned Responses Classified by Level of School, Teacher Education, and Years of Teaching Experience
3.6 Design of the Three Independent Variables: Level of School, Level of Teacher Education, and Teaching Experience
4.1 Means and Standard Deviations of Four Subscales and Total Scores
4.2 Multivariate Repeated Measures Analysis on Basic Measurement Scores
4.3 Univariate Analysis of Variance on Subscale Scores of Certificate Teachers and Degree Teachers
4.4 Univariate Analysis of Variance on Subscale Scores of Certificate and Degree Secondary School Teachers
4.5 Analysis of Variance on Subscale Scores and Total Score of Teachers Who Took a College Measurement Course and Those Who Did Not Take a Course
4.6 Analysis of Variance on Subscale Scores and Total Score of Degree Teachers Who Took a College Measurement Course and Those Who Did Not Take a Course
4.7 Analysis of Variance on Subscale Scores and Total Score of Secondary School Teachers Who Took a College Measurement Course and Those Who Did Not Take a Course
4.8 Presentation of Cell Means for All Four Subject Matter Areas of Certificate and Degree Teachers Who Were Classified into Four Different Groups According to Area of Measurement They Thought They Knew Most
4.9 Presentation of Cell Means for All Four Subject Matter Areas of Certificate and Degree Teachers Who Were Classified into Four Different Groups According to Area of Measurement They Thought They Knew Least
4.10 Comparison Between Sample Means and Criterion Scores

LIST OF FIGURES

FIGURE

1 Graph Presentation of Cell Means for Subject Matter and Teacher Education
2 Graph Presentation of Cell Means for Level of School and Teacher Education
3 Graph Presentation of Cell Means for Subject Matter and Level of Education of Elementary School Teachers
4 Graph Presentation of Cell Means for Subject Matter and Level of Education of Secondary School Teachers

CHAPTER I

THE PROBLEM

Introduction

From the earliest beginnings of society, people have measured the abilities of other people and have recognized the existence of differences in the abilities possessed by different individuals. Impressions of people often develop from unsystematic observations. Often these impressions are totally incorrect or unfair. The drawing of conclusions about students from classroom examination results is analogous to the drawing of everyday life impressions from social incidents.
Here too, conclusions drawn are sometimes radically incorrect or unfair to the student who has taken the examination.

What causes these misinterpretations? While the interpreter is sometimes to blame, the major cause is usually the test: it does not properly measure the ability being judged; it does not adequately sample the individual's behavior; it does not offer an accurate or consistent measure of the trait; or it is not sensitive enough to measure small gradations in ability.

Tests and testing processes are more closely related to everyday activity than is commonly realized. The situational conditions which lead to accurate impressions of human behavior are much the same as the characteristics of a classroom test which lead to an accurate evaluation of student achievement. In fact, such conditions are essential to effective measurement throughout the entire field of evaluation.

Tests have little or no value in their own right. They are good or bad primarily in terms of how they are used to affect the learner. Tests can improve the effectiveness of instructional decisions by providing more objective information on which to base judgments. The use of tests can have an immediate and direct effect on the learning of students. Tests have become an integral part of our lives, and many decisions are made on the basis of a student's score. It is essential to recognize that test results may play an important role in students' lives.

If questions are well-chosen and tests well-constructed, they can help students learn how to organize, analyze, and judge ideas and concepts; how to sort relevant details from irrelevant ones; and how to think critically about the possible relationships in the materials they have studied. Students who are given varied and challenging tests receive, in effect, valuable learning experiences.
The tests are also likely to enable them to generalize about the importance of a course or even the value of education itself. Thus, a teacher's test is a powerful contributor to what students learn and how they learn it.

Need for the Study

Teachers have an obligation to provide their students with the best instruction possible. This implies that they must have some procedures whereby they can reliably and validly evaluate how effectively their students have been taught. The need to employ all applicable techniques of appraisal in the schools is a practical matter faced daily by teachers.

One of the major responsibilities given to the teacher is the difficult, but necessary, task of assigning grades. Considering the importance of grades to the students, one must be as certain as he can be that this is done wisely. Grades should reflect both achievement and quality of performance. A high or low grade may be a determining factor in finding a job, being permitted to sit for university entrance examinations, or establishing individual interest in certain careers and vocations. Examinations help to determine not only the degree of achievement but also the individual's achievement relative to others in his class. For all these reasons, evaluative instruments must be as reliable and valid as they can be, because they provide the basic measurements from which the final grade is made.

Teachers must become proficient in constructing classroom tests because they occupy the central role in the evaluation process. Unfortunately, very often teachers have not been given adequate preparation in this area of competence. There is evidence that the problem of preparing teachers in measurement and evaluation is real and substantial. Many teachers do not seem to be adequately prepared in this respect (Goslin, 1967; Roeder, 1972).

The quality of teacher-made tests in Thailand is critical for two important reasons.
First, if a student fails the final examination at the end of the school year, he has to remain in the same grade for another year, with an unfavorable attitude and with the label "repeater." Second, teacher-made tests are used not only for the final examinations but also for the entrance examinations. Since there are limited seats for students at the secondary school and university levels, entrance examinations play a very important role in selecting students. Students who want to get into secondary schools or universities must obtain high percentile ranks on the entrance examinations for admission. Thus, the need for improving the competence of teachers in measurement and evaluation is urgent.

It is recognized that the instructional value of the teacher's role in testing and evaluation in elementary and secondary schools in Thailand has received little attention. If the advantages of objective testing are to be fully realized, it seems clear that the teacher training institutions will have to pay increased attention to the problems of test construction as they apply to classroom teachers and provide future teachers with the kinds of skills they will need to do an adequate job of constructing tests. The teacher training institutions might perform a particularly useful role in helping the teachers by developing practical courses in test construction.

There is little doubt that examination and evaluation have become an integral part of our academic life. Although crucial decisions are often made on the basis of students' test scores, little if any thought is given to the qualifications or skill of the teacher as an evaluator.
Purpose of the Study

The purpose of this study was to identify the areas of instruction in measurement which are needed most by the elementary and secondary school teachers in Bangkok, Thailand, and also to compare the basic knowledge in measurement and evaluation of teachers grouped according to their amount of teaching experience, their level of education, and the level of school at which they teach. Measurement needs in four fundamental areas of measurement and evaluation are measured: planning a classroom test, item writing, item analysis, and test score statistics and marking system. Secondly, the study tested the relationship between measurement needs and perceived needs of teachers in Bangkok. It was also aimed at providing data and information concerning the quality of the Thai teacher as an evaluator to the administrators and educators in teacher training institutions in Thailand.

Statement of the Problem

The research described here was an investigation of the following questions:

1. Is there a difference in measurement needs between elementary school teachers and secondary school teachers in Bangkok?
2. Do teachers who have higher levels of education have fewer measurement needs than those who have lower levels of education?
3. Is there any difference in measurement needs between teachers who have more teaching experience and those who have less teaching experience?
4. Is there any difference in measurement needs among the four subject matters covered in the study?

The results of this study will provide information about the qualifications of the teacher as an evaluator to the administrators and educators in Thailand. As a result of this study, one should be able to conclude: a) which group of teachers in Bangkok has the most urgent need for in-service programs in measurement and evaluation, and b) what area of in-service training is most needed.
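Conclusions (a) and (b) rest on comparing group means with a criterion score, which the Definition of Terms describes as the point midway between the maximum possible score and the expected chance score. The sketch below works through that arithmetic for true-false items. The function names are hypothetical, the 13-item subscale value is simply the same definition applied to one subscale, and the group mean in the last line is an illustrative number, not study data.

```python
def criterion_score(n_items: int) -> float:
    """Ideal mean for a true-false test: the point midway between the
    maximum possible score and the expected chance score."""
    max_score = n_items
    chance_score = n_items / 2  # guessing on two-option items
    return (max_score + chance_score) / 2


def meets_criterion(group_mean: float, n_items: int) -> bool:
    """True when a group mean is at or above the criterion score."""
    return group_mean >= criterion_score(n_items)


print(criterion_score(52))        # 39.0, the worked example in the definition
print(criterion_score(13))        # 9.75, the same rule for one 13-item subscale
print(meets_criterion(35.0, 52))  # False; 35.0 is an illustrative mean only
```

A group whose mean falls below the criterion would, on this reading, be judged to have a measurement need in that area.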
Limitation of the Study

This study was based on a sample of elementary and secondary school teachers who have been teaching in public schools in Bangkok. The sample includes only the teachers who are under the Ministry of Education. Generalizations of the results of this study should be made only to teachers who are in the population of the study. Generalizations to other groups of teachers might be made only if the reader is willing to take responsibility for the validity of such generalizations.

Definition of Terms

Teacher is an elementary school teacher or secondary school teacher.

Elementary school teacher is a teacher who teaches in a public school in Bangkok which is under the Ministry of Education only; those who are under the Ministry of Interior are not included in this study.

Secondary school teacher is a teacher who teaches in a public school in Bangkok which is under the Ministry of Education.

Measurement need is defined as a lack of knowledge in a sub-area of tests and measurement, as disclosed by responses to groups of related test items that are designed to test desirable knowledge of measurement and evaluation.

Perceived need is defined as the feeling of lacking desirable knowledge in measurement and evaluation, as disclosed by responses to questionnaire items.

Experience in teaching is defined as the number of years a teacher has been teaching in school. The more years a teacher has been teaching in school, the more experience in teaching the teacher has. In this study, the teachers are classified into three groups: less than or equal to three years of teaching experience, four to ten years of teaching experience, and more than ten years of teaching experience.

Certificate teacher is a teacher who has studied at a teacher training institution for at least two years but not more than four years after grade 10, or its equivalent.
Degree teacher is a teacher who holds at least a bachelor's degree in education, or its equivalent.

Item difficulty is defined as the percent of people who give an incorrect answer to the item.

Criterion score is the ideal mean. It is a point midway between the maximum possible score and the expected chance score (for example, ideal mean of 52 true-false test items = 1/2(52 + 52/2) = 39). If the teachers have competence in measurement, the group mean should be higher than or equal to the ideal mean.

Overview

This study is reported in five chapters, followed by appendices. In Chapter I, the introduction, need for the study, purpose of the study, statement of the problem, limitations of the study, and definition of terms used in this study are presented.

In Chapter II, the literature relevant to the general problem and related areas is reviewed. The population, sampling procedure, instrumentation, design of the study, and methods of analysis are discussed in Chapter III.

Chapter IV contains research data and results of the study. The final chapter contains a summary of the study, the conclusions and implications, and recommendations for further study.

CHAPTER II

REVIEW OF THE LITERATURE

The intent of this chapter is to review studies that are related to the problem as described in Chapter I. The review is divided into four sections: the role of teacher-made tests; the role of the teacher in testing and evaluation; the qualifications of the teacher as an evaluator; and improving the competence of teachers in measurement and evaluation.

The Role of Teacher-Made Tests

Classroom teachers are constantly searching for ways to improve their service to students. In line with this objective, they must have some procedures whereby they can reliably and validly evaluate how effectively their students have been taught. The classroom achievement test is one such tool.

Classroom tests may vary in form according to individual teachers' preference.
Some teachers tend to favor the unstructured type in which the students must create the answers, as in the short answer or essay type. Other teachers prefer to use one or more of the structured or objective type tests in which the students select rather than create the answers. Still others use a combination of both unstructured and structured formats.

Most classroom tests must be prepared by the teacher who is teaching the class. While there are many standardized achievement tests available for broad areas of subject matter, few are specifically appropriate to the content and objectives of a unit of study. Teacher-made tests are better in the sense that they are more relevant to a teacher's particular objectives. As Mehrens and Lehmann (1978, p. 161) say, "Not only is the classroom teacher able to tailor the test to fit his particular objectives, but he can also make it fit the class and, if he wishes, fit the individual pupils. Commercially prepared tests, because they are prepared for use in many different school systems with many different curricular and instructional emphases, are unable to do these things as well as the teacher-made test."

Clearly, no standardized achievement test can completely serve the needs and purposes of every local situation (Noll, Scannell, and Craig, 1979, p. 148). Teachers usually feel that these tests do not adequately measure their own or the local objectives of instruction. Mehrens and Lehmann (1978, p. 161) have also pointed out that, "Commercially prepared achievement tests could be used to obtain some of the information needed by the teacher, and they could be used to motivate students. But, even in those schools that use commercial tests, it is unusual for such tests to be administered more than once a year. Also, the content of commercially prepared tests tends to lag, by a few years at least, recent curricular developments. Teacher-made tests are more likely to reflect today's curriculum.
This is especially true in subject-matter areas such as science and social studies, which change rather rapidly in contrast to composition or literature."

Good classroom tests provide an efficient means for determining pupil ability and achievement. Ebel (1979, pp. 22-23) states, "The major function of a classroom test is to measure student achievement and thus to contribute to the evaluation of educational progress and attainments. A second major function of classroom tests is to motivate and direct student learning. The experience of almost all students and teachers supports the view that students do tend to study harder when they expect an examination than when they do not and that they emphasize in study those things on which they expect to be tested. Classroom tests have other useful educational functions. Constructing them, if the job is approached carefully, should cause an instructor to think carefully about the goals of instruction in a course."

It can be seen that a classroom test can serve many purposes, but it cannot do so with equal effectiveness. Mehrens and Lehmann (1978, p. 170) state that classroom achievement tests serve a variety of purposes, such as:

1) judging the pupils' mastery of certain essential skills and knowledge,
2) measuring growth over time,
3) ranking pupils in terms of their achievement of particular instructional objectives,
4) diagnosing pupil difficulties,
5) evaluating the teacher's instructional method,
6) ascertaining the effectiveness of the curriculum,
7) encouraging good study habits such as frequent review, and
8) motivating students.

Teacher-made tests are used principally for instructional functions (Stanley and Hopkins, 1972, p. 7). They provide a means of feedback to the teacher. Feedback from tests helps the teacher provide more appropriate instructional guidance for individual students as well as for the class as a whole.
Well-designed tests may also be of value for pupil self-diagnosis, since they help students identify areas of specific weakness. Well-constructed tests can also motivate learning. As Mehrens (1979, p. 17) states, "Tests are credible instruments and will help motivate students and teachers." In general, students pursue mastery of objectives more diligently if they expect to be evaluated. Well-constructed examinations give students an opportunity to test their knowledge, and constructive feedback can motivate students to improve their performance.

In a study of classroom testing procedures and their influence on achievement, Marso (1970) found that:

1. unit testing does influence student achievement
2. feedback, pacing of learning, motivation and anxiety are related to student learning
3. testing procedures should incorporate frequent, graded, unit tests followed by class discussion
4. students with measured high test anxiety are not helped by frequent, graded, unit examinations with feedback.

Another study of the effect of frequent use of tests and feedback of test results was conducted by Feldhusen (1964). He reported that the majority of research reports prior to his study indicated that frequent use of teacher-made tests and feedback from them resulted in better achievement and increased understanding of the concepts presented by the teacher. In his study, fifty-five college students in an introductory psychology class were given fourteen weekly quizzes consisting of 10 to 20 items. The quizzes were graded and returned, and on ten of the fourteen occasions classroom discussion ensued. At the end of the course, students were asked to respond anonymously to a questionnaire. He found that students consistently reported greater study and learning with periodic testing, and that the anticipation of a forthcoming test may also affect students' "intention to remember" instructional content.
It is generally agreed that teacher-made tests are used because they enable the teacher to engage in continuous appraisal. There are, however, several limitations to the use of such tests that must be recognized if the teacher wants to make effective and efficient use of this means of measurement. As Schwartz and Tiedeman (1962, p. 110) say, "The foremost limitation of the use of teacher-made tests is the inadequate knowledge of most teachers concerning the principles of test construction. Test construction is a skill that can be learned but it takes time and practice to learn the skills of test construction. Another limitation that needs to be recognized is that good test items may take considerable time to prepare. Since there are limits to the time that is available to teachers, the problem of finding time to construct items may pose difficulties for some teachers. If, however, the teacher constructs test items on a daily basis, this limitation can be overcome."

Furthermore, Ebel (1979, p. 27) says that paper-and-pencil tests are well adapted to testing verbal knowledge and understanding and ability to solve verbal and numerical problems. These are important educational outcomes, but they are not all. One would not expect to get far using a paper-and-pencil test to measure children's physical development. Both performance tests of physical development and controlled observations of behavior in social situations would be expected to offer more promise than a paper-and-pencil test.

Constructing a satisfactory test is one of the hardest jobs a teacher has to perform. The process of constructing a good test item is deliberate and time-consuming; it demands an understanding of the objectives being assessed and of the examinees and their test-taking behavior. Teacher-made tests are of value only if they yield information that is used to improve the total teaching-learning process.
The Role of the Teacher in Testing and Evaluation

Teachers should be concerned with the types and levels of learning included in their courses from two perspectives: (1) the development and teaching of their courses, and (2) the assessment of their students' achievement (Lindvall, 1967, p. 3; Erickson and Wentling, 1976, p. 55).

The essential purpose of teaching is to provide changes in students. Any program of instruction must be based upon and be guided by information concerning student aptitude, interest, and achievement. The classroom teacher should be guided by continuous information about student aptitude, interest and progress. Although the classroom teacher employs as much informal observation as possible as a means of acquiring information about his students, it is necessary to use more formal procedures such as testing and other evaluation techniques.

Classroom realities compel teachers both to measure and evaluate student behaviors. The need to employ all applicable techniques of appraisal in the school has become a practical matter faced daily by teachers. As Goslin (1967, pp. 5-6) says, "The teacher occupies a central role in the testing and evaluation process for a number of reasons. First, the teacher is the primary point of contact between the child and the educational system, and what teachers say and do are major influences in the process whereby the child learns to assess his own abilities. Second, the teacher very often serves as the administrator and scorer of standardized tests, especially at the elementary level where testing specialists tend to be scarce. Even in situations where teachers are not directly involved in administering standardized tests, virtually all schools give teachers access to test scores. Finally, in a very real sense the teacher himself is being evaluated as a consequence of the performance of his pupils on standardized achievement tests.
Teachers, therefore, are not disinterested observers of the testing process and may be expected to make efforts to improve the performance of their pupils on standardized tests, wherever this is practical. This, in turn, results in tests having a potential impact on school curricula insofar as what is taught and how it is taught is left to the teacher."

The teacher has a responsibility for appraising the individual differences among students in their achievement of various educational objectives (Thorndike and Hagen, 1969, p. 33). He must pass on to the next teacher a report of these differences, either in the form of a mark or a specific recommendation, if the school is to provide an optimum learning environment for each child. Decisions about permitting students to pursue certain courses of study in high school, about admitting students to college, and about selecting students for certain occupations depend very largely upon judgments recorded by previous teachers concerning the competence of each student. The information on which these judgments are based is provided in considerable measure by tests.

Teachers must know how to perform certain aspects of measurement and evaluation themselves, such as constructing tests, giving grades, assessing potentialities, and interpreting standardized aptitude and achievement tests (Ebel, 1961a, pp. 19-32; Stanley, 1964, p. 5). They should know how to select from the many available tests, inventories, questionnaires, rating scales, checklists, and the like, those most suitable for a particular purpose. Besides being able to understand directions for administering, scoring, and interpreting tests, teachers should possess the higher ability to compare the most promising ones before the choice itself is made. This requires attaining various concepts necessary to understand test publishers' literature, reviews, and articles reporting test research.
Teachers sometimes dislike assuming the role of examiner (Ebel, 1979, p. 28). They may also be prone initially to frustration and disappointment when writing test items, possibly more so with one item format than another (Mehrens and Lehmann, 1978, p. 187). Ebel (1975) found that teachers did better in writing multiple-choice test items than in writing true-false test items. He comments, however, that this is hardly a fair comparison, since true-false items can be written more quickly by teachers, and responded to more quickly by students, than multiple-choice test items. Hence he feels that there is reason to question the recommendation that classroom teachers should generally give preference to multiple-choice over true-false test items.

The Qualifications of the Teacher as an Evaluator

The evaluation device used most frequently by the majority of teachers is undoubtedly the teacher-made test; therefore, it is essential that the beginning teacher be skilled in the development and use of such devices. Fortunately, the matter of how to develop and use classroom tests has received considerable attention in past years, and many useful criteria and suggestions have been prepared. The teacher who makes conscientious use of what is available in the area can greatly improve the quality of his tests (Lindvall, 1967, p. 30).

Many teachers admit that their tests do not adequately reflect the really important outcomes of their courses. Some are convinced that no test, and certainly no objective test, could adequately measure student achievement of their objectives (Ebel, 1972, p. 121). Sometimes teachers are embarrassed when they think of the way they judge their pupils. This is especially true after a parent-teacher conference if the teacher has not been able to explain the pupil's progress very effectively on the basis of measurement data collected (Lien, 1971, p. 20).
Schwartz and Tiedeman (1962) point out that one of the biggest errors teachers make in test planning is their tendency to wait until shortly before an examination is scheduled to begin to write the items to be included in the test. Often the press of other duties seems much more important so that actual item writing is put off until the last minute. The result usually is that too many of the test items are poorly thought out, contain ambiguous terms, and in all too many cases, involve petty details instead of the more important and pervasive outcomes of learning.

Ebel (1979, pp. 64-65) discusses some of the mistakes that teachers make in measuring educational achievement:

First, they tend to rely too much on their own subjective judgment, and on unverified inferences.

Second, some teachers feel obliged to use absolute standards in judging educational achievement, which can almost always be judged more fairly and consistently in relative terms. If most of the students in a class get A's on one test and most of the same students fail another, some teachers prefer to blame the students rather than the test.

Third, teachers tend to put off test preparation to the last minute. A last-minute test is likely to be a poor test.

Fourth, many teachers use tests that are too inefficient and too short to sample adequately the whole area of understanding and abilities that the course has attempted to develop.

Fifth, teachers often overemphasize trivial details in their tests, to the neglect of understanding of basic principles and ability to make practical applications.

Sixth, the questions that teachers write, both essay and objective, often suffer from lowered effectiveness due to unintentional ambiguity in the wording of the question or to inclusion of irrelevant clues to the correct response.
Seventh, the inevitable fact that test scores are affected by the questions or tasks included in them tends to be ignored, and the magnitude of the resulting errors (called sampling errors) tends to be underestimated by those who make and use classroom tests.

Finally, many teachers do not use the relatively simple techniques of statistical analysis to check on the effectiveness of their tests.

Are today's teachers being adequately prepared for performance of their evaluation responsibilities? Mayo (1967) found that graduating seniors in 86 teacher-training institutions did not demonstrate a very high level of measurement competence. Goslin (1967), in a study of the social consequences of testing and development of talent, found that about 60 percent of all teachers had only minimal exposure to training in test and measurement techniques. The unsatisfactory quality of the majority of teacher-made tests no doubt reflects this inadequacy in training. Not surprisingly, Goslin also found that teachers who had little preparation in tests and measurement tended to make little use of the pupil information obtained from standardized tests. Goslin (1967, p. 140) also states that "The role of teachers in testing is too important to be left to chance." Unfortunately, in view of studies by Mayo (1967) and Goslin (1967), Conant's recommendation (1963, p. 171) that instruction in tests and measurements be one of the essentials in teacher-training programs appears not to have been implemented adequately at many institutions.

Fleming (1971) claimed that not many teachers come to the classroom prepared to observe systematically, to construct their own classroom tests, or to interpret the results of standardized tests regardless of the mode in which the scores are reported.
She held that, "Part of pupil difficulties in the school have not only been due to inaccurate decisions by teachers but to the fact that teachers have been unskilled in constructing instructional cycles of relevant learning experiences based upon valid, definable goals" (Fleming, 1971, p. 71).

Roeder (1972) surveyed the qualifications or skill of the teacher as an evaluator. The 940 elementary teacher training institutions located in every state and the District of Columbia were mailed a one-page questionnaire. The data indicate that 57.7 percent of the institutions which were surveyed, or 496 institutions, did not require their prospective elementary teachers to complete a course in evaluation; 12.1 percent (104) required nothing more than a one or two semester hour course; 17.8 percent (158) required a three semester hour course and only 1.4 percent (12) required four or more semester hours of course work in evaluation. Sixty-two institutions (7.2 percent) reported that instruction in evaluation was a component of another course, e.g., educational psychology. The data also indicate that in 1970, the vast majority of teachers who were graduated from accredited teachers colleges and awarded state certification appeared to be better prepared to conduct an impromptu art lesson than they were to conduct, select, administer, score and interpret standardized and informal tests. Roeder concludes that it appears that even at institutions which do require a course in evaluation, the majority of teachers receive only a minimal exposure to the complex world of evaluation. Therefore, most of today's elementary teachers are not prepared to use tests.

Sor Wasna Pravalpruk (1974) studied the "Comparison among teachers in Khon Kaen, Thailand, to determine their testing needs."
A Likert type questionnaire was constructed to measure the perceived needs, and the actual needs were measured by a random sampling of test items measuring knowledge on educational measurement. The questionnaire and test had twenty-eight items and twenty items respectively. They were sent together to 400 teachers who were randomly selected from Khon Kaen province. Pravalpruk found that in the past, the in-service training program emphasized item editing in an attempt to improve the quality of the teacher-made test. The results of this training were reflected in the lower needs on both the actual and perceived needs in item editing than were found in other subject matter. The results of this study seem to indicate that it would be appropriate to emphasize item analysis procedures in future in-service programs because it was the area with the highest perceived needs score and was next to the highest in the test of actual needs. She also found that, in the item editing subject matter, teachers in the higher grades had less actual need than those who taught in lower grades.

In 1977, the Office of the National Education Commission (ONEC) studied the measurement competencies of the primary school teachers in Thailand. It was found that the third-grade teachers preferred to write multiple-choice items but they did not use a table of specifications as the blueprint for test construction.

Yeh (1978), in a study of teacher use of test results, reported that only 50 percent of the teachers sampled were able to correctly interpret two standard scores commonly used in reporting standardized achievement results (percentile ranks and grade equivalents). She concluded from this that teachers need more knowledge about measurement.
Given these findings about teachers' knowledge and the fact that teachers indicated they wanted more training on how to use and construct criterion-referenced tests, it may be that teachers need more training before any potential value of the test is realized. (Yeh, 1978, p. 42)

While there appears to be general agreement that teachers are not overly confident of their ability to interpret standardized test scores, the degree of confidence reported varies from researcher to researcher. Olejnik (1979), in a study conducted among non-test specialists (counselors, teachers and principals), found that over 90 percent of elementary and middle school educators indicated that they were at least "somewhat" confident of their ability to interpret test scores. The least confident were high school educationists. But when a mini-test similar to one given in college-level measurement courses was administered to the respondents, this self-reported "confidence" was not borne out. Most educationists correctly answered an item dealing with a percentile score (73%), yet a similar proportion missed an item that related norms to standards. They showed little understanding of the significance of stanine differences, and very few could properly interpret a grade equivalent score (12%). On the basis of his study, Olejnik concluded that in spite of self-reported confidence it appeared that non-measurement specialists needed additional assistance in the interpretation of standard scores.

A market survey of Stanford Achievement Test users was conducted by Stetz in 1977 (in Rudman et al., 1980). This study was aimed at determining the extent to which teachers and other educationists understand and accept standardized test results.
He found that both teachers and administrators preferred grade equivalents and percentile ranks for meeting their assessment needs; 59% of the teachers surveyed chose these two scores for individual student evaluation, 56% chose these two scores for class evaluation purposes, 65% chose grade equivalents and percentile ranks for measuring growth, and 67% preferred these two scores for reporting test results to parents. One would like to assume from this that those who showed such a strong preference for these two standard scores understood what they signified, but Olejnik's study (1979) does give one some pause.

The authorities seem to agree that testing is an integral part of teaching. Many teachers testify that the improvement of their skills in test construction has resulted in the improvement of their teaching. To make an appropriate and effective achievement test, one must have adequate knowledge of subject matter and skill in the techniques of test construction. Ebel (1961b, p. 68) has outlined six requisites for a teacher to be competent in educational measurement:

1) Know the educational uses, as well as the limitations, of educational tests.
2) Know the criteria by which the quality of a test should be judged and how to secure evidence relating to these criteria.
3) Know how to plan a test and write the questions to be included in it.
4) Know how to select a standardized test that will be effective in a particular situation.
5) Know how to administer a test properly, efficiently, and fairly.
6) Know how to interpret test scores correctly and fully, but with recognition of their limitations.

Sack (1979) studied the "Measurement competencies of educators defined through task analysis and differentiated by teaching area, grade level, and vocation."
The principals at 292 randomly chosen northern Illinois public schools were asked to select a competent classroom teacher and a qualified staff member or administrator to anonymously complete and return a task analysis questionnaire on measurement competencies. Educators and their responses were grouped by teaching area, grade level, and vocation with a view to specific measurement competencies preferred by categories of interest. A set of measurement competencies of acknowledged utility across nearly all educator categories tested was developed. These are:

1. Knowledge of advantages and disadvantages of standardized tests.
2. Understanding of the importance of adhering strictly to the directions and stated time limits of standardized tests.
3. Knowledge of general uses of tests, such as motivating, emphasizing important teaching objectives in the minds of pupils, providing practice in skill, and guiding learning.
4. Ability to state measurable educational objectives.
5. Knowledge of the techniques of administering a test.
6. Knowledge of effective procedures in reporting to parents.
7. Ability to interpret diagnostic test results so as to evaluate pupil progress.
8. Knowledge of limitations of tests that require reading comprehension.
9. Knowledge of limitations in interpreting IQ scores.
10. Understanding of the fact that interpretation of achievement from norms is affected by ability level, cultural background and curricular factors.

Improving the Competence of Teachers in Measurement and Evaluation

Measurement and evaluation are a part of every teacher's responsibilities. He must appraise the status and progress of the learner and make reports. A teacher can hardly be of maximum effectiveness today without knowing at least how to interpret and use the results of standardized tests of readiness, intelligence, and achievement. In addition, the teacher must know how to measure and evaluate with instruments of his own devising.
Since these are necessary, some instruction in the fundamentals of measurement should be included in the preparation of every teacher. Workshops, field courses, supervisory assistance, teachers' meetings, and professional reading are all helpful in improving teachers' skills in measurement and evaluation.

Improving classroom teacher competence in measurement and evaluation must not be treated as an isolated question, Margaret Stevenson (1959) says. It must be viewed in context, against the background of the numerous problems involved in establishing an adequate, democratically structured testing program in the school curriculum. While a great variety of things must be done to foster improvements in teacher competence in measurement, Ebel (1961b) suggests that special emphasis may be focused on only three:

1) increased attention to educational measurement in teacher-training programs;
2) provision of special testing services to teachers in school systems. This requires a school system to employ a staff member with special competence in testing; and
3) special organization of in-service training programs in measurement for teachers.

Noll (1961) recommends three possible ways for improving the preparation of teachers in measurement. The first would be to make a commitment to the policy of including a course in measurement as part of the requirement for a teacher's license or certificate. The second would be to work for the strengthening of existing programs for preparation of teachers in all feasible ways in the area of measurement and evaluation but without a specific course requirement. Noll says that these are not necessarily antithetical or mutually exclusive but it seems likely that the requirement for a course in measurement of all prospective teachers would have the effect of reducing the emphasis on this topic in other courses where it is now usually included.
A third possibility is the requirement of demonstrated proficiency in the area of measurement and evaluation on an examination, probably of a comprehensive objective nature.

In addition, Goslin (1967) states that school systems and testing specialists might be encouraged to initiate formal and informal training programs in measurement and evaluation for teachers.

The importance of in-service training programs in measurement has been recognized over the past several years. These programs, variously referred to as conferences, seminars, institutes, or workshops, have ranged from an afternoon lecture to a three-day preschool program with several follow-up meetings later in the year. Some of these programs were sponsored by a single school system and involved the teachers of that school in all subject areas and at all levels. Others reflected the interest of a single professional group, such as engineers or nurses (Ebel, 1961b).

A statewide research study in Tennessee about in-service education was conducted by Brimm and Tollett (1974). The purposes of this study were to identify the types of in-service education programs currently in use throughout the state and to ascertain teacher attitudes toward in-service education programs. The results of the study can be summarized as follows:

1) The primary purpose of in-service programs is to upgrade the teacher's classroom performance.
2) The teacher should have the opportunity to select the kind of in-service activities which he feels will strengthen his professional competence.
3) In-service programs must include activities which allow for the different interests which exist among individual teachers. If teachers' professional growth is to be taken seriously, public school administrators and teachers must pool their knowledge and resources and seek to make in-service education more responsive to the needs and interests of practicing classroom teachers.
There seem to be some problems in in-service training in measurement and evaluation. Durost (1959) summarized the problems in in-service training in measurement as follows:

1) There are not enough leaders being trained in this area.
2) There is confusion and competition between professional groups training workers in the field of guidance, school psychology, and measurement per se as to who should be the person in the community with top responsibility for measurement.
3) Centralized training at the university level, no matter how good, will never diminish to the vanishing point the need for local in-service training at the community level because of unique local problems.
4) Teachers in general are afraid of measurement courses or even workshops in measurement because they are afraid of arithmetic, mathematics, statistics, etc.
5) Some teachers feel that the testing program cannot genuinely help them to improve instruction.

Ebel (1961b) sees two main weaknesses in the in-service training programs in measurement:

1) They are too brief. While an hour or two a year spent in considering measurement problems under the guidance of a specialist is far better than nothing at all, it is unreasonable to suppose that satisfying enduring progress in solving the manifold problems of educational measurement, or in developing the requisite knowledge, understanding, and skills, can be made in so short a time.
2) They involve too much talking and too little doing. For the cultivation of a practical art like educational measurement, sound pedagogy requires a mingling of theory and practice.

The competence of the teacher in measurement can hardly be improved if the in-service program is not effective (Miller, 1977). Durost (1959) suggests specific steps to improve the in-service training program at the local level:

1) Preparation of local bulletins. A series of bulletins, supplementing the published materials concerning the tests in use in the county, has been written, tying in the testing program with the local program of instruction.
2) Use of school test coordinators. At the elementary level this may be the school principal or it may be a teacher with an interest in measurement who has been designated for this responsibility. These test coordinators meet regularly, especially before and after a scheduled testing program, to discuss problems involved in administering, scoring and interpreting the test results.
3) Extension courses in the area of measurement.
4) Faculty workshops. A considerable number of faculty workshops, varying in length from one to four or five sessions, have been held. These workshops concern themselves with the aspects of the total measurement problem which are important to the faculty at that moment.
5) Development of community interest in the measurement program, through a judicious use of local newspaper publicity, talks to parent-teacher associations, etc.
6) Use of local norms.
7) Provision of adequate physical facilities.
8) Use of demonstrations, lectures, etc., in the In-Service Training Center.
9) TV workshop on testing.

Ebel (1961b) proposes the ideal program of in-service training in measurement as follows:

Suppose that a school administrator and his staff have decided to focus attention for a year on the improvement of classroom testing. Suppose they engage a specialist in educational testing to meet with them five times during the year, at intervals of six weeks or so, for a day or two. Participation in the initial program might well be limited to five, six, or seven groups of four to six teachers each. The goal of each group would be to make, to use, and to analyze a quality test in a subject which all members of the particular group were teaching.
Examples of subject areas in which these tests might be developed are: fourth-grade mathematics; sixth-grade geography; eighth-grade English; or high-school history, chemistry, or economics. The first meeting of each participating group would be devoted to a description of the entire project, with special consideration of the first step - the preparation of specifications for the test to be developed. Sample specifications would be presented for study and analysis. Between the first and second sessions each teacher group would work out the specifications for its test. These could be reviewed at the second meeting, and work on item writing would be launched. The third meeting could be devoted to item review and test assembly, the fourth to test administration and analysis, and the fifth to a review of the test developed and of the entire project as a learning experience.

Ebel believes that a program like this would produce not only a handful of excellent tests but also a sizable group of teachers whose competence in measurement was vastly improved and, by current standards, highly respectable.

Zigarmi, Betz, and Jensen (1977) studied teachers' preferences in and perceptions of in-service education. They found that in-service training programs will be useful if:

1) They are planned in response to the assessed needs of teachers and build on the interests and strengths of the teachers for whom they are designed.
2) They start with the assumption that teachers can be resources to each other and, therefore, these programs provide opportunities for teachers to share ideas and resources with each other.

Summary

Classroom teachers are constantly searching for ways to improve their service to children. In line with this objective, they are anxious to find new methods of measurement and evaluation and to enhance their skills in using these techniques (Ebel, 1961a; Stanley, 1964; Goslin, 1967; Lindvall, 1967; Erickson and Wentling, 1976).
The measurement of pupil achievement requires the extensive use of tests constructed by classroom teachers. This is so because many of the instructional outcomes can be measured by paper-and-pencil tests and because standardized tests are seldom well adapted to the particular objectives emphasized in teaching (Mehrens and Lehmann, 1978; Noll, Scannell, and Craig, 1979). In addition, teacher-made tests can be used for a variety of instructional purposes (Schwartz and Tiedeman, 1962; Feldhusen, 1964; Marso, 1970; Stanley and Hopkins, 1972; Ebel, 1979; Mehrens, 1979). For example, the teacher may want to measure achievement at the end of a unit of work, diagnose a learning difficulty which has come to his attention, or check on how well the pupils have mastered a specific skill.

Constructing a good test is one of teachers' most difficult duties, and they sometimes dislike assuming the role of examiner. The qualifications of teachers as evaluators are widely reported as less than satisfactory for most teachers. There is evidence that today's teachers are not being adequately trained for performance of their measurement and evaluation responsibilities (Conant, 1963; Mayo, 1967; Goslin, 1967; Fleming, 1971; Roeder, 1972; Pravalpruk, 1974; ONEC, 1977; Yeh, 1978; Olejnik, 1979; Sack, 1979). The results of these studies indicate that more publications and in-service programs are needed to improve teachers' competencies.

There are a great variety of ways to improve the classroom teacher's competencies in measurement and evaluation, such as: 1) increasing attention to educational measurement in teacher-training programs, 2) employing a staff member with special competence in testing in the school system, 3) including a course in measurement as part of the requirement for a teacher's license or certificate, or 4) initiating formal and informal in-service training programs in measurement for teachers (Noll, 1961; Ebel, 1961b; Goslin, 1967).
The importance of in-service training programs in measurement has been recognized over several years. The primary purpose of these programs is to upgrade the teacher's classroom performance. To be effective, these programs must include activities which allow for the different interests which exist among individual teachers (Durost, 1959; Brimm and Tollett, 1974; Miller, 1977; Zigarmi, Betz, and Jensen, 1977).

CHAPTER III

PROCEDURES

This study can be classified as a comparative study. It was aimed at providing data and information about the quality of the teacher as an evaluator to the administrators and educators in teacher training institutions in Thailand. Data were collected by questionnaires, sent through the mail or delivered personally. This chapter provides a description of the population, sampling procedure, instrument, data collection, and plan for data analysis.

The findings of the study are presented in Chapter IV and conclusions are given in Chapter V.

Population

Geographically, Thailand is in Southeast Asia. The area of the country is 514,000 square kilometers. The population of Thailand is about 46 million, of which 21% are students.

Bangkok, the capital of Thailand, has a population of four million, 25% of whom are students. There are 422 public elementary schools and 102 public secondary schools in Bangkok. Of the 422 elementary schools, 33 schools are under the Ministry of Education. The rest of them are under the Ministry of Interior. Ninety-two of the 102 secondary schools are under the Ministry of Education. The other ten schools are demonstration schools operated by universities in Bangkok.

The population of interest is teachers who have been teaching in public elementary and secondary schools operated by the Ministry of Education in Bangkok. There are 1,599 teachers and 30,828 students in the 33 elementary schools, and 9,840 teachers and 203,476 students in the 92 secondary schools.
The average ratio of teachers to students is 1:19 in elementary schools and 1:21 in secondary schools.

The 33 elementary schools and 92 secondary schools are located in five different regions. The number of schools in each region is presented in Table 3.1. The breakdown of 11,439 teachers into six groups, according to level of school and level of teacher education, is presented in Table 3.2. The data from Table 3.2 show that 2% of 11,439

TABLE 3.1
Total Number of Schools in Bangkok Classified by Location and Level of School

                           Level of School
Location of School    Elementary School    Secondary School
Region 1                      9                   23
Region 2                      4                   18
Region 3                      7                   15
Region 4                      4                   19
Region 5                      9                   17
Total                        33                   92

[TABLE 3.2 Total Number of Teachers in 33 Elementary Schools and 92 Secondary Schools in Bangkok Classified by Level of School and Level of Teacher Education. Cell values are not legible in the source.]

[TABLE 3.4 Total Number of Teachers in Elementary Schools and Secondary Schools in Bangkok Classified by Level of School, Level of Teacher Education, and Years of Teaching Experience. School counts and cell values are not legible in the source.]

Instrument

The pilot test contained 80 true-false items concerning basic knowledge in measurement and evaluation. These items were selected from two tests, of 107 and 121 items, by permission of Dr. Robert L. Ebel. These tests had been used for the final examination at least three times in a basic measurement course at Michigan State University (ED 465: Testing and Grading).
The Kuder-Richardson Formula 20 (KR-20) reliability of these two tests has varied between .88 and .91, mean item difficulty (proportion of incorrect responses) has varied between .11 and .21, and mean item discrimination (upper-lower difference) has varied between .18 and .27.

The item selection was based on subject matter, item difficulty index, and item discrimination index. The items in the pilot test covered basic knowledge in measurement and evaluation corresponding to four subject matters that teachers should know. The four areas are:

1. planning a classroom test,
2. item writing,
3. item analysis, and
4. test score statistics and marking system.

The pilot test was composed of 20 items in each subject matter area, for a total of 80 items. It contained 31 true statements and 49 false statements (see Appendix A). The test was then translated into Thai. The appropriateness of the test items and of the translation was checked by other Thai educators whose major area is measurement and evaluation.

The instrument was piloted with forty Thai elementary school teachers and forty Thai secondary school teachers, for a total of eighty teachers. These teachers were omitted when selecting the sample. The test was distributed to this group in the first week of November, 1979, and all returns were collected in the first week of December, 1979. All responses were transferred to answer sheets and sent to the Scoring Office at Michigan State University for computing reliability and item analysis.

Most of the test items were difficult. The item difficulties (proportion of incorrect responses) ranged from .06 to .89. Of the 80 items, 37 items (46%) had an index of difficulty above .60, and only 11 items (14%) had an index of difficulty below .20. The discrimination indices ranged from -.19 to .47. The KR-20 reliability was .40, mean item difficulty was .53, and mean item discrimination was .15.

The items for each subscale were selected separately for the final test.
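The three indices reported above can be illustrated with a short computation. The following Python sketch is not the program used by the Scoring Office; it is a minimal illustration under stated assumptions: responses are scored 1 (correct) or 0 (incorrect), item difficulty is the proportion of incorrect responses as defined in this chapter, the upper and lower groups each hold 27 percent of the examinees by default (a common convention, assumed here because the group size is not stated), and KR-20 uses the population variance of the total scores. The function names and the five-examinee data set are hypothetical.

```python
from statistics import pvariance


def item_difficulty(responses):
    """Proportion of INCORRECT responses per item (this chapter's convention)."""
    n = len(responses)
    k = len(responses[0])
    return [1 - sum(row[j] for row in responses) / n for j in range(k)]


def upper_lower_discrimination(responses, fraction=0.27):
    """Proportion correct in the upper group minus that in the lower group.

    `fraction` (size of each extreme group) is an assumed convention.
    """
    n = len(responses)
    k = len(responses[0])
    size = max(1, round(fraction * n))
    # Rank examinees by total score, lowest first.
    ranked = sorted(responses, key=sum)
    lower, upper = ranked[:size], ranked[-size:]
    return [
        sum(row[j] for row in upper) / size - sum(row[j] for row in lower) / size
        for j in range(k)
    ]


def kr20(responses):
    """Kuder-Richardson Formula 20: k/(k-1) * (1 - sum(p*q) / var(total))."""
    n = len(responses)
    k = len(responses[0])
    total_variance = pvariance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")
    p = [sum(row[j] for row in responses) / n for j in range(k)]
    sum_pq = sum(pj * (1 - pj) for pj in p)
    return k / (k - 1) * (1 - sum_pq / total_variance)


# Hypothetical scored responses: 5 examinees (rows) by 4 items (columns).
SCORES = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

if __name__ == "__main__":
    print("difficulty:", item_difficulty(SCORES))
    print("discrimination:", upper_lower_discrimination(SCORES, fraction=0.4))
    print("KR-20:", kr20(SCORES))
```

Under these conventions a high difficulty index marks a hard item, and an item with a discrimination index at or below zero is answered correctly no more often by high scorers than by low scorers, which is the kind of item the deletion rules in this section remove.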
Within each of the four subject matters, the item discrimination indices and item difficulties were considered in deleting items. The final test contained fifty-two items, thirteen for each subscale.

Among the twenty items in Planning a Classroom Test, seven items (items 2, 7, 9, 12, 17, 18, and 19) had zero or negative discrimination indices (ranging from -.19 to .00). Five of these seven items had high difficulty indices (ranging from .51 to .81); items 2 and 9 had difficulties of .45 and .06 respectively. Therefore, all seven items were deleted.

Among the items in Item Writing, items 22, 26, 28, 33, 34, 38, and 40 were deleted because of low discrimination indices (ranging from -.05 to .09) and high levels of difficulty. The item difficulties of the first six deleted items ranged from .56 to .89. Item 40 had the lowest difficulty index among these seven items; its difficulty index was .43.

In the Item Analysis group, the deleted items were numbers 42, 43, 46, 49, 55, 56, and 58, with discrimination indices ranging from -.05 to .09 and levels of difficulty ranging from .06 to .90. Only two of them had difficulty indices below .20.

In the last subject matter area, Test Score Statistics and Marking System, items 64, 67, 68, 72, 73, 74, and 75 were deleted. The difficulty indices of these seven items ranged from .21 to .83. Four of them had an index of difficulty above .71; only two items had an index of difficulty below .21. The discrimination indices of these seven items ranged from -.09 to .24.

The final instrument (see Appendix B) was composed of two parts:
1. A questionnaire concerning the general information of the teachers (3 items), opinion on the national testing program (1 item), and teachers' experiences in measurement and evaluation (4 items). This part contained eight items.
2. The second part was composed of 52 true-false items, of which 20 statements were true and 32 statements were false.
These test items measured the basic knowledge in measurement and evaluation.

The mean item difficulty on the pilot test was .50 for the Planning a Classroom Test subscale, .49 for the Item Writing subscale, .49 for the Item Analysis subscale, and .55 for the Test Score Statistics and Marking System subscale.

Data Collection

The instrument was sent to the teachers in the sample in the first week of February, 1980, both by mail and by personal delivery, school by school. To ensure the return of a large majority of responses, follow-up was done weekly, either by letter or by personal contact. Because of the personal contacts and some help from the school principals, sixty-nine percent of the responses (374 responses) were returned by the last week of March, 1980. It was noticed that the less personal the relationship between the agent collecting data and the teachers in the sample, the less likely the responses were to be returned. Since the number of returned responses in each group ranged from 30 to 33 (see Table 3.5), fourteen responses (3.7% of responses) were randomly discarded in order to get thirty subjects in each group, for a total of 360 subjects. Because the population is homogeneous and only 3.7% of the responses were randomly discarded, it is believed that any distortion of the data is small.

Design

The design of this study is a 2 x 2 x 3 factorial design with four repeated measures. The four areas of basic knowledge in measurement and evaluation were: planning a classroom test; item writing; item analysis; and test score statistics and marking system. Table 3.6 presents the layout of the three independent variables: level of school; level of teacher education; and teaching experience. The design is crossed and balanced with thirty observations per cell.
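The balancing step described under Data Collection (randomly discarding surplus responses so that each of the 12 cells holds exactly 30) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the per-cell counts are hypothetical values chosen only to fall between 30 and 33 and to sum to 374, because the actual counts in Table 3.5 are not legible in the source, and `balance_cells` is a hypothetical helper name.

```python
import random
from itertools import product


def balance_cells(cells, target, seed=0):
    """Randomly keep `target` responses in every cell of the design."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    balanced = {}
    for key, ids in cells.items():
        if len(ids) < target:
            raise ValueError(f"cell {key} has fewer than {target} responses")
        balanced[key] = rng.sample(ids, target)
    return balanced


# The 2 x 2 x 3 design: level of school, teacher education, teaching experience.
design = list(product(
    ["elementary", "secondary"],
    ["certificate", "degree"],
    ["0-3 yrs", "4-10 yrs", "> 10 yrs"],
))

# HYPOTHETICAL returned counts per cell (each between 30 and 33, total 374).
hypothetical_counts = [30, 31, 30, 33, 32, 31, 32, 31, 30, 32, 31, 31]

cells = {}
next_id = 0
for key, count in zip(design, hypothetical_counts):
    cells[key] = list(range(next_id, next_id + count))
    next_id += count

balanced = balance_cells(cells, target=30)
n_returned = sum(len(ids) for ids in cells.values())
n_kept = sum(len(ids) for ids in balanced.values())
n_discarded = n_returned - n_kept
```

Discarding at random within each cell keeps the design crossed and balanced, which simplifies the repeated measures analysis, at the cost of dropping 14 of the 374 returned responses.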
[TABLE 3.5. Total Number of Returned Responses Classified by Level of School, Level of Teacher Education, and Years of Teaching Experience. Table body illegible in the source.]

[TABLE 3.6. Design of the Three Independent Variables: Level of School, Level of Teacher Education, and Teaching Experience. Table body illegible in the source.]

School       Education     Experience      PL     WR     AN     ST    Total
Elementary   Certificate   > 10 yrs.      5.60   6.40   6.37   6.27   26.63
                           n = 30         1.54   1.99   1.59   1.64    3.94
             Degree        0 - 3 yrs.     6.00   6.67   7.10   7.40   27.17
                           n = 30         1.74   1.95   1.58   1.73    3.91
                           4 - 10 yrs.    6.27   6.47   6.80   6.40   25.93
                           n = 30         1.60   1.74   1.47   2.06    4.35
                           > 10 yrs.      6.00   6.80   6.67   7.57   27.03
                           n = 30         1.84   1.75   1.24   1.36    3.68
Secondary    Certificate   0 - 3 yrs.     5.83   6.37   6.70   6.40   25.30
                           n = 30         1.44   1.77   2.02   1.61    4.41
                           4 - 10 yrs.    5.80   6.27   6.67   6.37   25.10
                           n = 30         1.88   1.64   1.63   1.73    3.82
                           > 10 yrs.      6.07   6.43   6.87   6.13   25.50
                           n = 30         1.57   1.59   1.63   1.83    3.86
             Degree        0 - 3 yrs.     6.60   7.30   7.50   7.13   28.53
                           n = 30         1.57   1.70   1.89   1.53    4.25
                           4 - 10 yrs.    6.97   7.37   6.83   6.90   28.07
                           n = 30         1.83   1.77   1.15   1.65    3.27
                           > 10 yrs.      7.30   6.80   6.80   7.37   28.27
                           n = 30         1.82   1.73   1.63   1.79    4.73
Entire sample                             6.21   6.66   6.87   6.73   26.48
                                          1.76   1.79   1.59   1.73    4.12

PL = Planning a Classroom Test
WR = Item Writing
AN = Item Analysis
ST = Test Score Statistics and Marking System

TABLE 4.2
Multivariate Repeated Measures Analysis on Basic Measurement Scores

Sources of Variation         DF         F         P Less Than
School (S)                   1,348      2.3227    .1285
Education (E)                1,348     23.9135    .0001*
Teaching Experience (T)      2,348       .1960    .8222
SE                           1,348      5.0161    .0258*
ST                           2,348       .2536    .7762
ET                           2,348      1.0875    .3383
SET                          2,348      1.2252    .2950
Subject Matter (M)           3,346     11.8591    .0001*
MS                           3,346      1.5126    .2110
ME                           3,346      2.8938    .0354*
MT                           6,692       .9080    .4885
MSE                          3,346      1.0388    .3755
MST                          6,692      1.0782    .3740
MET                          6,692      1.6543    .1297
MSET                         6,692       .4990    .8094

*The test is significant at the α = .05 level.

- the subject matter main effect (F = 11.8591, p < .0001),
- and the subject matter by teacher education interaction (F = 2.8938, p < .0354).

Since the interaction between subject matter and teacher education was significant, the general profile was not interpreted. One profile was made for all certificate teachers and another for degree teachers. Figure 1 presents these two profiles. Means of each subscale of measurement for certificate teachers varied from 5.89 (for Planning a Classroom Test) to 6.78 (for Item Analysis). For the degree teachers, the means varied from 6.52 (for Planning a Classroom Test) to 7.13 (for Test Score Statistics and Marking System). The differences between certificate teachers and degree teachers on Planning a Classroom Test, Item Writing, Item Analysis, and Test Score Statistics and Marking System were -.63, -.47, -.17, and -.79 respectively. The interaction was ordinal with respect to teacher education.
The largest difference between certificate teachers and degree teachers was found on the mean of the Test Score Statistics and Marking System subscale; the smallest difference between these two groups was found on the Item Analysis subscale.

A multivariate analysis of variance was performed to test the significance of the differences between the subscale means of certificate teachers and degree teachers. The multivariate F ratio was significant at α = .05 (F = 7.6698, p < .00001), and the univariate F ratios indicated that there were significant differences in the scores of the two groups of teachers on the Planning a Classroom Test subscale, the Item Writing subscale, and the Test Score Statistics and Marking System subscale. A significant difference between the two groups, however, was not found on the Item Analysis subscale (see Table 4.3).

Figure 1. Graph Presentation of Cell Means for Subject Matter and Teacher Education

                PL     WR     AN     ST
Certificate    5.89   6.43   6.78   6.34
Degree         6.52   6.90   6.95   7.13

PL = Planning a Classroom Test
WR = Item Writing
AN = Item Analysis
ST = Test Score Statistics and Marking System

TABLE 4.3
Univariate Analysis of Variance on Subscale Scores of Certificate Teachers and Degree Teachers

Variables    DF        F          Signif. of F
Plan         1,358    11.8022     .0007
Write        1,358     6.3681     .0121
Analy.       1,358      .9845     .3218
Stat.        1,358    19.3695     .0000

Plan = Planning a Classroom Test
Write = Item Writing
Analy. = Item Analysis
Stat. = Test Score Statistics and Marking System

A significant interaction between level of school and teacher education was also observed (Figure 2). The mean of certificate teachers in elementary school was slightly higher than the mean of certificate teachers in secondary school (25.60 and 25.30 respectively), but in contrast, the mean of degree elementary school teachers was lower than the mean of degree secondary school teachers (26.71 and 28.29 respectively).
The mean of degree teachers in elementary school was higher than the mean of certificate teachers, but the difference was not large enough to be significant at the α = .05 level. In secondary school, the difference between the mean of degree teachers and the mean of certificate teachers was significant at the α = .05 level.

Figure 2. Graph Presentation of Cell Means for Level of School and Teacher Education

                      Certificate    Degree
Elementary School        25.60       26.71
Secondary School         25.30       28.29

Since the interaction between level of school and teacher education was significant, follow-up analyses were performed using multivariate analysis of variance to test the significance of the differences in subscale scores between degree teachers and certificate teachers in elementary school and in secondary school separately. Cell means for subject matter and level of education are presented in Figure 3 for elementary school teachers and in Figure 4 for secondary school teachers. The multivariate F ratio for secondary school teachers was significant at α = .05 (F = 7.4127, p < .00002). The univariate F ratios indicated that there were significant differences between certificate teachers and degree teachers in secondary school on only three subscales: Planning a Classroom Test, Item Writing, and Test Score Statistics and Marking System (see Table 4.4). Means of degree teachers in elementary school were higher than means of certificate teachers on every subscale, but the differences were not large enough to be significant at the .05 level.

TABLE 4.4
Univariate Analysis of Variance on Subscale Scores of Certificate and Degree Secondary School Teachers

Variables    DF        F          Signif. of F
Plan         1,178    17.5710     .0000
Write        1,178    10.0430     .0018
Analy.       1,178     1.4391     .2319
Stat.        1,178    11.0521     .0011

Figure 3.
Graph Presentation of Cell Means for Subject Matter and Level of Education of Elementary School Teachers

                PL     WR     AN     ST
Certificate    5.89   6.50   6.82   6.39
Degree         6.09   6.64   6.86   7.12

PL = Planning a Classroom Test
WR = Item Writing
AN = Item Analysis
ST = Test Score Statistics and Marking System

Figure 4. Graph Presentation of Cell Means for Subject Matter and Level of Education of Secondary School Teachers

                PL     WR     AN     ST
Certificate    5.90   6.36   6.74   6.30
Degree         6.96   7.16   7.04   7.13

PL = Planning a Classroom Test
WR = Item Writing
AN = Item Analysis
ST = Test Score Statistics and Marking System

Additional Analyses

Besides testing the hypotheses of interest, further analyses were done to compare measurement needs (indicated by lower scores on the test) among various groups of teachers as defined by the following independent variables: taking a college measurement course, attending the training program in measurement, favoring a national testing program, and region of school. Analyses were also done to observe the relationship between perceived needs (indicated by the feeling of lacking desirable knowledge in measurement) and measurement needs, and to compare group means with criterion scores (ideal means). Group means, F ratios testing the difference between groups, and the probabilities of the F ratios are presented in Tables 4.5 to 4.10.

Multivariate analysis of variance was used to test the mean differences on all four subscale scores between teachers who took a college measurement course and those who did not take a course. The multivariate F ratios for the entire sample, for the degree teachers, and for the secondary school teachers were significant at the α = .05 level (F = 5.3747, p < .0003; F = 3.0278, p < .0191; F = 4.4114, p < .0020 respectively).
Significant differences on subscale scores were not found in the certificate teachers group or in the elementary school teachers group. Tables 4.5, 4.6, and 4.7 present the univariate analyses of variance on subscale scores and total score for teachers who had and those who had not taken a college measurement course; the analyses were done individually for the entire sample, for the degree teachers group, and for the secondary school teachers group respectively. It was observed that the teachers who had taken a college measurement course received higher total scores than those who had not taken a course in all three groups of teachers; the mean differences were 2.08, 3.75, and 2.75 respectively, and the differences were significant at the α = .05 level. On the four subscales, the teachers who had taken a college measurement course, in all three groups mentioned above, received higher scores on the Item Writing and Item Analysis subscales than those who had not taken a course; the significant differences were found at the α = .05 level. No significant difference was found on the Planning a Classroom Test subscale in any of the three groups of teachers. On the Test Score Statistics and Marking System subscale, a significant difference in favor of those who had taken a college measurement course was found at the α = .05 level in the entire sample and in the secondary school teachers group, but it was not found in the degree teachers group.

TABLE 4.5
Analysis of Variance on Subscale Scores and Total Score of Teachers Who Took a College Measurement Course and Those Who Did Not Take a Course

                 College Measurement
              Took          Did Not Take     Univariate
              n = 269       n = 91
Variables     (X̄)           (X̄)              F           P Less Than
Plan           6.29          5.98             2.0930      .1488
Write          6.83          6.19             8.8551      .0031
Analy.         6.96          6.58             3.9065      .0489
Stat.          6.93          6.18            13.1894      .0003
Total         27.00         24.92            18.1281      .0000

Plan = Planning a Classroom Test
Write = Item Writing
Analy. = Item Analysis
Stat. = Test Score Statistics and Marking System
Multivariate analysis of variance was also used to test the mean differences on all four subscales between the teachers who attended the training program in measurement and those who did not attend the program. No significant differences were found on any of the subscales. The total mean of the teachers who attended the training program was 26.4, and it was 26.5 for those who did not attend the program.

TABLE 4.6
Analysis of Variance on Subscale Scores and Total Score of Degree Teachers Who Took a College Measurement Course and Those Who Did Not Take a Course

                 College Measurement
              Took          Did Not Take     Univariate
              n = 168       n = 12
Variables     (X̄)           (X̄)              F           P Less Than
Plan           6.55          6.17              .5118      .4753
Write          7.00          5.50             8.2608      .0045
Analy.         7.01          6.08             4.2666      .0403
Stat.          7.19          6.25             3.3900      .0673
Total         27.75         24.00             9.8179      .0020

TABLE 4.7
Analysis of Variance on Subscale Scores and Total Score of Secondary School Teachers Who Took a College Measurement Course and Those Who Did Not Take a Course

                 College Measurement
              Took          Did Not Take     Univariate
              n = 137       n = 43
Variables     (X̄)           (X̄)              F           P Less Than
Plan           6.53          6.12             1.7659      .1856
Write          6.98          6.05             9.8969      .0019
Analy.         7.03          6.47             3.7478      .0545
Stat.          6.92          6.07             8.2370      .0046
Total         27.45         24.70            14.4437      .0002

The comparison between teachers who favored a national testing policy and those who did not favor the policy was performed by using multivariate analysis of variance. The teachers who favored a national testing policy received higher scores than those who did not favor the policy on every subscale (mean differences were .22, .22, .39, and .40 respectively), but the differences were not large enough to be significant at α = .05 (p < .0642). A similar comparison was made between five groups of teachers, classified by the location of the school in which they were teaching.
Slight differences between the means of those five groups of teachers were observed (the mean for Region 1 was 26.36, for Region 2 it was 26.43, for Region 3 it was 25.86, for Region 4 it was 26.68, and for Region 5 it was 27.32), but the differences were not large enough to be significant at the .05 level (p < .2473).

Table 4.8 presents cell means on all four subscales for certificate and degree teachers who were classified into four groups according to the area of measurement they thought they knew most. Multivariate analysis of variance was used to determine whether there was any significant difference between the means within the four subgroups of certificate teachers and within the four subgroups of degree teachers on each subscale of measurement. No significant differences were found on any of the subscales, either in the certificate teacher groups or in the degree teacher groups. For example, it was observed that the teachers who thought they knew the most in Planning a Classroom Test did not get their highest score on this subscale when compared with the other three subscales. The same result occurred in the other three groups of teachers.

A similar comparison was made between four groups of teachers who were classified by the area of measurement they thought they knew least. Cell means and the number of teachers in each group are presented in Table 4.9. No significant differences were found on any of the subscales. The data from Tables 4.8 and 4.9 indicated that there was no relationship between perceived needs (indicated by the feeling of lacking desirable knowledge in measurement) and measurement needs (indicated by lower test scores).

The comparison between the means of the entire sample and the criterion scores (ideal means) was done by using the Z-test. The analyses were done individually on both the total mean and the subscale means. Means of the
Means of the 64 entire sample, criterion scores, Z ratios testing the differences be- tween group means and criterion scores, and the probabilities of the Z ratio are presented in Table 4.10. It was found that the total mean and the mean of each subscale were lower than the criterion scores, and the mean differences were significant at a = .05 level. TABLE 4.8 Presentation of Cell Means for All Four Subject Matter Areas of Certificate and Degree Teachers Who Were Classified into Four Different Groups According to Area of Measurement They Thought They Knew Most Test Score (X) Specialized Area in Measurement Plan Write Analy. Stat. N Plan 5.72 6.38 6.38 6.54 39 Write 5.86 6.37 6.89 6.46 87 Cert. Teachers Analy. 6.22 6.86 6.89 5.94 36 Stat. 5.78 5.94 6.89 6.17 18 Plan 6.41 6.94 7.00 7.41 34 Write 6.05 6.59 6.73 6.92 75 Degree Teachers Analy. 7.03 7.15 7.06 7.06 33 Stat. 7.11 7.26 7.24 7.34 38 65 TABLE 4.9 Presentation of Cell Means for All Four Subject Matter Areas of Certificate and Degree Teachers Who Were Classified into Four Different Groups According to Area of Measurement They Thought They Knew Least Weak Area in Test Score (X) Measurement Plan Write Analy. Stat. N Plan 6.24 6.39 6.71 6.32 38 Write 6.75 6.00 7.00 6.63 8 Cert. Teachers Analy. 5.62 6.48 6.83 6.44 71 Stat. 5.89 6.44 6.75 6.22 63 Plan 6.91 6.91 7.06 7.39 33 Write 7.07 6.93 7.07 6.96 27 Degree Teachers Analy. 6.26 6.84 6.99 6.97 74 Stat. 6.35 6.98 6.74 7.28 46 66 TABLE 4.10 Comparison Between Sample Means and Criterion Scores __ Criterion P Less Variables X Score* 2 Than Plan 6.21 9.75 -38.03 .0000 Write 6.66 9.75 -32.75 .0000 Analy. 6.87 9.75 -34.32 .0000 Stat. 6.74 9.75 -33.04 .0000 Total 26.48 39.00 -57.72 .0000 *Criterion score (ideal mean) is defined as a point midway between the maximum possible score and the expected chance score (for example, Criterion Score of 52 true-false test items = 1/2(52+52/2) - 39). 
Summary

A descriptive discussion of information from the questionnaire items was presented first. Then the multivariate repeated measures analysis was employed to test the fifteen null hypotheses. The hypothesis testing results were as follows:

1. There was no difference in measurement needs between elementary school teachers and secondary school teachers in Bangkok.
2. Certificate teachers had more measurement needs (indicated by lower scores on the test) than degree teachers.
3. There was no difference in measurement needs between teachers who had more teaching experience and those who had less teaching experience.
4. There were some differences in measurement needs among the four subject matter areas; Planning a Classroom Test seemed to be the most needed (lowest mean).
5. There was an interaction between level of school and level of teacher education. There was no difference between the mean of certificate secondary school teachers and the mean of certificate elementary school teachers, but degree elementary school teachers had more measurement needs than degree secondary school teachers.
6. There was no interaction between level of school and teaching experience.
7. There was no interaction between level of teacher education and teaching experience.
8. There was no three-way interaction among level of school, level of teacher education, and teaching experience.
9. There was no interaction between subject matter and level of school.
10. There was an ordinal interaction between subject matter and level of teacher education. The certificate teachers had more measurement needs on all subscales than degree teachers. The mean difference on the Item Analysis subscale, however, was not significant.
11. There was no interaction between subject matter and teaching experience.
12. There was no three-way interaction among subject matter, level of school, and level of teacher education.
13. There was no three-way interaction among subject matter, level of school, and teaching experience.
14. There was no three-way interaction among subject matter, level of teacher education, and teaching experience.
15. There was no four-way interaction among subject matter, level of school, level of teacher education, and teaching experience.

Since the subject matter by teacher education interaction was significant, interpretations of the profiles were made separately for certificate teachers and for degree teachers. Both groups of teachers got their lowest scores on the Planning a Classroom Test subscale. The degree teachers got their highest scores on the Test Score Statistics and Marking System subscale, but certificate teachers got their highest scores on the Item Analysis subscale. Degree teachers got higher scores than certificate teachers on all subscales.

The level of school by teacher education interaction was also significant. The total score means indicated that degree teachers in secondary schools got higher scores than degree teachers in elementary schools, but the mean of certificate teachers in secondary schools was slightly lower than the mean of certificate teachers in elementary schools. There was no significant difference between certificate elementary school teachers and certificate secondary school teachers, but a significant difference between degree elementary school teachers and degree secondary school teachers was found at the α = .05 level.

Analyses were also done to compare measurement needs (indicated by lower scores on the test) among various groups of teachers (defined by the following independent variables: taking a college measurement course, attending the training program in measurement, favoring a national testing program, and region of school), to observe the relationship between perceived needs and measurement needs, and to compare the mean differences between sample means and criterion scores.
Teachers who took a college measurement course had fewer measurement needs (indicated by higher scores on the test) than those who did not take a course. There were no significant differences in measurement needs between teachers who attended the training program and those who did not attend the program, between teachers who favored and did not favor the national testing program, or among teachers who taught in the five different regions. A relationship between perceived needs (indicated by the feeling of lacking desirable knowledge in measurement) and measurement needs was not found. There were significant differences between the means of the sample and the criterion scores on both the total mean and the subscale means. The results showed that the teachers in Bangkok had measurement needs in all four subject matter areas.

CHAPTER V

SUMMARY AND CONCLUSIONS

Summary

This study was aimed at providing data concerning the quality of the teacher as an evaluator to the administrators and educators in Thailand. It was the purpose of the study to find out which groups of teachers actually need in-service training and in which areas of measurement the need is the greatest. This study also yields some follow-up information on the effects of the previous in-service programs in measurement and the effects of a measurement course offered by the teacher-training institutions.

The population of interest of this study was the public elementary and secondary school teachers in Bangkok who are under the Ministry of Education. The instrument used in the study was a questionnaire concerning the teachers' opinions on national testing and their perceived needs in measurement, together with a true-false test measuring basic knowledge of educational measurement. The items were selected from the items used in a basic measurement course taught at Michigan State University.
The items in the pilot test covered basic knowledge in measurement and evaluation corresponding to the four subject matters that teachers should know. The four areas are: planning a classroom test, item writing, item analysis, and test score statistics and marking system. The pilot test was composed of 20 items in each subject matter area, for a total of 80 items. The instrument was translated into Thai and piloted on forty Thai elementary school teachers and forty Thai secondary school teachers, for a total of eighty teachers. The reliability of the pilot test was .40, the mean item difficulty (proportion of individuals giving an incorrect answer) was .53, and the mean item discrimination was .15. The items for each subscale were then selected separately for the final test. Within each of the four subject matters, the item discrimination indices and item difficulties were considered in deleting items. The final test contained fifty-two items, thirteen for each subscale.

The final instrument was composed of two parts:
1. A questionnaire concerning the teachers' opinions on national testing and their perceived needs in measurement. This part contained eight items.
2. Test items measuring basic knowledge in measurement and evaluation. The second part was composed of 52 true-false items, of which 20 statements were true and 32 statements were false. The mean item difficulty was .50 (varying between .19 and .76) for the Planning a Classroom Test subscale, .49 (varying between .16 and .81) for the Item Writing subscale, .49 (varying between .09 and .83) for the Item Analysis subscale, and .55 (varying between .16 and .80) for the Test Score Statistics and Marking System subscale.

The final instrument was sent to 540 Thai teachers who were randomly selected from twelve strata.
The stratification was based on three variables: level of school (elementary school or secondary school); level of teacher education (teaching certificate holders or bachelor's degree holders); and teaching experience (three years or less, four to ten years, or more than ten years). Because of the personal contacts and some help from the school principals, 69% of the responses (374 responses) were returned. Since the number of returned responses in each group varied from 30 to 33, fourteen responses (3.7% of responses) were randomly discarded in order to get thirty subjects in each group, for a total of 360 subjects. Because of the homogeneity of the population and the fact that only 3.7% of the responses were randomly discarded, it is believed that any distortion of the data is small.

The design of this study was a 2 x 2 x 3 factorial design with four repeated measures. The design was crossed and balanced with 30 observations per cell. The multivariate repeated measures analysis was employed to test the research hypotheses.

Since the interaction between subject matter and teacher education was significant, profile interpretations were made separately for certificate teachers and for degree teachers. Both groups of teachers received their lowest scores on the Planning a Classroom Test subscale. The degree teachers received their highest scores on the Test Score Statistics and Marking System subscale, but certificate teachers received their highest scores on the Item Analysis subscale. Degree teachers got higher scores than certificate teachers on every subscale.

The interaction between level of school and teacher education was also significant.
The mean of the total score indicated that degree teachers in secondary school received a higher mean score than degree teachers in elementary school, but the mean of certificate teachers in secondary school was slightly lower than the mean of certificate teachers in elementary school. There was no significant difference between certificate elementary school teachers and certificate secondary school teachers, but a significant difference between degree teachers in elementary school and degree teachers in secondary school was found at the α = .05 level.

The F ratios for testing two of the hypotheses concerning main effects were significant: the teacher education main effect and the subject matter main effect. The data from Table 4.1 showed that the certificate teachers had more measurement needs than the degree teachers. They also indicated that the measurement need was highest for Planning a Classroom Test and lowest for Item Analysis. However, the general profile could not be applied to all groups of teachers because the two-way interaction was significant.

Further comparisons were done to find whether there were any differences among various groups of teachers (defined by the following independent variables: taking a college measurement course, attending the training program in measurement, favoring a national testing program, and region of school). Comparisons were also done to observe the relationship between perceived needs and measurement needs, and to compare the mean differences between sample means and criterion scores.

The F ratio from Table 4.5 indicated that the teachers who took a college measurement course had fewer total measurement needs (indicated by higher total scores on the test) than those who did not take a course.
The same was true for the measurement needs on the Item Writing subscale, the Item Analysis subscale, and the Test Score Statistics and Marking System subscale, but not for the Planning a Classroom Test subscale. Although the teachers who took a college measurement course received a higher score on the Planning a Classroom Test subscale than those who did not take a course, the difference was not large enough to be significant. There was no significant difference in measurement needs between teachers who attended the training program in measurement and those who did not attend the program. No significant difference was found between teachers who favored and did not favor the national testing program. There was also no significant difference among the teachers who taught in the five different regions. A relationship between perceived needs (indicated by the feeling of lacking desirable knowledge in measurement) and measurement needs (indicated by lower test scores) was not found. There were significant differences between the means of the entire sample and the criterion scores on both the total mean and the subscale means. The analyses indicated that the teachers in Bangkok had measurement needs in all four subject matter areas.

Conclusions and Implications

A cross-tabulation of took/did not take a college measurement course against attended/did not attend the training program in measurement and evaluation showed that 85 of the 107 teachers who attended the training program took a college measurement course, and 22 did not take a course. Only 19% (69 teachers) of the total sample neither took a college measurement course nor attended the training program in measurement and evaluation.
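The remaining cells of this cross-tabulation follow from the reported figures by subtraction. The Python sketch below is a minimal illustration using only numbers given in the text and in Table 4.5 (360 teachers in total, 269 of whom took a college measurement course, 107 who attended the training program, 85 of whom did both); the function name and dictionary keys are hypothetical.

```python
def complete_crosstab(n_total, took_total, attended_total, attended_and_took):
    """Fill in a 2 x 2 cross-tabulation from its margins and one cell."""
    attended_not_took = attended_total - attended_and_took
    not_attended_took = took_total - attended_and_took
    neither = n_total - attended_and_took - attended_not_took - not_attended_took
    return {
        "attended_and_took": attended_and_took,
        "attended_not_took": attended_not_took,
        "not_attended_took": not_attended_took,
        "neither": neither,
    }


table = complete_crosstab(
    n_total=360,          # balanced sample size
    took_total=269,       # from Table 4.5
    attended_total=107,   # teachers who attended the training program
    attended_and_took=85,
)
percent_neither = round(100 * table["neither"] / 360)
```

The derived count of teachers with neither form of training agrees with the 69 teachers (19%) reported above, which suggests the published margins are internally consistent.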
There was a significant difference in measurement needs in favor of the teachers who took a college measurement course as compared to those who did not, and the results also indicated that the former group had fewer measurement needs than the latter group in Item Writing, Item Analysis, and Test Score Statistics and Marking System. On the Planning a Classroom Test subscale, however, there was no significant difference in test scores between those who took a measurement course and those who did not, suggesting that this area of measurement might not have been included in the content of the college measurement courses. The results seem to indicate that it would be appropriate to include or emphasize the Planning a Classroom Test area in future college measurement courses.

Because of the interaction effect between teacher education and subject matter, the measurement needs of each group of teachers were different. Therefore, the subject matter should be arranged according to the needs of the majority of teachers in each training session. The results of this study indicated that the teachers who hold a teaching certificate had more measurement needs than those who hold at least a bachelor's degree in every subject matter area. The area with the highest measurement needs for both certificate teachers and degree teachers was Planning a Classroom Test. The teachers who hold a teaching certificate had the lowest measurement needs in the Item Analysis area, whereas those who hold at least a bachelor's degree had the lowest measurement needs in the Test Score Statistics and Marking System area.
These results seem to indicate that the degree teachers had more mathematics background than the certificate teachers. Because measurement involves some mathematics, the holders of teaching certificates may decline the invitation to join the training program or to take the measurement course. It is strongly recommended that an introductory course in educational measurement be a requirement in the curriculum of the two-year and four-year teacher training programs.

Although the elementary school teachers who hold a teaching certificate received a lower mean score than those who hold at least a bachelor's degree on every subscale, the differences were not large enough to be significant. For secondary school teachers, however, significant differences between those who hold at least a bachelor's degree and those who hold a teaching certificate were found on every subscale except Item Analysis. These results may suggest that future in-service training programs in measurement should be arranged for elementary school teachers separately from secondary school teachers and, within secondary school teachers, for degree teachers separately from certificate teachers. The study also suggests that the elementary school teachers who hold at least a bachelor's degree had more measurement needs than the secondary school teachers with the same level of education. This may be the result of a lack of interest or of heavy teaching loads. Most Thai elementary school teachers taught all subjects, for thirty hours a week. They might have had little time to study or to pay attention to other professional activities.

In testing the difference between teachers who attended and did not attend the in-service training program in measurement, no significant difference was found between these two groups.
This result was supported by a previous study conducted by Sor-Wasna Pravalpruk (1974), who found no significant difference between the teachers in Khon Kaen, Thailand, who had attended the in-service program in measurement and those who had not. This result was probably caused by two factors. First, some of those who did not attend the training program had taken a college measurement course. Second, the in-service programs the teachers had attended were of limited duration. In the past, most of the in-service programs in measurement and evaluation in Thailand were five-day workshops. The material covered purposes of measurement, curriculum analysis as a blueprint for test construction, types of test items, item analysis, scores and norms, and reporting the test results. The morning sessions were lectures given by specialists in measurement; the afternoon sessions were practicums. It is recommended that future in-service training programs last longer than five days so that the teachers have enough time to practice and learn the material.

The significant differences between the means of the entire sample and the criterion scores indicated that the abilities of the Thai teachers in measurement and evaluation were below standard. The teachers in Bangkok had measurement needs in all four subject matter areas. Well-organized in-service programs should be offered to those teachers to develop their skills or to prepare them for new experiences in measurement and evaluation. A short course in measurement should be offered for the short-term effect. For the long-term effect, however, the teacher-training institutions should have full responsibility for improving the competence of the teachers in measurement. The teacher-training curriculum should be reconsidered, and the contents of a measurement course should be revised.
If continuing professional growth is to be taken seriously, administrators and teachers must pool their knowledge and resources and seek to make the in-service program and the college measurement course more responsive to the needs and interests of practicing classroom teachers. These programs could be expected to help school personnel become more familiar with test construction and evaluation.

Recommendations for Further Study

The previous in-service training program seemed to yield little benefit to the teachers in Bangkok. Perhaps this was because the design of the training program did not provide the functions necessary to meet the needs of the participants. Since a well-designed survey could serve as a learning experience for participants, a survey study should be done to provide information for planning any future training program in measurement so that it achieves the expected outcomes. The questionnaire should be sent to representatives of teachers and organizations in Bangkok to discover current attitudes about training in Bangkok and the needs of the participants, and to identify existing resources. The questions in the questionnaire might be divided into six categories as follows:

1. Attitudes toward popular participation in program design and implementation. How extensively should administrators be involved in the design of programs to upgrade their skills? Whom would they select to design a program?
2. Previous experience with training programs and attitudes toward the training program. Which other training programs have they attended? Was the training worthwhile? What type of training do they think is most useful?
3. Content of the training program. Measurement needs might be discovered by the test items.
4. Format of the training program. How long should the training program last and where should it be located? Should the training program be offered during a single session?
5. Resources.
Who should conduct the training program? What skills should a trainer have?
6. Techniques and materials. What types of learning situations do they prefer (informal discussion, lecture, workshop, etc.)? What types of support materials would be most useful to them?

Another study might be done to investigate the benefit of the training service. Any gain in knowledge after the training should be studied to compare the gain made by the teachers who hold a teaching certificate with the gain made by the teachers who hold at least a bachelor's degree. Pretest and post-test procedures should be used to investigate whether there is any growth in measurement knowledge after the teachers have participated in the program.

A follow-up study on the quality of the teacher-made tests should also be done. If there is no improvement in the quality of the teacher-made tests, it might be wasted effort to offer the in-service training program in measurement to the teachers.

The quality of the introductory course in educational measurement should also be investigated. The study might be done by mailing a questionnaire to all teacher-training institutions in Thailand to examine whether the contents of the measurement course correspond to the needs and interests of classroom teachers.

A national Test Bureau should be established to be a center of testing services and to carry on a national testing program and other educational testing programs. Standardized tests (both achievement and aptitude) should be developed for the purposes of guidance, selection, and diagnosis of student learning, and should be available to all teachers and school personnel. National norms and local norms for the standardized tests should also be constructed to allow for further interpretation of test results.

APPENDIX A

THE INSTRUMENT (FIRST EDITION)

Directions: This test consists of 80 statements about basic knowledge in measurement and evaluation.
You are to decide whether each statement is true or false. Please write the letter "T" in front of the true statements and "F" in front of the false statements. You do not need to identify yourself, but please answer each of the test items as accurately and as honestly as you can. There is no time limit in answering the test items.

Part I: Planning a Classroom Test

1. Useful measurements are necessarily objective.
2. A teacher's skill in constructing tests for a subject depends more upon his general skill in test construction than it does on the quality of his knowledge of that subject.
3. The aspects of achievement that multiple-choice tests can measure are more limited than is the case for short-answer tests.
4. The use of a variety of item types in an examination is likely to improve the validity of the examination.
5. The most valid classroom tests of achievement tend to be those that most students have time to finish.
6. Sampling errors tend to be less serious in essay tests than in objective tests.
7. The choice between essay or objective test forms should be made primarily on the basis of class size.
8. Since students show a wide range of individual differences, the ideal measurement situation would be achieved if each student could take a different test that was specially designed to test him.
9. True-false test items are easier to write but less efficient than multiple-choice test items.
10. To obtain objective measurement of achievement, it is necessary to use objective test items.
11. If 240 items are available for measuring achievement in a course, a more reliable composite measure of achievement is likely to be obtained if these items are administered at different times as three separate 80-item tests than if they are all administered at the same time as a single test.
12. The number of items to be included in a test should be determined primarily by the amount of material the test must cover.
13. A one-hour objective test ordinarily provides a more extensive sample of a student's achievements than a one-hour essay test.
14. Frequent testing is more beneficial in the lower grades than it is in high school or college.
15. One should choose among essay, true-false, multiple-choice and other item forms depending on the particular mental ability that is to be tested.
16. Good achievement tests include approximately equal numbers of very easy, easy, average, difficult, and very difficult items.
17. Experts agree that cheating can be eliminated by the use of open-book.
18. Either too little or too much testing can lead to unreliable measurements of achievement.
19. Individual differences are more clearly apparent when all students take the same test than when each takes a test specially designed to test him.
20. A test composed entirely of items of moderate difficulty (neither very easy nor very hard) can nevertheless discriminate well among the very best students and among the very poorest students.

Part II: Item Writing

21. To be appropriate for inclusion in an achievement test, an item should deal with an idea emphasized in instruction.
22. The item writer should seek to prevent a student from getting the correct answer by a process of eliminating incorrect answers.
23. If textbook wording is followed closely in phrasing multiple-choice test items, students may be able to respond correctly without understanding.
24. The distractors in a multiple-choice item should be plausibly attractive but definitely incorrect.
25. The response "None of the above" makes a good fourth or fifth response to almost any multiple-choice test item.
26. In order to discriminate properly, a multiple-choice test item must provide at least four alternative responses (possible answers).
27. Making some questions optional tends to improve the reliability of essay test scores.
28. Almost any good true test item can be converted into an equally good false item simply by inserting the word "not" in it.
29. Most of the sentences in a well written textbook could be used as the true statements in a true-false test.
30. True statements that do not provide good answers to the stem question often make good distractors.
31. If a question can not be given an absolutely correct answer, it should not be included in an achievement test.
32. It is better for an item writer to review the items he has written after several days have passed, than ask someone else to review them.
33. Good multiple-choice items can be written using only the correct answer and one incorrect alternative.
34. If a response is stated more carefully, and at greater length than the other responses in a multiple-choice test item, the chances are that it is the correct response.
35. Multiple-choice items which ask the student to pick one incorrect answer from among several correct answers tend to be highly discriminating.
36. Multiple-choice items whose stems are stated negatively tend to be more discriminating than those whose stems are stated positively.
37. The item writer should aim to produce items that will be answered correctly by most students of high achievement, and missed by most students of low achievement.
38. Multiple-choice test items that call for only a "best" answer, instead of a perfectly correct answer, tend to be less discriminating and more ambiguous.
39. The responses "All of the above," or "None of the above," are recommended for use in almost all multiple-choice test items.
40. Multiple-choice items can be converted to equally effective true-false items in almost all cases.

Part III: Item Analysis

41. A "medium difficulty" true-false item is answered correctly less often than a "medium difficulty" multiple-choice item.
42. If nine of ten students who score high on a test answer a particular item correctly, while two of ten who score low on the test answer it correctly, the index of discrimination is .70.
43. Item analysis is more useful to a teacher who re-uses items than to one who does not.
44. If extreme groups of 33% instead of 27% are used for item analysis the groups will be more alike in average ability.
45. A wide distribution of item difficulty values in a test is likely to lead to a wide distribution of pupil scores on the test.
46. Most item analyses are based on external criterion measures of what the test is designed to measure.
47. If 12 of 20 students answer a question correctly, all 12 of them should be expected to answer another, easier question correctly.
48. Item analysis data can help the item writer identify and correct sources of weakness in a multiple-choice test item.
49. It is reasonable to regard most objective test items whose indices of discrimination are above .30 as weak and in need of improvement.
50. If six of ten students who score high on a test answer a particular item correctly, while four of ten who score low on the test answer the same item correctly, the index of difficulty of the item is .50.
51. The main reason for using upper and lower groups each including 27% rather than 50% of the total group tested is to reduce the labor of counting responses.
52. If three-fourths of the examinees who take a test answer an item correctly, its index of discrimination is .75.
53. If an item is extremely easy it is likely to be low in discrimination.
54. To determine the index of discrimination of a test item one must first determine its index of difficulty.
55. Ordinarily test papers must be scored before the items can be analyzed.
56. In general, the more difficult an item in a classroom test the higher its power of discrimination is likely to be.
57. The primary goal of item selection, on the basis of indices of discrimination, is to increase test reliability.
58. It is better to select the criterion groups used in item analysis at random than on the basis of total test score.
59. If the scores on Test A are much more variable than the scores on Test B, the difficulty values for the items in Test A are also likely to be more variable than those for the items in Test B.
60. Good classroom test items should have indices of discrimination of .50 or more.

Part IV: Test Score Statistics and Marking Systems

61. In a frequency distribution of scores for which the mean is 78 and the median is 65, there must be more extremely high scores than extremely low scores.
62. If two sets of scores have different variances, they must have different standard deviations.
63. More than half of the scores in a typical distribution are located more than one standard deviation away from the mean.
64. In a set of test scores there are three scores in the fifties: 51, 53, and 59. The percentile ranks of scores 51 and 53 will be more nearly the same than the percentile ranks of scores 53 and 59.
65. If a student's raw score on Test A is larger than his raw score on Test B, his percentile rank on Test A should be larger also.
66. When scores on a test are converted to stanines, some pupils are likely to get stanine scores of -3.5.
67. It is possible to get a correlation coefficient of +1.20.
68. For a group of nine year-olds, the correlation between age in years and I.Q.'s will be precisely zero.
69. Differences from instructor to instructor in marking are inevitable and educationally desirable.
70. It is better for a marking system to report absolute than relative achievement.
71. Percentage marks were intended to report the proportion learned of that which might have been learned.
72. Increasing the number of categories of marks tends to increase the reliability of the marks.
73. If a set of eight scores includes two eights, two sevens, two fives, and two fours, the median value is six.
74. The distribution of the scores 5, 4, 3, 2, 1 is approximately normal.
75. If in a distribution of 100 scores there are four scores of 28 and 30 scores lower than 28, the percentile rank of 28 is 32.
76. When students are grouped by ability levels the policy of giving a higher proportion of A's in the more able group is justifiable.
77. By using fewer, broader categories in marking a teacher can reduce the proportion of incorrect marks he issues without seriously reducing the amount of useful information he reports.
78. Stanine marks are likely to be more reliable than five-letter marks.
79. No instructor is entitled to criticize the distribution of marks in another instructor's course.
80. The distribution of marks in all classes should be approximately the same regardless of differences in the general levels of ability of the students in the different classes.

APPENDIX B

THE INSTRUMENT (FINAL EDITION)

Directions: You do not need to identify yourself. Please answer each of the questions on this page and the following pages as accurately and as honestly as you can. There is no time limit in answering this questionnaire.

PART I: General Information. Please check the appropriate categories.

1. Level of school you teach:
[ ] Elementary school
[ ] Secondary school
2. Level of your education:
[ ] Certificate or lower
[ ] Bachelor degree or higher
3. Teaching experience:
[ ] 0 - 3 years
[ ] 4 - 10 years
[ ] More than 10 years
4. Did you take any measurement course when you studied in college?
[ ] Yes
[ ] No
5. Did you attend in-service training programs in measurement and evaluation?
[ ] Yes
[ ] No
If the answer is "yes," go to question 5.1. If the answer is "no," go to question 5.2.
5.1. Was the training program worthwhile?
[ ] Yes
[ ] No
5.2. If the Ministry of Education offers the training program in measurement and evaluation, will you participate in that program?
[ ] Yes
[ ] No
6. Are you in favor of national testing?
[ ] Yes
[ ] No
7. What area of measurement do you know most? (check only one)
[ ] Planning a classroom test
[ ] Item writing
[ ] Item analysis
[ ] Test score statistics and marking systems
8. What area of measurement do you know least? (check only one)
[ ] Planning a classroom test
[ ] Item writing
[ ] Item analysis
[ ] Test score statistics and marking systems

PART II: Basic Knowledge in Measurement and Evaluation

The following are statements about basic knowledge in measurement and evaluation. You are to decide whether each statement is true or false. Please write the letter "T" in front of the true statements and "F" in front of the false statements. Please try to answer all of the 52 statements.

1. It is necessary to use different test forms to test different abilities.
2. The aspects of achievement that multiple-choice tests can measure are more limited than is the case for short-answer tests.
3. The use of a variety of item types in an examination is likely to improve the validity of the examination.
4. The most valid classroom tests of achievement tend to be those that most students have time to finish.
5. Sampling errors tend to be less serious in essay tests than in objective tests.
6. Since students show a wide range of individual differences, the ideal measurement situation would be achieved if each student could take a different test that was specially designed to test him.
7. To obtain objective measurement of achievement, it is necessary to use objective test items.
8. If 240 items are available for measuring achievement in a course, a more reliable composite measure of achievement is likely to be obtained if these items are administered at different times as three separate 80-item tests than if they are all administered at the same time as a single test.
9. A one-hour objective test ordinarily provides a more extensive sample of a student's achievements than a one-hour essay test.
10. Frequent testing is more beneficial in the lower grades than it is in high school or college.
11. One should choose among essay, true-false, multiple-choice and other item forms depending on the particular mental ability that is to be tested.
12. Good achievement tests include approximately equal numbers of very easy, easy, average, difficult, and very difficult items.
13. A test composed entirely of items of moderate difficulty (neither very easy nor very hard) can nevertheless discriminate well among the very best students, and among the very poorest students.
14. To be appropriate for inclusion in an achievement test an item should deal with an idea emphasized in instruction.
15. If textbook wording is followed closely in phrasing multiple-choice test items, students may be able to respond correctly without understanding.
16. The distractors in a multiple-choice item should be plausibly attractive but definitely incorrect.
17. The response "None of the above" makes a good fourth or fifth response to almost any multiple-choice test item.
18. Making some questions optional tends to improve the reliability of essay test scores.
19. Most of the sentences in a well written textbook could be used as the true statements in a true-false test.
20. True statements that do not provide good answers to the stem question often make good distractors.
21. If a question can not be given an absolutely correct answer, it should not be included in an achievement test.
22. It is better for an item writer to review the items he has written after several days have passed, than ask someone else to review them.
23. Multiple-choice items which ask the student to pick one incorrect answer from among several correct answers tend to be highly discriminating.
24. Multiple-choice items whose stems are stated negatively tend to be more discriminating than those whose stems are stated positively.
25. The item writer should aim to produce items that will be answered correctly by most students of high achievement, and missed by most students of low achievement.
26. The responses "All of the above," or "None of the above," are recommended for use in almost all multiple-choice test items.
27. A "medium difficulty" true-false item is answered correctly more often than a "medium difficulty" multiple-choice item.
28. If extreme groups of 33% instead of 27% are used for item analysis the groups will be more alike in average ability.
29. A wide distribution of item difficulty values in a test is likely to lead to a wide distribution of pupil scores on the test.
30. If 12 of 20 students answer a question correctly, all 12 of them should be expected to answer another, easier question correctly.
31. Item analysis data can help the item writer identify and correct sources of weakness in a multiple-choice test item.
32. If six of ten students who score high on a test answer a particular item correctly, while four of ten who score low on the test answer the same item correctly, the index of difficulty of the item is .50.
33. The main reason for using upper and lower groups each including 27% rather than 50% of the total group tested should be to reduce the labor of counting responses.
34. If three-fourths of the examinees who take a test answer an item correctly, its index of discrimination is .75.
35. If an item is extremely easy it is likely to be low in discrimination.
36. To determine the index of discrimination of a test item one must first determine its index of difficulty.
37. The primary goal of item selection, on the basis of indices of discrimination, is to increase test reliability.
38. If the scores on Test A are much more variable than the scores on Test B, the difficulty values for the items in Test A are also likely to be more variable than those for the items in Test B.
Good classroom test items should have indices of discrimination of .50 or more. In a frequency distribution of scores for which the mean is 78 and the median is 65, there must be more extremely high scores than extremely low scores. If two sets of scores have different variances, they must have different standard deviations. More than half of the scores in a typical distribution are located more than one standard deviation away from the mean. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 93 If a student's raw score on Test A is larger than his raw score on Test B, his percentile rank on Test A should be larger also. When scores on a test are converted to stanines, some pupils are likely to get stanine scores of -3.5. Differences from instructor to instructor in marking are inevitable and educationally desirable. It is better for a marking system to report absolute than relative achievement. Percentage marks were intended to report the proportion learned of that which might have been learned. If two sets of scores have different means they must have differ- ent variances. When students are grouped by ability levels the policy of giving a higher proportion of A's in the more able group is justifiable. Stanine marks are likely to be more reliable than five-letter marks. No instructor is entitled to criticize the distribution of marks in another instructor's course. The distribution of marks in all classes should be approximately the same regardless of differences in the general levels of ability of the students in the different classes. 94 U uwaaun'mmnu zfl'flmfimfiumsfinuauazn nub : tfluuan '11 fine“: nhfiuw I U uuuaaumuofiuflnwaon tflu b mu Hi: «and o tfiusnuazxfiumfimd‘uqaaufl U U u e: no uaznaud b tflumnufimfiunwmnuzfl'flflLfi'mfiumfi'mauazmnhztfluuamsfinm U S cm nananu . ' . v v ‘ . nononunqmmumnnsumummuunuamu Tnuluam \fiuufl'auazunuaqgavlu . O O ‘ uuuanumu “3313511111117! 
[Thai-language version of the instrument: teacher questionnaire and true-false measurement test. The Thai text is not recoverable from the scan. English glosses that remain legible among the items include: short answer tests, validity, essay tests, objective tests, reliability, upper group, lower group, indices of discrimination, mean, median, variance, standard deviation, and normal distribution.]
BIBLIOGRAPHY

Allen, M. E. "Status of measurement courses for undergraduates in teacher-training institutions." The 13th Yearbook of the National Council on Measurement in Education. 1956: 67-73.

Brimm, J. L. and D. J. Tollett. "How do teachers feel about in-service education?" Educational Leadership. 31 (March 1974): 521-525.

Conant, J. B. The Education of American Teachers. New York: McGraw-Hill Book Company, 1963.

Durost, W. N. "Problems in in-service training of teachers in the use of measurement and evaluation techniques." The 16th Yearbook of the National Council on Measurement in Education. 1959: 31-33.

Ebel, R. L. "Standardized achievement tests: Use and limitation." National Elementary School Principal. 40 (1961a): 29-32.

Ebel, R. L. "Improving the competence of teachers in educational measurement." Clearing House. 36 (October 1961b): 67-71.

Ebel, R. L. "Can teachers write good true-false test items?" Journal of Educational Measurement. 12 (Spring 1975): 31-35.

Ebel, R. L. Essentials of Educational Measurement. Prentice-Hall, Inc., 1979.

Erickson, R. C. and T. L. Wentling. Measuring Student Growth: Techniques and Procedures for Occupational Education. Allyn and Bacon, Inc., 1976.

Feldhusen, J. F. "Student perceptions of frequent quizzes and post-mortem discussions of tests." Journal of Educational Measurement. 1 (Jan. 1964): 51-54.

Finn, J. D. Finn's Multivariance - Univariate and Multivariate Analysis of Variance, Covariance, and Regression. Modified and adapted for use on the CDC-6500 at Michigan State University by Verda M. Scheifley and William H. Schmidt, 1973.

Fleming, Margaret. "Standardized tests revisited." School Counselor. 19 (November 1971): 71-72.

Goslin, D. A. Teachers and Testing. New York: Russell Sage Foundation, 1967.

Lien, A. J. Measurement and Evaluation of Learning. Wm. C. Brown Company Publishers, 1971.

Lindvall, C. M. Measuring Pupil Achievement and Aptitude. Harcourt, Brace & World, Inc., 1967.

Marso, R. N. "Classroom testing procedures, test anxiety and achievement." Journal of Experimental Education. 38 (Spring 1970): 54-58.

Mayo, S. T. "Preservice preparation of teachers in educational measurement." Final report. Chicago: Loyola University, 1967.

Mehrens, W. A. "The technology of competency measurement." In R. B. Ingle, M. R. Carroll, & W. J. Gephart (Eds.) Assessment of Student Competence. Bloomington, Indiana: Phi Delta Kappa, 1979.

Mehrens, W. A. and I. J. Lehmann. Measurement and Evaluation in Education and Psychology. New York: Holt, Rinehart and Winston, Inc., 1978.

Miller, W. C. "What's wrong with in-service education? It's topless!" Educational Leadership. 35 (October 1977): 31-33.

Nie, N. H., C. Hull, J. Jenkins, K. Steinbrenner, and D. Bent. Statistical Package for the Social Sciences. McGraw-Hill Book Company, 1975.

Noll, V. H. "Requirements in educational measurement for prospective teachers." School and Society. 82 (September 1955): 88-90.

Noll, V. H. "Problems in the pre-service preparation of teachers in measurement." The 18th Yearbook of the National Council on Measurement in Education. 1961: 35-42.

Noll, V. H., D. P. Scannell, and R. C. Craig. Introduction to Educational Measurement. Houghton Mifflin Company, 1979.

Olejnik, S. F. "Standardized achievement programs viewed from the perspective of non-measurement specialist." Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, April, 1979.

Office of the National Education Commission. A Study of Primary Schooling in Thailand. Final report. Bangkok, Thailand, 1977.

Pravalpruk, S. Comparison among Teachers in Khon Kaen, Thailand, to Determine Their Testing Needs. Dissertation, Michigan State University, 1974.

Sack, W. M. Measurement Competencies of Educators Defined Through Task Analysis and Differentiated by Teaching Area, Grade Level, and Vocation. Dissertation, Loyola University of Chicago, 1979.

Stanley, J. C. Measurement in Today's Schools. Prentice-Hall, Inc., 1964.

Stanley, J. C. "ABC's of test construction." In T. M. Covin (Ed.) Classroom Test Construction. MSS Information Corporation, New York, 1974.

Stanley, J. C., and K. D. Hopkins. Educational and Psychological Measurement and Evaluation. Prentice-Hall, Inc., 1972.

Stetz, F. A Report of a Market Survey to Sample Opinions of Sales Representatives and Users, Concerning the 1973 Stanford Achievement Test. Market Research Report No. 10. New York: The Psychological Corporation, 1977: 25.

Stevenson, M. "The role of the classroom teacher in school testing programs." The 16th Yearbook of the National Council on Measurement in Education. 1959: 43-46.

Thorndike, R. L., and E. Hagen. Measurement and Evaluation in Psychology and Education. John Wiley & Sons, Inc., 1969.

Yeh, J. P. "Test use in schools." Washington, D.C.: U. S. Department of Health, Education and Welfare, and National Institute of Education, 1978.

Zigarmi, P., L. Betz, and D. Jensen. "Teachers' preferences in and perceptions of in-service education." Educational Leadership. 34 (April 1977): 545-551.