This is to certify that the dissertation entitled DATA MINING FOR A WEB-BASED EDUCATIONAL SYSTEM presented by Behrooz Minaei-Bidgoli has been accepted towards fulfillment of the requirements for the Doctoral degree in Computer Science and Engineering.

Major Professor's Signature

Date

DATA MINING FOR A WEB-BASED EDUCATIONAL SYSTEM

By

Behrooz Minaei-Bidgoli

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

2004

ABSTRACT

DATA MINING FOR A WEB-BASED EDUCATIONAL SYSTEM

By

Behrooz Minaei-Bidgoli

Web-based educational technologies allow educators to study how students learn (descriptive studies) and which learning strategies are most effective (causal/predictive studies). Since web-based educational systems are capable of collecting vast amounts of student profile data, data mining and knowledge discovery techniques can be applied to find interesting relationships between attributes of students, assessments, and the solution strategies adopted by students. The focus of this dissertation is three-fold: 1) to introduce an approach for predicting student performance; 2) to use clustering ensembles to build an optimal framework for clustering web-based assessment resources; and 3) to propose a framework for the discovery of interesting association rules within a web-based educational system. Taken together and used within the online educational setting, the value of these tasks lies in improving student performance and the effective design of the online courses.

First, this research presents an approach to classifying student characteristics in order to predict performance on assessments based on features extracted from logged data in a web-based educational system. We show that a significant improvement in classification performance is achieved by using a combination of multiple classifiers. Furthermore, by "learning" an appropriate weighting of the features via a genetic algorithm (GA), we have successfully improved the accuracy of the combined classifier performance by another 10-12%. Such classification is the first step towards a "recommendation system" that will provide valuable, individualized feedback to students.

Second, this project extends previous theoretical work regarding clustering ensembles with the goal of creating an optimal framework for categorizing web-based educational resources.
We propose both non-adaptive and adaptive resampling schemes for the integration of multiple clusterings (independent and dependent). Experimental results show improved stability and accuracy for clustering structures obtained via bootstrapping, subsampling, and adaptive techniques. These improvements offer insights into specific associations within the data sets.

Finally, this study turns toward developing a technique for discovering interesting associations between student attributes, problem attributes, and solution strategies. We propose an algorithm for the discovery of "interesting" association rules within a web-based educational system. The main focus is on mining interesting contrast rules, which are sets of conjunctive rules describing interesting characteristics of different segments within a population. In the context of web-based educational systems, contrast rules help to identify attributes characterizing patterns of performance disparity between various groups of students. We propose a general formulation of contrast rules as well as a framework for finding such patterns. Examining these contrasts can improve the online educational systems for both teachers and students, allowing for more accurate assessment and more effective evaluation of the learning process.

This dissertation is dedicated to:
my parents,
my wife,
my son Mohsen,
my daughters Maryam and Zahra,
and to whoever serves the Truth for the Truth itself.

Acknowledgements

It is with much appreciation and gratitude that I thank the following individuals who dedicated themselves to the successful completion of this dissertation. Dr. Bill Punch, my major professor and advisor, maintained his post by my side from the inception to the completion of this research project. He gave me guidance throughout my research. Without his knowledge, patience, and support this dissertation would not have been possible. The knowledge and friendship I gained from him will definitely influence the rest of my life.

Other teachers have influenced my thinking and guided my work. I would also like to thank Professor Anil Jain. It was my honor to be a student in his pattern recognition classes, which defined my interest in the field for many years to come. His professional guidance has been a source of inspiration and vital in the completion of this work.

I owe many inspirational ideas to Dr. Pang-Ning Tan, whose keen insights and valuable discussions often gave me stimulating ideas in research. Though kept busy by his work, he is always willing to share his time and knowledge with me. For these reasons, I wish he would have been at Michigan State University when I began my research.

I also owe many thanks to Dr. Gerd Kortemeyer for his extraordinary support and patience. His productive suggestions and our discussions contributed enormously to this work. From the first time I stepped through the door of the Lite Lab in the Division of Science and Mathematics Education, until now, Gerd has allowed me freedom to explore my own directions in research. He supported me materially and morally for many years, and for that I am very grateful.

I am grateful to Dr. Esfahanian for taking time to serve on
the guidance committee and overseeing this work. I deeply appreciate his support and insightful suggestions. His great organization of the graduate office in the Computer Science Department helps advance graduate research. I have learned much from him, both as a teacher and as a friend.

I am also grateful to my colleagues in the GARAGe Laboratory, especially Alexander Topchy, and all my friends in the LON-CAPA developers group: Helen Keefe, Felicia Berryman, Stuart Raeburn, and Alexander Sakharuk. Discussions with our colleagues Guy Albertelli (LON-CAPA head developer), Matthew Brian Hall, Dr. Edwin Kashy, and Dr. Deborah Kashy were particularly useful. The discussions with them substantially contributed to my work and broadened my knowledge.

I offer my special thanks to Steven Forbes Tuckey in the Writing Center for editing this dissertation and his proof-reading. I am grateful to him for the countless hours of constructive discussion and his informative revisions of my work.

Last but not least, many thanks go to my wife; without her love, care, encouragement, and continuous support, I would not be who I am today. Above all, I thank God the Almighty for blessing me with the strength, health, and will to prevail and finish this work.

My colleagues in LON-CAPA and I are thankful to the National Science Foundation for grants supporting this work through the Information Technology Research (ITR 0085921) and the Assessment of Student Achievement (ASA 0243126) programs. Support in earlier years of this project was also received from the Alfred P. Sloan and the Andrew W. Mellon Foundations. We are grateful to our own institution, Michigan State University, and to its administrators for over a decade of encouragement and support.

Table of Contents

List of Tables .......... xiii
List of Figures .......... xvi

CHAPTER 1 INTRODUCTION 1
1.1 STATEMENT OF THE PROBLEM .......... 1
1.2 DATA MINING .......... 5
1.2.1 What is Data Mining? .......... 6
1.2.2 Data Mining Methods .......... 8
1.2.3 Predictive tasks .......... 9
1.2.4 Descriptive tasks .......... 9
1.2.5 Mixed tasks .......... 10
1.3 ONLINE EDUCATION SYSTEMS .......... 11
1.3.1 LON-CAPA, System Overview .......... 12
1.3.2 LON-CAPA Topology ..........
12
1.3.3 Data Distribution in LON-CAPA .......... 15
1.3.4 Resource Variation in LON-CAPA .......... 16
1.3.5 LON-CAPA Strategy for Data Storage .......... 17
1.3.6 Resource Evaluation in LON-CAPA .......... 19
1.4 INTELLIGENT TUTORING SYSTEMS (ITSS) .......... 20
1.4.1 Learning Enhancement in ITSs .......... 22
1.4.2 Basic Architecture of an ITS .......... 24
1.4.3 Learning and Cognition Issues for ITS Development and Use .......... 27
1.5 SUMMARY .......... 29

CHAPTER 2 BACKGROUND ON DATA MINING METHODS 30
2.1 CLASSIFICATION AND PREDICTIVE MODELING .......... 30
2.1.1 Bayesian Classifier .......... 31
2.1.2 Decision tree-based method .......... 33
2.1.2.1 What is the best feature for splitting? .......... 34
2.1.2.1.1 Entropy impurity .......... 35
2.1.2.1.2 Gini impurity .......... 36
2.1.2.1.3 Twoing impurity .......... 36
2.1.2.2 How to Avoid Overfitting .......... 37
2.1.2.2.1 Cross-Validation .......... 37
2.1.2.2.2 Setting a threshold .......... 37
2.1.2.2.3 Pruning .......... 38
2.1.2.3 Drawbacks of decision tree .......... 39
2.1.3 Neural Network Approach .......... 39
2.1.4 k-Nearest Neighbor (kNN) Decision Rule .......... 42
2.1.5 Parzen Window classifier .......... 43
2.2 CLUSTERING .......... 44
2.2.1 Partitional Methods ..........
45
2.2.1.1 k-mean Algorithm .......... 45
2.2.1.2 Graph Connectivity .......... 48
2.2.1.3 Nearest Neighbor Method .......... 49
2.2.1.4 Mixture density clustering .......... 49
2.2.1.5 Mode Seeking .......... 50
2.2.1.6 k-medoids .......... 51
2.2.1.6.1 Partitioning Around Medoids (PAM) .......... 51
2.2.1.6.2 CLARA .......... 52
2.2.1.6.3 CLARANS .......... 52
2.2.1.7 Other partitional methods for large data sets .......... 53
2.2.2 Hierarchical Methods .......... 54
2.2.2.1 Traditional Linkage Algorithms .......... 55
2.2.2.2 BIRCH .......... 56
2.2.2.3 CURE .......... 57
2.3 FEATURE SELECTION AND EXTRACTION .......... 58
2.3.1 Minimizing the cost .......... 59
2.3.2 Data Visualization .......... 59
2.3.3 Dimensionality Reduction .......... 59
2.3.4 Feature Selection .......... 60
2.3.5 Feature Extraction .......... 61
2.4 SUMMARY .......... 63

CHAPTER 3 DATA REPRESENTATION & ASSESSMENT TOOLS IN LON-CAPA 64
3.1 DATA ACQUISITION AND EXTRACTING THE FEATURES .......... 64
3.1.1 Preprocessing student database .......... 64
3.1.2 Preprocessing Activity Log .......... 67
3.1.3 Extractable Features .......... 68
3.2 FEEDBACK TO THE INSTRUCTOR FROM ONLINE HOMEWORK .......... 69
3.2.1 Feedback tools ..........
69
3.2.2 Student Evaluation .......... 71
3.2.3 Conceptual Problems .......... 77
3.2.4 Homework and Examination Problem Evaluation .......... 83
3.3 SUMMARY .......... 89

CHAPTER 4 PREDICTING STUDENT PERFORMANCE 91
4.1 DATA SET AND CLASS LABELS .......... 92
4.2 CLASSIFIERS .......... 95
4.2.1 Non-tree based classifiers .......... 95
4.2.1.1 Combination of Multiple Classifiers (CMC) .......... 96
4.2.1.2 Normalization .......... 97
4.2.1.3 Comparing 2-fold and 10-fold Cross-Validation .......... 98
4.2.1.4 Results, Error Estimation .......... 100
4.2.2 Decision Tree-based software .......... 105
4.2.2.1 C5.0 .......... 106
4.2.2.2 CART .......... 109
4.2.2.3 QUEST, CRUISE .......... 113
4.2.3 Final Results without optimization .......... 118
4.3 OPTIMIZING THE PREDICTION ACCURACY .......... 119
4.3.1 Genetic Algorithms (GAs) .......... 119
4.3.1.1 What is a Simple GA (SGA)? .......... 120
4.3.1.2 Specific use of GAs in pattern classification .......... 120
4.3.2 Implementation of a GA to optimize the prediction accuracy .......... 122
4.3.2.1 GA Operators .......... 123
4.3.2.1.1 Recombination .......... 123
4.3.2.1.2 Crossover .......... 124
4.3.2.1.3 Mutation .......... 124
4.3.2.2 Fitness Function .......... 125
4.3.3 Experimental Results of GA Optimization .......... 126
4.4 EXTENDING THE WORK TOWARD MORE LON-CAPA DATA SETS ..........
131
4.4.1 Experimental Results .......... 134
4.5 SUMMARY .......... 138

CHAPTER 5 ENSEMBLES OF MULTIPLE CLUSTERINGS 140
5.1 INTRODUCTION .......... 141
5.2 TAXONOMY OF DIFFERENT APPROACHES .......... 144
5.3 NON-ADAPTIVE ALGORITHMS .......... 148
5.3.1 Similarity-based algorithm .......... 149
5.3.2 Algorithms based on categorical clustering .......... 150
5.4 CONSENSUS FUNCTIONS .......... 153
5.4.1 Co-association based functions .......... 153
5.4.2 Quadratic Mutual Information Algorithm (QMI) .......... 153
5.4.3 Hypergraph partitioning .......... 154
5.4.4 Voting approach .......... 154
5.5 CRITICAL ISSUES IN RESAMPLING .......... 155
5.5.1 Variable number of samples .......... 155
5.5.2 Repetitive data points (objects) .......... 156
5.5.3 Similarity estimation .......... 156
5.5.4 Missing labels .......... 157
5.5.5 Re-labeling .......... 157
5.5.6 Adaptation of the k-means algorithm .......... 158
5.6 ADAPTIVE SAMPLING SCHEME .......... 158
5.7 EXPERIMENTAL STUDY ON NON-ADAPTIVE APPROACHES .......... 163
5.7.1 Data sets .......... 164
5.7.2 The role of algorithm's parameters .......... 165
5.7.3 The Role of Consensus Functions (Bootstrap algorithm) .......... 166
5.7.4 Effect of the Resampling method (Bootstrap vs. Subsampling) .......... 169
5.8 EMPIRICAL STUDY AND DISCUSSION OF ADAPTIVE APPROACH .......... 174
5.9 CONCLUDING REMARKS .......... 176

CHAPTER 6 ASSOCIATION ANALYSIS IN LON-CAPA 178
6.1 INTRODUCTION .......... 178
6.2 BACKGROUND .......... 181
6.2.1 Association analysis ..........
181
6.2.2 Data mining for online education systems .......... 183
6.2.3 Related work .......... 184
6.3 CONTRAST RULES .......... 185
6.4 ALGORITHM .......... 190
6.5 EXPERIMENTS .......... 192
6.5.1 Data model .......... 193
6.5.2 Data sets .......... 196
6.5.3 Results .......... 197
6.5.3.1 Difference of confidences .......... 199
6.5.3.2 Difference of Proportions .......... 200
6.5.3.3 Chi-square .......... 200
6.6 CONCLUSION .......... 201

CHAPTER 7 SUMMARY 203
7.1 SUMMARY OF THE WORK .......... 203
7.1.1 Predicting Student Performance .......... 204
7.1.2 Clustering ensembles .......... 205
7.1.3 Interesting association rules .......... 206
7.2 FUTURE WORK .......... 207

APPENDIX A: TREE CLASSIFIERS OUTPUT 208
C5.0 .......... 208
CART .......... 213
QUEST .......... 221
CRUISE .......... 227

BIBLIOGRAPHY 230

List of Tables

Table 1.1 Different specific ITSs and their effects on learning rate ..........
23
Table 3.1 A sample of a student's homework results and submissions .......... 73
Table 3.2 Statistics table including general statistics of every problem of the course (Homework Set 1) .......... 85
Table 3.3 Analysis of Examination Problems (N=393), DoDiff = Difficulty Index, DoDisc = Discrimination Index .......... 87
Table 4.1 9-Class labels regarding students' grades in course PHY183_SS02 .......... 93
Table 4.2 3-Class labels regarding students' grades in course PHY183_SS02 .......... 93
Table 4.3 2-Class labels regarding students' grades in course PHY183_SS02 .......... 93
Table 4.4 Comparing Error Rate of classifiers with and without normalization in the case of 3 classes .......... 98
Table 4.5 Comparing Error Rate of classifiers with 2-fold and 10-fold Cross-Validation in the case of 3 classes .......... 100
Table 4.6 Comparing the Performance of classifiers, in all cases: 2-Classes, 3-Classes, and 9-Classes, using 10-fold Cross-Validation in all cases .......... 105
Table 4.7 Variable (feature) Importance in 2-Classes, Using Gini Criterion .......... 111
Table 4.8 Variable (feature) Importance in 2-Classes, Using Entropy Criterion .......... 111
Table 4.9 Comparing the Error Rate in CART, using 10-fold Cross-Validation in learning and testing sets .......... 112
Table 4.10 Comparing the Error Rate in CART, using the Leave-One-Out method in learning and testing sets .......... 112
Table 4.11 Comparing the Error Rate of all classifiers on the PHY183 data set in the cases of 2-Classes, 3-Classes, and 9-Classes, using 10-fold cross-validation, without optimization .......... 118
Table 4.12 Comparing the CMC Performance on the PHY183 data set using GA and without GA in the cases of 2-Classes, 3-Classes, and 9-Classes, 95% confidence interval .......... 128
Table 4.13 Feature Importance in 3-Classes, Using Entropy Criterion .......... 130
Table 4.14 14 of the LON-CAPA courses at MSU .......... 131
Table 4.15 Characteristics of 14 MSU courses which were held by LON-CAPA .......... 132
Table 4.16 Comparing the average performance (%) of ten runs of classifiers on the given datasets using 10-fold cross-validation, without GA ..........
134
Table 4.17 Comparing the classification fusion performance on the given datasets, without GA, using GA (mean individual) and improvement, 95% confidence interval .......... 135
Table 4.18 Relative Feature Importance (%), Using GA weighting for the BS111 2003 course .......... 137
Table 4.19 Feature Importance for BS111 2003, using decision-tree software CART, applying Gini Criterion .......... 138
Table 5.1 (a) Data points and feature values, N rows and d columns. Every row of this table shows a feature vector corresponding to N points. (b) Partition labels for resampled data, n rows and B columns .......... 151
Table 5.2 An illustrative example of re-labeling difficulty involving five data points and four different clusterings of four bootstrap samples. The numbers represent the labels assigned to the objects and the "?" shows the missing labels of data points in the bootstrapped samples .......... 157
Table 5.3 Consistent re-labeling of 4 partitions of 12 objects .......... 160
Table 5.4 A summary of data sets characteristics .......... 164
Table 5.5 "Star/Galaxy" data experiments. Average error rate (% over 10 runs) of clustering combination using resampling algorithms with different numbers of components in combination B, resolutions of components k, and types of consensus functions .......... 169
Table 5.6 The average error rate (%) of classical clustering algorithms. An average over 100 independent runs is reported for the k-means algorithms .......... 172
Table 5.7 Summary of the best results of Bootstrap methods .......... 172
Table 5.8 Subsampling methods: trade-off among the values of k, the number of partitions B, and the sample size S. The last column denotes the percentage of sample size relative to the entire data set. (Bold represents most optimal) .......... 173
Table 6.1 A contingency table of student success vs. study habits for an online course .......... 179
Table 6.2 A contingency table proportional to Table 6.1 .......... 185
Table 6.3 A contingency table for the binary case .......... 185
Table 6.4 Characteristics of three MSU courses which used LON-CAPA in fall semester 2003 .......... 197
Table 6.5 LBS_271 data set, difference of confidences measure .......... 199
Table 6.6 CEM_141 data set, difference of confidences measure .......... 199
Table 6.7 BS_111 data set, difference of proportion measure .......... 200
Table 6.8 CEM_141 data set, chi-square measure .......... 200
Table 6.9 LBS_271 data set, difference of confidences measure .......... 201

List of Figures

Figure 1.1 Steps of the KDD Process (Fayyad et al., 1996) .......... 6
Figure 1.2 A schema of distributed data in LON-CAPA .......... 14
Figure 1.3 Directory listing of user's home directory .......... 17
Figure 1.4 Directory listing of course's home directory .......... 18
Figure 1.5 Distributions for different learning conditions (Adapted from Bloom, 1984) .......... 22
Figure 1.6 Components of an Intelligent Tutoring System (ITS) .......... 24
Figure 2.1 The Bayesian Classification Process (Adapted from Wu et al., 1991) .......... 31
Figure 2.2 A Three-Layer Feedforward Neural Network (Lu et al., 1995) .......... 40
Figure 3.1 A sample of stored data in escape sequence code .......... 65
Figure 3.2 Perl script code to retrieve stored data .......... 65
Figure 3.3 A sample of retrieved data from activity log .......... 65
Figure 3.4 Structure of stored data in activity log and student database .......... 66
Figure 3.5 A sample of extracted Activity.log data .......... 67
Figure 3.6 A small excerpt of the performance overview for a small introductory physics class .......... 72
Figure 3.7 Single-student view of a problem .......... 74
Figure 3.8 Compiled student responses to a problem .......... 75
Figure 3.9 One Early Measure of a Degree of Difficulty .......... 76
Figure 3.10 Success (%) in Initial Submission for Selecting the Correct Answer to Each of Six 'Concept' Statements .......... 78
Figure 3.11 Success Rate on Second and Third Submissions for Answers to Each of Six 'Concept' Statements .......... 79
Figure 3.12 Randomly Labeled Conceptual Physics Problem .......... 79
Figure 3.13 Vector Addition Concept Problem .......... 81
Figure 3.14 Upper Section: Success Rate for Each Possible Statement. Lower Section: Relative Distribution of Incorrect Choices, with Dark Gray as "Greater Than", Light Gray as "Less Than", and Clear as "Equal To" .......... 83
Figure 3.15 Grades on the first seven homework assignments and on the first two midterm examinations .......... 88
Figure 3.16 Homework vs. Exam Scores. The highest bin has 18 students .......... 89
Figure 4.1 Graph of distribution of grades in course PHY183 SS02 .......... 92
Figure 4.2 Comparing Error Rate of classifiers with 10-fold Cross-Validation in the case of 2-Classes .......... 101
Figure 4.3 Table and graph to compare Classifiers' Error Rate, 10-fold CV in the case of 2-Classes .......... 102
Figure 4.4 Comparing Error Rate of classifiers with 10-fold Cross-Validation in the case of 3-Classes .......... 103
Figure 4.5 Comparing Classifiers' Error Rate, 10-fold CV in the case of 3-Classes .......... 104
Figure 4.6 Graph of GA-Optimized CMC performance in the case of 2-Classes .......... 127
Figure 4.7 Graph of GA-Optimized CMC performance in the case of 3-Classes .......... 127
Figure 4.8 Graph of GA-Optimized CMC performance in the case of 9-Classes .......... 128
Figure 4.9 Chart comparing CMC performance, using GA and without GA .......... 129
Figure 4.10 LON-CAPA: BS111 SS03, grades distribution .......... 133
Figure 4.11 GA-Optimized Combination of Multiple Classifiers' (CMC) performance in the case of 2-Class labels (Passed and Failed) for BS111 2003, 200 weight-vector individuals, 500 generations .......... 136
Figure 5.1 Different approaches to clustering combination .......... 146
Figure 5.2 Taxonomy of different approaches to clustering combination .......... 147
Figure 5.3 First algorithms for clustering ensemble, based on co-association matrix and using different similarity-based consensus functions .......... 150
Figure 5.4 Algorithms for clustering ensemble based on categorical clustering .......... 152
Figure 5.5 Two possible decision boundaries for a 2-cluster data set.
Sampling probabilities of data points are indicated by gray level intensity at different iterations (t0 < t1 < t2) of the adaptive sampling. True components in the 2-class mixture are shown as circles and triangles .......... 160
Figure 5.6 Algorithms for adaptive clustering ensembles .......... 162
Figure 5.7 "Halfrings" data set with 400 patterns (100-300 per class), "2-Spirals" data set with 200 patterns (100-100 per class) .......... 163
Figure 5.8 "Iris" data set. Bootstrapping for fixed consensus function MCLA, different B, and different values of k .......... 167
Figure 5.9 "Halfrings" data set. Experiments using subsampling with k=10 and B=100, different consensus functions, and sample sizes S .......... 170
Figure 5.10 "Star/Galaxy" data set. Experiments using subsampling, with k=4 and B=50 and different consensus functions and sample sizes S .......... 171
Figure 5.11 Clustering accuracy for ensembles with adaptive and non-adaptive sampling mechanisms as a function of ensemble size for some data sets and selected consensus functions .......... 175
Figure 6.1 A contrast rule extracted from Table 6.1 .......... 179
Figure 6.2 A contrast rule extracted from Table 6.1 .......... 180
Figure 6.3 A contrast rule extracted from Table 6.1 .......... 180
Figure 6.4 A contrast rule extracted from Table 6.1 .......... 180
Figure 6.5 Set of all possible association rules for Table 6.3 .......... 186
Figure 6.6 Formal definition of a contrast rule .......... 186
Figure 6.7 Mining Contrast Rules (MCR) algorithm for discovering interesting candidate rules .......... 191
Figure 6.8 Attribute mining model: fixed student attributes, problem attributes, and linking attributes between students and problems .......... 193
Figure 6.9 Entity Relationship Diagram for a LON-CAPA course .......... 194

Chapter 1

Introduction

The ever-increasing progress of network-distributed computing and particularly the rapid expansion of the web have had a broad impact on society in a relatively short period of time. Education is on the brink of a new era based on these changes. Online delivery of educational instruction provides the opportunity to bring colleges and universities new energy, students, and revenues. Many leading educational institutions are working to establish an online teaching and learning presence. Several different approaches have been developed to deliver online education in an academic setting.
In particular, Michigan State University (MSU) has pioneered some of these systems which provide an infrastructure for online instruction (Multi-Media Physics; CAPA; LectureOnline; PhysNet; Kortemeyer and Bauer, 1999; Kashy et al., 1997; LON-CAPA). This study focuses on the data mining aspects of the latest online educational system developed at MSU, the Learning Online Network with Computer-Assisted Personalized Approach (LON-CAPA).

1.1 Statement of the problem

In LON-CAPA, we are involved with two kinds of large data sets: 1) educational resources such as web pages, demonstrations, simulations, and individualized problems designed for use on homework assignments, quizzes, and examinations; and 2) information about users who create, modify, assess, or use these resources. In other words, we have two ever-growing pools of data. As the resource pool grows, the information from students who have multiple transactions with these resources also increases. The LON-CAPA system logs any access to these resources as well as the sequence and frequency of access in relation to the successful completion of any assignment.

The web browser represents a remarkable enabling tool to get information to and from students. That information can be textual and illustrated, not unlike that presented in a textbook, but it can also include various simulations representing a modeling of phenomena, essentially experiments on the computer. Its greatest use, however, is in transmitting information as to the correct or incorrect solutions of various assigned exercises and problems. It also transmits guidance or hints related to the material, sometimes tailored to the particular submission by a student, and provides the means of communication with fellow students and teaching staff.

This study investigates data mining methods for extracting useful and interesting knowledge from the large database of students who are using LON-CAPA educational resources. This study aims to answer the following research questions:

• How can students be classified based on features extracted from logged data? Do groups of students exist who use these online resources in a similar way? Can we predict for any individual student which group they belong to? Can we use this information to help a student use the resources better, based on the usage of the resource by other students in their groups?

• How can the online problems that students engage in be classified? How do different types of problems impact students' achievements? Can the classifications of problems be employed to find patterns of questions that help student success?

• How can data mining help instructors, problem authors, and course coordinators better design online materials? Can we find sequences of online problems that students use to solve homework problems? Can we help instructors to develop their homework more effectively and efficiently? How can data mining help to detect anomalies in homework problems designed by instructors?
• How can data mining help find patterns of student behavior that groups of students take to solve their problems? Can we find some associative rules between students' educational activities? Can we help instructors predict the approaches that students will take for some types of problems?

• How can data mining be used to identify those students who are at risk, especially in very large classes? Can data mining help the instructor provide appropriate advising in a timely manner?

The goal of this research is to find similar patterns of use in the data gathered from LON-CAPA, and eventually be able to make predictions as to the most beneficial course of studies for each student based on a minimum number of variables for each student. Based on the current state of the student in their learning sequence, the system could then make suggestions as to how to proceed. Furthermore, through clustering of homework problems as well as the sequences that students take to solve those problems, we hope to help instructors design their curricula more effectively. As more and more students enter the online learning environment, databases concerning student access and study patterns will grow. We are going to develop such techniques in order to provide information that can be usefully applied by instructors to increase student learning.

This dissertation is organized as follows: The rest of this first chapter provides basic concepts of data mining and then presents a brief system overview of LON-CAPA that shows how the homework and student data are growing exponentially, while the current statistical measures for analyzing these data are insufficient. Chapter 2 introduces the research background: the important algorithms for data classification and some common clustering methods. Chapter 3 provides information about the structure of LON-CAPA data, the data retrieval process, representing the statistical information about students, problems, and solution strategies, and providing assessment tools in LON-CAPA to detect, to understand, and to address student difficulties. Chapter 4 explains the LON-CAPA experiment to classify students and predict their final grades based on features of their logged data. We design, implement, and evaluate a series of pattern classifiers with various parameters in order to compare their performance on a real dataset from the LON-CAPA system. Results of individual classifiers, and their combination, as well as error estimates, are presented. Since LON-CAPA data are distributed among several servers and distributed data mining requires efficient algorithms for multiple sources and features, chapter 5 presents a framework for clustering ensembles in order to provide an optimal framework for categorizing distributed web-based educational resources. Chapter 6 discusses the methods to find interesting association rules within the students' databases. We propose a framework for the discovery of interesting association rules
within a web-based educational system. Taken together and used within the online educational setting, the value of these tasks lies in improving student performance and the effective design of the online courses. Chapter 7 presents the conclusion of this dissertation and discusses the importance of future work.

1.2 Data Mining

Presently, the amount of data stored in databases is increasing at a tremendous speed. This gives rise to a need for new techniques and tools to aid humans in automatically and intelligently analyzing huge data sets to gather useful information. This growing need gives birth to a new research field called Knowledge Discovery in Databases (KDD) or Data Mining, which has attracted attention from researchers in many different fields including database design, statistics, pattern recognition, machine learning, and data visualization. In this chapter we give a definition of KDD and Data Mining, describing their tasks, methods, and applications. Our motivation in this study is gaining the best technique for extracting useful information from large amounts of data in an online educational system, in general, and from the LON-CAPA system, in particular. The goals for this study are: to obtain an optimal predictive model for students within such systems; to help students use the learning resources better, based on the usage of the resources by other students in their groups; to help instructors design their curricula more effectively; and to provide information that instructors can usefully apply to increase student learning.

1.2.1 What is Data Mining?

Data Mining is the process of analyzing data from different perspectives and summarizing the results as useful information. It has been defined as "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" (Frawley et al., 1992; Fayyad et al., 1996).

Figure 1.1 Steps of the KDD Process (Fayyad et al., 1996): from selection of target data, through preprocessing/cleansing, transformation, and data mining, to interpretation/evaluation of the discovered patterns as knowledge.

The process of data mining uses machine learning, statistics, and visualization techniques to discover and present knowledge in a form that is easily comprehensible. The word "Knowledge" in KDD refers to the discovery of patterns which are extracted from the processed data. A pattern is an expression describing facts in a subset of the data. Thus, the difference between KDD and data mining is that
presents the iterative nature of the KDD process. Here we outline some of its basic steps as are mentioned in Brachman & Anad (1996): Providing an understanding of the application domain, the goals of the system and its users, and the relevant prior background and prior knowledge (This step in not specified in this figure.) Selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed Preprocessing and data cleansing, removing the noise, collecting the necessary information for modeling, selecting methods for handling missing data fields, accounting for time sequence information and changes Data reduction and projection, finding appropriate features to represent data, using dimensionality reduction or transformation methods to reduce the number of variables to find invariant representations for data Choosing the data mining task depending on the goal of KDD: clustering, classification, regression, and so forth prg’lei—‘uz‘ . use 1* kn\\‘v\ it d These are the 1.2.2 Data M The obiecti‘n .ie 3:74:15 Predicti“ k. ‘ l.“ i rtmlktir‘t, 0 Selecting methods and algorithms to be used for searching for the patterns in the data 0 Mining the knowledge: searching for patterns of interest 0 Evaluating or interpreting the mined patterns, with a possible return to any previous steps 0 Using this knowledge for promoting the performance of the system and resolving any potential conflicts with previously held beliefs or extracted knowledge These are the steps that all KDD and data mining tasks progress through. 1.2.2 Data Mining Methods The objective of data mining is both prediction and description. That is, to predict unknown or future values of the attributes of interest using other attributes in the databases, while describing the data in a manner understandable and interpretable to humans. Predicting the sale amounts of a new product based on advertising expenditure, or predicting wind velocities as a function of temperature, humidity, air pressure, etc., are examples of tasks with a predictive goal in data mining. Describing the different terrain groupings that emerge in a sampling of satellite imagery is an example of a descriptive goal for a data mining task. The relative importance of description and prediction can vary between different applications. These two goals can be fulfilled by any of a number data mining tasks including: classification, regression, clustering, summarization, dependency modeling, and deviation detection. (Harrell and Frank, 2001; Montgomery et al., 2001) 1.2.3 Predictive tasks The following are general tasks that serve predictive data mining goals: Classification — to segregate items into several predefined classes. Given a collection of training samples, this type of task can be designed to find a model for class attributes as a function of the values of other attributes (Duda et al., 2001). Regression — to predict a value of a given continuously valued variable based on the values of other variables, assuming either a linear or nonlinear model of dependency. These tasks are studied in statistics and neural network fields (Montgomery et al., 200]). Deviation Detection —— to discover the most significant changes in data from previously measured or normative values (Aming et al., 1996; Fayyad et al., 1996). Explicit information outside the data, like integrity constraints or predefined patterns, is used for deviation detection. 
1.2.4 Descriptive tasks

Clustering — to identify a set of categories, or clusters, that describe the data (Jain & Dubes, 1988).

Summarization — to find a concise description for a subset of data. Tabulating the mean and standard deviation of every field is a simple example of summarization. There are more sophisticated techniques for summarization, and they are usually applied to facilitate automated report generation and interactive data analysis (Fayyad et al., 1996).

Dependency modeling — to find a model that describes significant dependencies between variables. For example, probabilistic dependency networks use conditional independence to specify the structural level of the model and probabilities or correlations to specify the strengths (quantitative level) of the dependencies (Heckerman, 1996).

1.2.5 Mixed tasks

There are some tasks in data mining that have both descriptive and predictive aspects. Using these tasks, we can move from basic descriptive tasks toward higher-order predictive tasks. Here, we indicate two of them:

Association Rule Discovery — Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that will predict the occurrence of an item based on patterns found in the data. A brief sketch of this idea follows below.

Sequential Pattern Discovery — Given a set of objects, where each object is associated with its own timeline of events, find rules that predict strong sequential dependencies among different events. Rules are formed by first discovering patterns; the event occurrences that follow them are governed by timing constraints found within those patterns.
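As an illustration of the association-rule task just described, the following minimal sketch computes the support and confidence of a single candidate rule over a toy set of transactions; the item names are hypothetical and are not taken from LON-CAPA data.

```python
# Toy transactions: each record is the set of "items" observed for one student.
transactions = [
    {"watched_video", "solved_problem"},
    {"watched_video", "solved_problem", "posted_question"},
    {"posted_question"},
    {"watched_video"},
    {"watched_video", "solved_problem"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

rule_lhs, rule_rhs = {"watched_video"}, {"solved_problem"}
print(support(rule_lhs | rule_rhs, transactions))       # 0.6
print(confidence(rule_lhs, rule_rhs, transactions))     # 0.75
```

A rule such as "watched_video implies solved_problem" would then be reported only if both values exceed user-chosen thresholds.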
So far we have briefly described the main concepts of data mining. Chapter two focuses on the methods and algorithms of data mining in the context of descriptive and predictive tasks. The research background of both association rule and sequential pattern mining, newer techniques in data mining that deserve a separate discussion, will be presented in chapter five.

Data mining does not take place in a vacuum. In other words, any application of this method of analysis is dependent upon the context in which it takes place. Therefore, it is necessary to know the environment in which we are going to use data mining methods. The next section provides a brief overview of the LON-CAPA system.

1.3 Online Education Systems

Several online education systems1, such as Blackboard, WebCT, Virtual University (VU), and other similar systems, have been developed to focus on course management issues. The objectives of these systems are to present courses and instructional programs through the web and other technologically enhanced media. These new technologies make it possible to offer instruction without the limitations of time and place found in traditional university programs. However, these systems tend to use existing materials and present them as a static package via the Internet.

1 See http://www.edutools.info for an overview of current web-based educational systems.

There is another approach, pursued in LON-CAPA: to construct more-or-less new courses using newer network technology. In this model of content creation, college faculty, K-12 teachers, and students interested in collaboration can access a database of hypermedia software modules that can be linked and combined (Kortemeyer and Bauer, 1999). The LON-CAPA system is the primary focus of this chapter.

1.3.1 LON-CAPA, System Overview

LON-CAPA is a distributed instructional management system which provides students with personalized problem sets, quizzes, and exams. Personalized (or individualized) homework means that each student sees a slightly different computer-generated problem. LON-CAPA provides students and instructors with immediate feedback on conceptual understanding and correctness of solutions. It also provides faculty the ability to augment their courses with individualized, relevant exercises, and to develop and share modular online resources. LON-CAPA aims to put this functionality on a homogeneously distributed platform for creating, sharing, and delivering course content, with emphasis on cross-institutional collaboration and intellectual property rights management.

1.3.2 LON-CAPA Topology

LON-CAPA is physically built as a geographically distributed network of constantly connected servers. Figure 1.2 shows an overview of this network. All machines in the network are connected with each other through two-way persistent TCP/IP connections. The network has two classes of servers: library servers and access servers. A library server can act as a home server that stores all personal records of users, and is responsible for the initial authentication of users when a session is opened on any server in the network. For authors, it also hosts their construction area and the authoritative copy of every resource that has been published by that author. An access server is a machine that hosts student sessions. Library servers can be used as backups to host sessions when all access servers in the network are overloaded.

Every user in LON-CAPA is a member of one domain. Domains could be defined by departmental or institutional boundaries like MSU, FSU, OHIOU, or the name of a publishing company. These domains can be used to limit the flow of personal user information across the network, set access privileges, and enforce royalty schemes. Thus, student and course data are distributed amongst several repositories. Each user in the system has one library server, which is his/her home server; it stores the authoritative copy of all of their records.
Figure 1.2 A schema of distributed data in LON-CAPA: library servers and access servers connected through the Internet and local networks, serving instructor, student, and special-services clients, with local copies replicated, updated, and re-routed among servers.

LON-CAPA currently runs on RedHat Linux Intel-compatible hardware. The current MSU production setup consists of several access servers and some library servers. All access servers are set up in a round-robin IP scheme as frontline machines, and are accessed by the students for "user sessions." The current implementation of LON-CAPA uses mod_perl inside the Apache web server software.

1.3.3 Data Distribution in LON-CAPA

Educational objects in LON-CAPA range from simple paragraphs of text, movies, and applets, to individualized homework problems. Online educational projects at MSU have produced extensive libraries of resources across disciplines. By combining these resources, LON-CAPA produces a national distributed digital library with mechanisms to store and retrieve these objects. Participants in LON-CAPA can publish their own objects in the common pool. LON-CAPA will allow groups of organizations (departments, universities, schools, commercial businesses) to link their online instructional resources in a common marketplace, thus creating an online economy for instructional resources (lon-capa.org). Internally, all resources are identified primarily by their URL.

LON-CAPA enables faculty to combine and sequence these learning objects at several levels. For example, an instructor from Community College A in Texas can compose a page by combining a text paragraph from University B in Detroit with a movie from College C in California and an online homework problem from Publisher D in New York. Another instructor from High School E in Canada might take that page from Community College A and combine it with other pages into a module, unit, or section. Those in turn can be combined into whole course packs.

1.3.4 Resource Variation in LON-CAPA

LON-CAPA provides three types of resources for organizing a course. LON-CAPA refers to these resources as Content Pages, Problems, and Maps. Maps may be either of two types: Sequences or Pages. LON-CAPA resources may be used to build the outline, or structure, for the presentation of the course to the students.

- A Content Page displays course content. It is essentially a conventional HTML page. These resources use the extension ".html".
- A Problem resource represents problems for the students to solve, with answers stored in the system. These resources are stored in files that must use the extension ".problem".
- A Page is a type of Map which is used to join other resources together into one HTML page. For example, a page of problems will appear as a problem set. These resources are stored in files that must use the extension ".page".
- A Sequence is a type of Map which is used to link other resources together. Sequences are stored in files that must use the extension ".sequence". Sequences can contain other sequences and pages.

Authors create these resources and publish them in library servers.
Then, instructors use these resources in online courses. The LON-CAPA system logs any access to these resources as well as the sequence and frequency of access in relation to the successful completion of any assignment. All these accesses are logged. l6 1.3.5 LON-C Internally. th home httpdl For example Figure 1.3 sit lBenteley d» {n 4" «ALca . ‘5 >' ‘ ‘ ~~ _. ~ X ‘.- ‘ -— .I"--- .. .i-r- . , - . '-’ ~- --- .“ -. - I... ".-u ‘ ... . -‘- -- - ‘ o ‘v . I -_.~ ' . ~ - x» - a -_. ' “ -- — - .- -I ~--- . u‘- - . .-.~‘ ..- -‘- ‘7 - . .a -_- " Q‘- '-' ‘ .. ‘-‘- n - o v ‘ 1 ‘-‘ ..- ~-‘ ":~- ‘ --. .- -l" ‘ _“- ."- 1.3.5 LON-CAPA Strategy for Data Storage Internally, the student data is stored in a directory: /home/httpd/lonUsers/domain/ 1 st.char/2nd.char/ 3 rd.char/usemame/ For example /home/httpd/lonUsers/msu/m/i/n/minaeibi/ Figure 1.3 shows a list of a student’s data. Files ending with .db are GDBM files (Berkeley database), while those with a course-ID as name, for example msu_12679c3ed54 3a25msul 1 .db, store performance data for that student in the course. ls —a1F /home/httpd/lonUsers/msu/m/i/n/minaeibi -rw-r--r-- 1 www users 13006 May 15 12:21 activity.log -rw—r ----- 1 www users 12413 Oct 26 2000 coursedescriptions.db -rw-r--r-- 1 www users 11361 Oct 26 2000 coursedescriptions.hist -rw-r ----- 1 www users 13576 Apr 19 17:45 critical.db -rw—r--r-- 1 www users 1302 Apr 19 17:45 critical.hist ~rw-r ----- 1 www users 13512 Apr 19 17:45 email_status.db -rw-r-—r-- 1 www users 1496 Apr 19 17:45 email_status.hist -rw-r--r-- 1 www users 12373 Apr 19 17:45 environment.db -rw-r--r-- 1 www users 169 Apr 19 17:45 environment.hist -rw-r ----- 1 www users 12315 Oct 25 2000 junk.db -rw—r—-r-- 1 www users 1590 Nov 4 1999 junk.hist -rw-r ----- 1 www users 23626 Apr 19 17:45 msu_12679c3ed543a25msu11.db —rw—r—-r—- 1 www users 3363 Apr 19 17:45 msu_12679c3ed543a25msu11.hist -rw-r ----- 1 www users 18497 Dec 21 11:25 msu_1827338c7d339b4msu11.db —rw—r-—r-- 1 www users 3801 Dec 21 11:25 msu_1827338c7d339b4msu11.hist -rw—r ————— 1 www users 12470 Apr 19 17:45 nohist_annotations.db -rw-r ----- 1 www users 765954 Apr 19 17:45 nohist_email.db -rw-r--r-- 1 www users 710631 Apr 19 17:45 nohist_email.hist -rw-r--r-- 1 www users 13 Apr 19 17:45 passwd -rw-r--r-- 1 www users 12802 May 3 13:08 roles.db —rw-r--r-- 1 www users 1316 Apr 12 16:05 roles.hist Figure 1.3 Directory listing of user’s home directory Courses are assigned to users, not vice versa. Internally, courses are handled like users without login privileges. The username is a unique ID, for example msu_12679c3ed543a25msull — every course in every semester has a unique ID, and there is no semester transition. The user-data of the course includes the full name of the course, a pointer to its top-level resource map (“course map”), and any associated deadlines, spreadsheets, etc., as well as a course enrollment list. The latter is somewhat l7 retardant since users. and lookin- - ~'— ' . ---- - on. -— -y . _- -- -- I. - I -. v _. ...... - I. .. r .- --’_- .- —._ -I ..... I. -. -I‘ ..... .- '1 -I” ..... ..- a..- I- .‘ -‘7- In. ‘% 7“ f‘,> . ‘-‘; '- ‘_ f \ 1\ xfi‘ s‘u‘ ‘ . 5 ‘ N ‘ - “‘C- h- H ‘7 I -‘ ‘N" _ ~. > redundant, since in principle, this list could be produced by going through the roles of all users, and looking for the valid role for a student in that course. 
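As an aside, the hashed per-user directory layout described in Section 1.3.5 is easy to reproduce. The following is a minimal sketch (a hypothetical helper written for illustration, not part of LON-CAPA's actual Perl code base) that maps a domain and username to the corresponding user-data directory.

```python
from pathlib import Path

def user_data_dir(domain: str, username: str, root: str = "/home/httpd/lonUsers") -> Path:
    """Build the per-user data directory following the
    root/domain/1st.char/2nd.char/3rd.char/username/ layout described above."""
    if len(username) < 3:
        raise ValueError("username is expected to have at least three characters")
    return Path(root) / domain / username[0] / username[1] / username[2] / username

# Example from the text: the user 'minaeibi' in the 'msu' domain.
print(user_data_dir("msu", "minaeibi"))
# -> /home/httpd/lonUsers/msu/m/i/n/minaeibi
```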
ls -alF /home/httpd/lonUsers/msu/1/2/6/12679c3ed543a25msu11/ -rw-r ----- 1 www users 17155 Apr 25 16:20 classlist.db —rw—r--r-- 1 www users 60912 Apr 25 16:20 classlist.hist —rw-r ----- 1 www users 12354 Jan 4 16:40 environment.db -rw—r-—r-- 1 www users 82 Jan 4 16:40 environment.hist —rw-r ----- 1 www users 103030 May 15 14:47 nohist_calculatedsheets.db —rw-r ----- 1 www users 13050 May 9 21:04 nohist_expirationdates.db -rw-r--r~- 1 www users 6 Jan 4 16:40 passwd -rw-r ----- 1 www users 17457 May 9 21:04 resourcedata.db -rw-r--r-- 1 www users 8888 May 9 21:04 resourcedata.hist Figure 1.4 Directory listing of course’s home directory An example of course data is shown in Figure 1.4. classlist is the list of students in the course, environment includes the course’s full name, etc, and resourceda ta are deadlines, etc. The parameters for homework problems are stored in these files. To identify a specific instance of a resource, LON-CAPA uses symbols or “symbs.” These identifiers are built from the URL of the map, the resource number of the resource in the map, and the URL of the resource itself. The latter is somewhat redundant, but might help if maps change. An example is msu/korte/parts/partl.sequence 19 msu/korte/tests/partlz.problem The respective map entry is l8 SymbS 3’ e U' specific to a cert: | ofthe system. 1.3.6 Resour 0118 01‘ 1158 1‘. intimation cont: tesorrce pool on pages. demtmxtm' time‘sork assign ' 1;“; \- s1...ts.its that c?" “I ectr. ‘ e In pIUIIit [01" - ~ «11115 311410 Cd Symbs are used by the random number generator, as well as to store and restore data specific to a certain instance of a problem. More details of the stored data and their exact structures will be explained in chapter three, when we will describe the data acquisition of the system. 1.3.6 Resource Evaluation in LON-CAPA One of the most challenging aspects of the system is to provide instructors with information concerning the quality and effectiveness of the various materials in the resource pool on student understanding of concepts. These materials can include web pages, demonstrations, simulations, and individualized problems designed for use on homework assignments, quizzes, and examinations. The system generates a range of statistics that can be useful in evaluating the degree to which individual problems are effective in promoting formative learning for students. For example, each exam problem contains attached metadata that catalog its degree of difficulty and discrimination for students at different phases in their education (i.e., introductory college courses, advanced college courses, and so on). To evaluate resource pool materials, a standardized format is required so that materials from different sources can be compared. This helps resource users to select the most effective materials available. LON-CAPA has also provided questionnaires which are completed by faculty and students who use the educational materials to assess the quality and efficacy of resources. In addition to providing the questionnaires and using the statistical reports, we investigate here methods to find criteria for classifying students and grouping problems by examining logged data such as: time spent on a particular resource, resources visited 19 (other web page sutzsticallyl and gathered and sort and choose hem: ethire demonstrs resoLHCC p001 git stscents. 
There has bee the such method CAPA cats anal} \t 5»...ch01118“ o ‘ .i 1 The following [31‘s 1 application WW is not to dc of u" ‘ ‘1 AliseiiiDA ,.nt tuton (other web pages), due date for each homework, the difficulty of problems (observed statistically) and others. Thus, floods of data on individual usage patterns need to be gathered and sorted — especially as students go through multiple steps to solve problems, and choose between multiple representations of the same educational objects like video lecture demonstrations, a derivation, a worked example, case-studies, and etc. As the resource pool grows, multiple content representations will be available for use by students. There has been an increasing demand for automated methods of resource evaluation. One such method is data mining, which is the focus of this research. Since the LON- CAPA data analyses are specific to the field of education, it is important to recognize the general context of using artificial intelligence in education. The following section presents a brief review of intelligent tutoring systems — one typical application of artificial intelligence in the field of education. Note that herein the purpose is not to develop an intelligent tutoring system; instead we apply the main ideas of intelligent tutoring systems in an online environment, and implement data mining methods to improve the performance of the educational web-based system, LON-CAPA. 1.4 Intelligent Tutoring Systems (lTSs) Intelligent tutoring systems are computer-based instructional systems that attempt to determine information about a student’s learning status, and use that information to dynamically adapt the instruction to fit the student’s needs. Examples of educational researchers who have investigated this area of inquiry are numerous: Urban—Lurain, 1996; Petrushin, 1995; Benyon and Murray, 1993; Winkkels, 1992; Farr and Psotka, 20 1992'. Yemeni." 3‘ Gauthier. 1953." l breed rotors. bi" houledge. The stetegies speerty One of the f. tBitnnt 195m tsdztitiualtzed in: 13.1 style of instr 131nm 19st). 1 least) through r, hi‘t‘idualtzed \K: For many years. 1992; Venezky and Osin, 1991; Larkin and Cabay, 1991; Goodyear, 1990; Frasson and Gauthier, 1988; Wenger, 1987; Yazdani, 1987. ITSs are often known as knowledge- based tutors, because they have separate knowledge bases for different domain knowledge. The knowledge bases specify what to teach and different instructional strategies specify how to teach (Murray, 1996). One of the fundamental assumptions in ITS design is from an important experiment (Bloom, 1956) in learning theory and cognitive psychology, which states that individualized instruction is far superior to class-room style learning. Both the content and style of instruction can be continuously adapted to best meet the needs of a student (Bloom, 1984). Educational psychologists report that students learn best “by doing”, learn through their mistakes, and learn by constructing knowledge in a very individualized way (Kafaei and Resnik, 1996; Ginsburg and Opper, 1979; Bruner, 1966). For many years, researchers have argued that individualized learning offers the most effective and cognitively efficient learning for most students (Juel, 1996; Woolf, 1987). Intelligent tutoring systems epitomize the principle of individualized instruction. Previous studies have found that intelligent tutoring systems can be highly effective Ieaming aids (Shute and Regine, 1990). 
Shute (1991) evaluates several intelligent tutoring systems to judge how they live up to the main promise of providing more effective and efficient learning in relation to traditional instructional techniques. Results of such studies show that ITSs do accelerate learning.

1.4.1 Learning Enhancement in ITSs

In one study, Bloom (1984) states that conventional teaching methods provide the least effective method of learning. As instruction becomes more focused and individualized, learning is enhanced. He compares students' scores on achievement tests using three forms of instruction: conventional teaching, mastery teaching, and individualized tutoring. Mastery teaching is an instructional technique whereby a teacher supplements a lecture with diagnostic tests to determine where students are having problems, and adjusts the instruction accordingly. The results of this comparison are shown in Figure 1.5. Students receiving conventional teaching scored in the 50th percentile, students receiving mastery teaching scored in the 84th percentile, while students receiving individualized tutoring scored in the 98th percentile.

Figure 1.5 Distributions for different learning conditions (adapted from Bloom, 1984): number of students versus percentile of summative achievement scores, with the individualized tutoring (1:1), mastery learning (1:30), and conventional teaching (1:30) distributions centered near the 98th, 84th, and 50th percentiles respectively.

Bloom replicates these results four times with three different age groups for two different domains, and thus provides concrete evidence that tutoring is one of the most effective educational delivery methods available.

Since ITSs attempt to provide more effective learning through individualized instruction, many computer-assisted instruction techniques exist that can present instruction and interact with students in a tutor-like fashion, individually or in small groups. The incorporation of artificial intelligence techniques and expert systems technology into computer-assisted instruction systems gave rise to intelligent tutoring systems, i.e., systems that model the learner's understanding of a topic and adapt the instruction accordingly. A few examples of systematically controlled evaluations of ITSs reported in the literature are shown in Table 1.1.

Table 1.1 Different specific ITSs and their effects on learning rate

ITS            | Literature               | Objective                        | Progress
LISP Tutor     | (Anderson, 1990)         | Teach LISP programming           | 1/3-2/3 less time
Smithtown      | (Shute and Glaser, 1990) | Teach scientific inquiry skills  | 1/2 time, same knowledge
Sherlock       | (Lesgold et al., 1990)   | Avionics troubleshooting         | 1/5 time, same knowledge
Pascal ITS     | (Shute, 1991)            | Teach Pascal programming         | 1/3 time, same knowledge
Stat Lady      | (Shute et al., 1993)     | Instruct statistical procedures  | More performance
Geometry Tutor | (Anderson et al., 1985)  | Teach geometry theorems          | Better solving
Shute and Psotka (1996) examine the results of these evaluations, which show that the tutors do accelerate learning with no degradation in outcome performance. The tutors should be evaluated with respect to the promises of ITSs: speed and effectiveness. In all cases, individuals using ITSs learned faster and performed at least as well as those learning from traditional teaching environments. The results show that these individualized tutors could not only reduce the variance of outcome scores, but also increase the mean outcome dramatically.

1.4.2 Basic Architecture of an ITS

There is no standard architecture for an ITS. Nevertheless, four components emerge from the literature as parts of an ITS (Wasson, 1997; Costa, 1992; Polson and Richardson, 1988; Yazdani, 1987; Wenger, 1987; Sleeman and Brown, 1982). These are the student model, the pedagogical module, the expert model, and the communication module or interface. These four components and their interactions are illustrated in Figure 1.6.

Figure 1.6 Components of an Intelligent Tutoring System (ITS): the expert model, student model, pedagogical module, and communication module (interface), with the arrows indicating their interactions.

The student model stores information about each individual learner. For example, such a model tracks how well a student is performing on the material being taught or records incorrect responses that may indicate misconceptions. Since the purpose of the student model is to provide data for the pedagogical module of the system, all of the information gathered should be usable by the tutor.

The pedagogical module provides a model of the teaching process. For example, information about when to review, when to present a new topic, and which topic to present is controlled by this module. As mentioned earlier, the student model is used as input to this component, so the pedagogical decisions reflect the differing needs of each student.

The expert model contains the domain knowledge, which is the information being taught to the learner. However, it is more than just a representation of the data; it is a model of how someone skilled in a particular domain represents the knowledge. By using an expert model, the tutor can compare the learner's solution to the expert's solution, pinpointing the places where the learner has difficulties. This component contains the information the tutor is teaching, and is the most important, since without it there would be nothing to teach the student. Generally, this aspect of an ITS requires significant knowledge engineering to represent a domain so that other parts of the tutor can access it.

The communication module controls interactions with a student, including the dialogue and the screen layouts. For example, it determines how the material should be presented to the student in the most effective way.
.14 “in ‘ShJ’ ‘gzi‘r‘ These four components — the student model, the pedagogical module, the expert model, and the communication module — shared by all ITSs, interact to provide the individualized educational experience promised by technology. The orientation or structure of each of these modules, however, varies in form depending on the particular ITS. Current research trends focus on making tutoring systems truly “intelligent,” in the artificial sense of the word. The evolution of ITSs demands more controlled research in four areas of intelligence: the domain expert, student model, tutor, and interface. 0 The domain knowledge must be understood by the computer well enough for the expert model to draw inferences or solve problems in the domain. 0 The system must be able to deduce a student’s approximation of that knowledge. 0 The tutor must be intelligent to the point where it can reduce differences between the expert and student performance. 0 The interface must possess intelligence in order to determine the most effective way to present information to the student. For ITSs to have a great impact on education, these and other issues must be resolved. To take advantage of newer, more effective instructional techniques, ITSs of the future will have to allow for increased student initiative and inter-student collaboration (Shute and Psotka, 1996). ITSs must also assess Ieaming as it transfers to authentic tasks, not standardized tests, and establish connections across fields so that topics are not learned in isolation. A more fruitful approach for ITS development may be to develop specific cognitive tools, for a given domain or applicable across domains. 26 Such a transition explain. cntique. ; 1.4.3 Learning‘ There are 501‘; he exeloptnent a he may tots ants in et al. le'tiftllt. Lear} | ‘ i It???“ 4' v " . -L...ull".§ 15 Ex trderstand am new One key findxr 5’? knows“ LEG b “a; ., .ay-o *nhxt ‘.3 ml that kn 3171:!“ ,3 ,, ~ ‘4 mdt ltmm u. mister litera‘ C. NT 3" . N63") \ a c «r Irdn\lt\'l ‘ i :56 [O in l » 4:91} their WW}! For exam @2511 0 ‘ a C(JDCC mi ‘~~'~.5."""'1 ”(\JLJD . llqys , W (1999): ‘ an ‘ w‘ ‘7‘. u “'.‘ Ofr i"? H “A; - ‘9. ‘12 '1! Such a transition would allow future ITSs to be everywhere, as embedded assistants that explain, critique, provide online support, coach, and perform other ITS activities. 1.4.3 Learning and Cognition Issues for ITS Development and Use There are some findings in the areas of cognition and Ieaming processes that impact the development and use of intelligent tutoring systems. Many recent findings are paving the way towards improving our understanding of learners and learning (Bransford, Brown et al., 2000). Learners have preconceptions about how the world works. If their initial understanding is not referenced or activated during the Ieaming process, they may fail to understand any new concepts or information. One key finding regarding competence in a domain is the need to have a more than a deep knowledge base of information related to that domain. One must also be able to understand that knowledge within the context of a conceptual framework — the ability to organize that knowledge in a manner that facilitates its use. A key finding in the Ieaming and transfer literature is that organizing information into a conceptual framework allows for greater transfer of knowledge. By developing a conceptual framework, students are able to apply their knowledge in new situations and to learn related information more quickly. 
For example, a student who has learned problem solving for one topic in the context of a conceptual framework will use that ability to guide the acquisition, of new information for a different topic within the same framework. This fact is explained by Hume (1999): "When we have lived any time, and have been accustomed to the uniformity of nature, we acquire a general habit, by which we always transfer the known to the unknown, and conceive the latter to resemble the former.” 27 A relations}. stations. Trans: and what is tests. must eoneeise of Recent TCSL‘s‘.‘ lemeen tasks is tSmgley and An. see i text editor all ~ - ' .u.lUIS more EPIC eaters predicted I} aertss editors that 5" e‘ ‘ ..mres. Singlex EILTMR“ w l\l. \lkal Ck’n‘l A relationship exists between the learning and transfer of knowledge to new situations. Transferring is usually a function of the relationships between what is learned and what is tested. For students to transfer knowledge successfully across domains, they must conceive of their knowledge base in terms of continuous, rather than discrete steps. Recent research by Singley and Anderson indicates that the transfer of knowledge between tasks is a firnction of the degree to which the tasks share cognitive elements (Singley and Anderson, 1989). In their study, Singley and Anderson taught students several text editors, one after the other. They found that students learned subsequent text editors more rapidly and that the number of procedural elements shared by the two text editors predicted the amount of transfer. Their results showed that there was large transfer across editors that were very different in surface structures but had common abstract structures. Singley and Anderson were able to generate similar results for the transfer of mathematical competence across multiple domains. Emerging computer-based technologies hold great promise as a means of supporting and promoting Ieaming. There are several ways that such technology can be used to help meet the challenges of developing effective learning environments (El-Sheikh, 2001): o Bringing real-world problems to the Ieaming environment. 0 Providing “scaffolding” support to students during the Ieaming process. Scaffolding allows students to participate in complex cognitive experiences, such as model-based learning, that is more difficult without technical support. 0 Increasing opportunities for learners to receive feedback and guidance from sofiware tutors and learning environments. 28 0 But. and- 0 Exp. Learning e understanding '0 tiese new learn. can facilitate le.. hing these new ' 1.5 Summa This researef mol'lled'gT llt‘tm j The purpose is to ’- flied by ll‘htr‘g 0 Building local and global communities of teachers, administrators, students, and other interested learners. 0 Expanding opportunities for teachers’ learning. Learning environments need to be developed and implemented with a full understanding of the principles of learning and deveIOpmental psychology. In addition, these new learning environments need to be assessed carefully, including how their use can facilitate Ieaming, as well as the cognitive, social, and learning consequences of using these new tools. 1.5 Summary This research addresses data mining methods for extracting useful and interesting knowledge from the large data sets of students using LON-CAPA educational resources. 
The purpose is to develop techniques that will provide information that can be usefully applied by instructors to increase student learning, detect anomalies in homework problems, design curricula more effectively, predict the approaches that students will take for some types of problems, and provide appropriate advising for students in a timely manner. This introductory chapter provided an overview of the LON-CAPA system, the context in which we are going to use data mining methods. In addition, a brief introduction to Intelligent Tutoring Systems provided examples of expert systems and artificial intelligence in educational software. Following this, it is necessary to analyze the data mining methods that can be applied within this context in greater detail.

Chapter 2

Background on Data Mining Methods

In the previous chapter we described the basic concepts of data mining. This chapter focuses on methods and algorithms in the context of the descriptive and predictive tasks of data mining. We describe clustering methods in data mining, and follow this with a study of the classification methods developed in related research while extending them for predictive purposes. The research background for both association rule and sequential pattern mining will be presented in chapter five.

2.1 Classification and Predictive Modeling

Classification is used to find a model that segregates data into predefined classes; classification is thus based on the features present in the data. The result is a description of the present data and a better understanding of each class in the database. Classification therefore provides a model for describing future data (Duda et al., 2001; McLachlan, 1992; Weiss and Kulikowski, 1991; Hand, 1987). Prediction helps users make decisions. Predictive modeling for knowledge discovery in databases predicts unknown or future values of some attributes of interest based on the values of other attributes in a database (Masand and Shapiro, 1996). Different methodologies have been used for classification and for developing predictive models, including Bayesian inference (Kontkanen et al., 1996), neural network approaches (Lange, 1996), decision tree-based methods (Quinlan, 1986), and genetic algorithm-based approaches (Punch et al., 1995).

2.1.1 Bayesian Classifier

One of the major statistical methods in data mining is Bayesian inference. The naive Bayesian classifier provides a simple and effective approach to classifier learning. It assumes that all class-conditional probability densities are completely specified. Even though this assumption is often violated in real-world data sets, a naive Bayesian classifier (in which a small number of parameters are not completely specified) can still be employed (Jain et al., 2000; Duda et al., 2001; Wu et al., 1991). The Bayes classifier shown in Figure 2.1 can be explained as follows. A set of patterns a_j, j = 1, ..., n, is given, and every pattern is sensed by a sensing device which is capable of capturing its features. Each pattern is described by a measurement vector x_j. A pattern a_j belongs to a classification set which includes all the possible classes that can be assigned to pattern a_j.
For the sake of simplicity, all feature measurements are considered identical, and each pattern belongs to only one of the m possible classes ω_i, i = 1, ..., m.

Figure 2.1 The Bayesian Classification Process (adapted from Wu et al., 1991): feature extraction produces a pattern x_j, the likelihood functions g_1, ..., g_m are evaluated, and a maximum selector makes the final decision.

To classify a pattern into one of the m classes, a feature space is constructed according to the measurement vector x, which is considered to be a measurement of true values damaged by random noise. The class-conditional probability density functions estimated from training data represent the uncertainty in the discovered knowledge:

$p(\mathbf{x} \mid \omega_i), \quad i = 1, \ldots, m$    (2.1)

Bayes decision theory states that the a posteriori probability of an event may be calculated according to the following equation:

$p(\omega_i \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid \omega_i)\, p(\omega_i)}{p(\mathbf{x})}, \quad i = 1, \ldots, m$    (2.2)

Eventually, the decision criteria can be applied for classification. To obtain the optimal solution, the maximum likelihood classification or the Bayesian minimum-error decision rule is applied; it is obtained by minimizing the misclassifications and errors in classification. Thus, a pattern is classified into the class ω_i with the highest posterior probability or likelihood:

$g_i = \max_j \{\, g_j \,\}, \quad j = 1, \ldots, m$    (2.3)

The quadratic discriminant function using the Bayesian approach is the most common method in supervised parametric classifiers. If the feature vectors are assumed to be Gaussian in distribution, the parameters of the Gaussians are estimated using maximum likelihood estimation. The discriminant function decision rule and the a posteriori probabilities for each classification are calculated for each test sample x using the following equation (Duda et al., 2001):

$g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x} - \mu_i)^{T}\, \Sigma_i^{-1}\, (\mathbf{x} - \mu_i) - \tfrac{1}{2}\ln\lvert\Sigma_i\rvert + \ln p(\omega_i)$    (2.4)

where x is a d×1 vector representing any point in the d-dimensional feature space, μ_i (also a d×1 vector) is the sample mean of the ith class training sample, and Σ_i (d×d) is the sample covariance matrix of the ith class training sample. To obtain the optimal solution, the maximum likelihood classification or the Bayesian minimum-error decision rule is applied. The sample is then assigned to the class that produces the highest a posteriori probability; this is obtained by minimizing the misclassifications and errors in classification.
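To make the decision rule of Eqs. (2.2)-(2.4) concrete, the following is a minimal NumPy sketch, for illustration only, of the quadratic discriminant g_i(x) and the resulting class assignment; the class means, covariances, and priors are hand-picked here and would normally be estimated from training data.

```python
import numpy as np

def quadratic_discriminant(x, mean, cov, prior):
    """g_i(x) from Eq. (2.4): Gaussian log-likelihood plus log prior (constant terms dropped)."""
    diff = x - mean
    return (-0.5 * diff @ np.linalg.inv(cov) @ diff
            - 0.5 * np.log(np.linalg.det(cov))
            + np.log(prior))

def bayes_classify(x, means, covs, priors):
    """Assign x to the class with the largest discriminant value (Eq. 2.3)."""
    scores = [quadratic_discriminant(x, m, c, p)
              for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(scores))

# Hypothetical two-class example.
means  = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs   = [np.eye(2), 2.0 * np.eye(2)]
priors = [0.6, 0.4]
print(bayes_classify(np.array([2.5, 2.8]), means, covs, priors))  # -> 1
```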
2.1.2 Decision tree-based methods

Decision tree-based methods are popular methods for use in a data mining context. The decision tree classifier uses a hierarchical, or layered, approach to classification. Each vertex in the tree represents a single test or decision. The outgoing edges of a vertex correspond to all possible outcomes of the test at that vertex. These outcomes partition the set of data into several subsets, which are identified by every leaf in the tree. A leaf of the tree specifies the expected value of the categorical attribute for the records described by the path from the root to that leaf. Learned trees can also be represented as sets of if-then-else rules (Mitchell, 1997).

An instance is classified by starting at the root node of the tree. At each level of the tree the attributes of an instance are matched to a number of mutually exclusive nodes. The leaf nodes assign an instance to a class. The classification of an instance therefore involves a sequence of tests where each successive test narrows the interpretation. The sequence of tests for the classifier is determined during a training period. Given some new data, the ideal solution would test all possible sequences of actions on the attributes of the new data in order to find the sequence resulting in the minimum number of misclassifications.

Tree-based classifiers have an important role in pattern recognition research because they are particularly useful with non-metric data (Duda et al., 2001). Decision tree methods are robust to errors, including both errors in classifying the training examples and errors in the attribute values that describe these examples. Decision trees can also be used when the data contain missing attribute values (Mitchell, 1997). Most algorithms that have been developed for decision trees are based on a core algorithm that uses a top-down, recursive, greedy search on the space of all possible decision trees. This approach is implemented by the ID3 algorithm2 (Quinlan, 1986) and its successor C4.5 (Quinlan, 1993). C4.5 is an extension of ID3 that accounts for unavailable values, continuous attribute value ranges, pruning of decision trees, and rule derivation. The rest of this section discusses some important issues in decision tree classifiers.

2 ID3 got this name because it was the third version of the "interactive dichotomizer" procedure.

2.1.2.1 What is the best feature for splitting?

The first question that arises in all tree-based algorithms concerns which property should be tested at each node. In other words, which attribute is the "most informative" for the classifier? We would like to select the attribute that is the most informative of the attributes not yet considered in the path from the root; this establishes what a "good" decision tree is. Entropy is used to measure a node's information. Claude Shannon (1948) introduced this notion in information theory. Based on entropy, a statistical property called information gain measures how well a given attribute separates the training examples in relation to their target classes.

2.1.2.1.1 Entropy impurity

Entropy characterizes the impurity of an arbitrary collection of examples S at a specific node N. Sometimes (Duda et al., 2001) the impurity of a node N is denoted by i(N):

$\mathrm{Entropy}(S) \equiv i(N) = -\sum_{j} P(\omega_j)\, \log_2 P(\omega_j)$    (2.5)

where P(ω_j) is the fraction of examples at node N that belong to category ω_j. If all the patterns are from the same category, the impurity is 0; otherwise it is positive. If all categories are equally distributed at node N, the impurity takes its greatest value (equal to 1 in the two-category case).
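As a small illustration of Eq. (2.5), the following sketch computes the entropy impurity of a node from the class counts observed there; the counts are made up for the example.

```python
from math import log2

def entropy_impurity(class_counts):
    """i(N) = -sum_j P(w_j) * log2 P(w_j), with 0*log2(0) treated as 0."""
    total = sum(class_counts)
    probs = (c / total for c in class_counts if c > 0)
    return -sum(p * log2(p) for p in probs)

print(entropy_impurity([10, 0]))   # pure node          -> 0.0
print(entropy_impurity([5, 5]))    # two equal classes  -> 1.0
print(entropy_impurity([8, 2]))    # mixed node         -> about 0.722
```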
The key question then is: on the decision path from the root to node N, which features and feature values should we select for the test at node N when property query T is used? A heuristic is suggested that selects the query decreasing the impurity as much as possible (Duda et al., 2001):

$\Delta i(N) = i(N) - P_L\, i(N_L) - (1 - P_L)\, i(N_R)$    (2.6)

where N_L and N_R are the left and right descendent nodes, i(N_L) and i(N_R) are their impurities respectively, and P_L is the fraction of patterns at node N that will go to N_L when the property query T is used. The goal of the heuristic is to maximize Δi; thus minimizing the impurities corresponds to an information gain provided by the query.

2.1.2.1.2 Gini impurity

One can rank and order each splitting rule on the basis of the quality-of-split criterion. Gini is the default rule in CART because it is often the most efficient splitting rule. Essentially, it measures how well the splitting rule separates the classes contained in the parent node (Duda et al., 2001):

$i(N) = \sum_{i \neq j} P(\omega_i)\, P(\omega_j) = 1 - \sum_{j} P^{2}(\omega_j)$    (2.7)

As the equation shows, this criterion is strongly peaked when the class probabilities are equal. So what is Gini trying to do? Gini attempts to separate classes by focusing on one class at a time. It will always favor working on the largest or, if you use costs or weights, the most "important" class in a node.

2.1.2.1.3 Twoing impurity

An alternative criterion also available in CART is Twoing impurity. The philosophy of Twoing is different from that of Gini. Rather than initially pulling out a single class, Twoing first segments the classes into two groups, attempting to find groups that together add up to 50 percent of the data. Twoing then searches for a split to separate the two subgroups (Duda et al., 2001). This is an ideal split. It is unlikely that any real-world database would allow you to cleanly separate four important classes into two subgroups in this way. However, splits that approach this ideal might be possible, and these are the splits that Twoing seeks to find.
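To illustrate the impurity drop of Eq. (2.6), the following sketch, building on the entropy_impurity function above, scores a candidate binary split of a node either by the Gini index of Eq. (2.7) or by entropy (information gain); the node counts and the split are hypothetical.

```python
def gini_impurity(class_counts):
    """i(N) = 1 - sum_j P(w_j)^2 (Eq. 2.7)."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def impurity_drop(parent, left, right, impurity=gini_impurity):
    """Delta i(N) = i(N) - P_L * i(N_L) - (1 - P_L) * i(N_R) (Eq. 2.6)."""
    n_left, n_right = sum(left), sum(right)
    p_left = n_left / (n_left + n_right)
    return impurity(parent) - p_left * impurity(left) - (1 - p_left) * impurity(right)

# A node with 10 examples of class A and 10 of class B,
# and a candidate query T sending [8, 2] to the left child and [2, 8] to the right.
parent, left, right = [10, 10], [8, 2], [2, 8]
print(impurity_drop(parent, left, right))                             # Gini drop: 0.18
print(impurity_drop(parent, left, right, impurity=entropy_impurity))  # information gain: ~0.278
```

In a full tree-induction loop, every candidate query would be scored this way and the query with the largest drop chosen for the node.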
2.1.2.2 How to Avoid Overfitting

If we continue to grow the tree until each leaf node corresponds to the lowest impurity, then the data are typically overfit. In some cases every leaf node corresponds to a single training input. In such cases we cannot expect appropriate generalization in noisy problems having high Bayes error (Duda et al., 2001; Russell and Norvig, 1997). On the other hand, if we stop splitting early, then a high-performance tree classifier will not be achieved. There are several approaches to avoiding overfitting in the training phase of tree-based classification.

2.1.2.2.1 Cross-Validation

Cross-validation is a technique to eliminate the occurrence of overfitting. The main idea of cross-validation is to estimate how well the current hypothesis will predict unseen data (Duda et al., 2001; Russell and Norvig, 1997). This is done by randomly dividing the data into two subsets, training and test. Usually, the test subset is a fraction of all of the data, e.g., 10%. The hypothesis induced from the training phase is tested on the rest of the data to get the prediction performance. This should be repeated on different subsets of data, and then the results averaged. Cross-validation should be used in order to select a tree with good prediction performance; a brief sketch of the procedure is given at the end of this section.

2.1.2.2.2 Setting a threshold

Another method for overfitting avoidance is to consider a small threshold value in minimizing the impurity. We stop splitting when the impurity at a node is reduced by less than the considered threshold. The benefit of this method over cross-validation is that the tree is trained using all the training data. Another benefit of this method is that leaf nodes can lie at different levels of the tree.

2.1.2.2.3 Pruning

The principal alternative to stop-splitting is pruning (Duda et al., 2001). One approach, called reduced-error pruning (Quinlan, 1987), considers each node in the decision tree to be a candidate for pruning. "Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node. Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set." (Mitchell, 1997)

In C4.5, Quinlan (1993) applied a successful technique for finding high-accuracy hypotheses during the pruning process, which is called rule post-pruning. It involves the following steps:

1. Induce the decision tree from the training set, growing the tree until the training data are fitted as well as possible, allowing overfitting to occur.
2. Convert the learned tree into an equivalent set of rules by creating a rule corresponding to each path from the root to a leaf node.
3. Prune each rule by deleting any preconditions whose removal improves the estimated accuracy.
4. Sort the pruned rules by their estimated accuracy, and apply them in that sequence when classifying instances.
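As promised in Section 2.1.2.2.1, the following is a minimal sketch of k-fold cross-validation for estimating prediction performance. The train_tree and accuracy callables stand in for any tree-induction and evaluation routines; they are assumptions for the example and not part of this work's actual code.

```python
import random

def k_fold_cross_validation(records, labels, train_tree, accuracy, k=10, seed=0):
    """Estimate prediction performance by averaging accuracy over k held-out folds."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held_out_set = set(held_out)
        train_idx = [i for i in indices if i not in held_out_set]
        model = train_tree([records[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        scores.append(accuracy(model,
                               [records[i] for i in held_out],
                               [labels[i] for i in held_out]))
    return sum(scores) / len(scores)
```

The same routine can be used to compare candidate trees (for example, trees pruned to different degrees) and to select the one with the best averaged score.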