FUSION-BASED VIDEO SEGMENTATION AND SUMMARIZATION

By

John K. Dixon

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Computer Science & Engineering

2003

ABSTRACT

FUSION-BASED VIDEO SEGMENTATION AND SUMMARIZATION

By John K. Dixon

This thesis examines the problem of video segmentation and summarization from a results fusion perspective. Many techniques have been developed for the segmentation and summarization of digital video. The variety of methods is partially due to the fact that different methods work better on different classes of content. Global histogram-based segmentation works best on color video with clean cuts and global intensity changes; local histogram-based segmentation is less sensitive to region changes in the video and therefore works better when scenes consisting of similar content are shot from different angles; DCT-based segmentation algorithms are less sensitive to abrupt intensity changes due to lighting effects such as camera flashes; edge-based segmentation algorithms work well when high-quality edge information can be extracted from the video sequence; motion-based summarization works best on video with moving cameras and a minimum of disjoint motion. Results fusion combines the properties of these varying algorithms into a common framework that can benefit from the advantages of each disparate approach. Recognizing that there is no single best solution for each of these problems has led to this work in integrating the variety of existing algorithms using results fusion methods.

The work is divided into four parts. The thesis begins with an in-depth study of the various video segmentation methods. This chapter categorizes the existing shot segmentation and summarization methods, noting their strengths and weaknesses. Next, results fusion based algorithms and implementations from a variety of fields are reviewed and studied so as to understand the methods that can be applied to video segmentation and summarization. This chapter examines results fusion research from the document retrieval and biometric communities with an eye towards application to the video domain. The third part of this work presents the results of applying results fusion for video segmentation. This section compares and contrasts individual algorithms with the results fusion implementations. Finally, it is demonstrated that the results fusion methodology used for video segmentation can be extended to video summarization.

Thesis Supervisor: Dr. Charles B. Owen, Professor, Michigan State University
This research was supported in part by The MITRE Corporation.

To my mother Faye M. Dixon and brother Ian J. Dixon

ACKNOWLEDGMENTS

To my mother, Faye M. Dixon, thank you for all the support and love that you have shown me throughout my life. You have been a blessing and an angel from heaven. I could not have completed this thesis without you. There have not been words created to express the love and appreciation I have for you.

To my aunt, Dr. Brenda McClain, thank you for all the encouragement and financial support throughout my life. When I thought there was no way, you always provided one. You always have been an inspiration. I could not have finished this thesis without your love and generosity.

To my advisor, Dr. Charles Owen, thank you for being the best advisor a student could possibly have. You are the consummate professional and mentor. Thank you for your untiring support and encouragement in getting this thesis completed. I look forward to continuing to work with you in the future.

TABLE OF CONTENTS

LIST OF FIGURES

1 Introduction
1.1 Questions addressed by this thesis
1.1.1 Shot Segmentation
1.1.2 Summarization
1.1.3 Results Fusion
1.2 Contributions of this thesis
1.3 Definitions
1.4 Structure of this thesis

2 Related Work
2.1 Video Systems
2.2 Segmentation
2.2.1 Uncompressed domain techniques
2.2.2 Compressed Techniques
2.3 Summarization
2.3.1 Still-Image Representation
2.3.2 Moving Image Representation
2.4 Other Techniques
2.5 Results Fusion
2.5.1 Sum Rule
2.5.2 Voting Strategies
2.5.3 Probabilistic Strategies
2.5.4 Machine Learning and Data Mining
2.6 Summary

3 Shot Detection Methods
3.0.1 Gradual Shot Boundary Detection Issues
3.1 Shot Boundaries
3.1.1 Global Color Histogram
3.1.2 Region-based or Local Histograms
3.1.3 Edge Features
3.1.4 DCT Coefficients
3.2 Summary

4 Results Fusion
4.1 Levels of Results Fusion
4.1.1 Abstract Level Fusion
4.1.2 Measurement Level Fusion
4.1.3 Feature Extraction Level Fusion
4.1.4 Output Level Fusion
4.1.5 Fusion Level Comparisons
4.2 Results Fusion for Video Shot Segmentation
4.2.1 Support Vector Machines
4.2.2 Decision Trees
4.2.3 Rulesets
4.2.4 Neural Networks
4.3 Features
4.4 Results Fusion Shot Segmentation
4.5 Baseline Testing Methods
4.5.1 Boolean Logic
4.5.2 Majority Voting
4.6 Cross Validation
4.7 Summary

5 Experimental Evaluation
5.1 Video Corpus
5.2 Performance measures
5.3 Training Data
5.4 Filtering
5.5 Existing Method Results
5.5.1 Single Threshold
5.5.2 Adaptive Threshold
5.5.3 Majority Voting Method
5.5.4 Boolean Logic
5.6 Results Fusion Engine Results
5.6.1 Decision Trees and Rulesets
5.6.2 Feed-Forward Neural Network
5.6.3 SVM
5.6.4 Results Fusion Method Comparison
5.7 Timing
5.8 Conclusion

6 Extensions to Summarization
6.1 Unstructured Video
6.2 Features
6.3 Camera Motion
6.4 Motion-based keyframe extraction
6.5 Results Fusion
6.6 Summary

7 Summary
7.1 Future Work

APPENDICES
A Appendix A
A.1 Television Programs
A.2 Movies
A.3 Cartoons
A.4 Music Videos

LIST OF FIGURES

1.1 Proposed Results Fusion System
1.2 Cut Between Successive Frames
1.3 Dissolve Video Edit
1.4 Fade In Video Effect
1.5 Wipe Video Edit
2.1 Typical Video Structure
2.2 Uncompressed Shot Segmentation Techniques
2.3 Typical Video Structure
2.4 Full Image (352x240) and its DC image (44x30)
3.1 Still Frame Segmented into Regions
3.2 Still Frame and Edge Image
3.3 Still and DCT Image
4.1 Browne Boolean Logic Method
4.2 Zhong Multi-stage Shot Detection
4.3 Multiple Separating Hyperplanes in 2D space
4.4 The decision boundary and optimal hyperplane in 2D space [78]. The support vectors in red denote the margin of the largest separation between the two classes.
4.5 SVM Theory with slack variables ξi and ξj [78]
4.6 SVM Transformation [78]
4.7 Two-Layer Neural Network Architecture
4.8 Neural Network Transfer Functions
4.9 Results Fusion Video Segmentation Methods
4.10 Partial Decision Tree Output of C5.0
4.11 Partial Ruleset Output of C5.0
4.12 Neural Network Implementation
4.13 Boolean Logic Method
4.14 Majority Voting Method
5.1 Video Corpus
5.2 Video Scripting Tool
5.3 Confusion Matrix
5.4 Color Histogram Precision vs. Recall Graph
5.5 Local Histogram Precision vs. Recall Graph
5.6 DCT Precision vs. Recall Graph
5.7 Single Threshold Precision and Recall Table
5.8 Single Threshold Precision Recall Graph
5.9 Single Thresholds used for each method
5.10 R. Kelly: A Music Video Sequence
5.11 Adaptive Threshold Precision vs. Recall Table
5.12 Adaptive Threshold Precision vs. Recall Graph
5.13 Majority Voting Precision vs. Recall Table
5.14 Majority Voting vs. Adaptive Local Histogram Precision vs. Recall Table
5.15 Majority Voting Precision vs. Recall Graph
5.16 Boolean Logic Precision vs. Recall Table
5.17 Boolean Logic vs. Adaptive Local Histogram Precision vs. Recall Table
5.18 Boolean Logic Precision vs. Recall Graph
5.19 Decision Tree and Ruleset Precision vs. Recall Table
5.20 Decision Tree, Ruleset, and Single Modality Decision Tree/Ruleset Table
5.21 Decision Tree, Ruleset, and Adaptive Local Histogram Table
5.22 Decision Tree and Ruleset vs. Adaptive Local Histogram Precision vs. Recall Graph
5.23 Results Fusion Neural Network and Single Modality Neural Network Table
5.24 Results Fusion Neural Network and Adaptive Local Histogram Table
5.25 Results Fusion Neural Network vs. Adaptive Local Histogram Precision vs. Recall Graph
5.26 Results Fusion SVM and Single Modality SVM Table
5.27 Results Fusion SVM and Adaptive Local Histogram Table
5.28 Results Fusion SVM vs. Adaptive Local Histogram Precision vs. Recall Graph
5.29 Results Fusion Methods Table
5.30 Results Fusion Methods Precision vs. Recall Graph
6.1 Video System Components
6.2 UAV Video Sequence
6.3 Results Fusion for Video Summarization
6.4 Motion Vectors from a Panning Video Sequence
6.5 Camera motion over a scene
6.6 Possible keyframe overlap
6.7 Example of polygon overlap
6.8 Motion-based keyframe extraction interface
A.1 24: 12am to 1am local shot effect
A.2 Blade 2 action sequence
A.3 The Royal Tenenbaums shot sequence
A.4 The Family Guy: Da Boom explosion sequence
A.5 R. Kelly: If gradual transition sequence

Chapter 1
Introduction

Locating content in digital video continues to be a significant problem as massive quantities of digital video accumulate in corporate libraries, public archives, and home collections. A key element of any method that attempts to index digital video is effective shot segmentation and summarization of the video.
Shot-based segmentation is the first step in determining the structure of the video by breaking it into the components that were originally edited together to form the final product. Summarization presents a pictorial summary of an underlying video sequence in a more compact form, eliminating the massive inter-frame redundancy present in digital video and film. Some researchers combine summarization and an additional process called abstraction into one process. Abstraction is the process of creating a series of still or moving images that is shorter in length than the original video and preserves the essential meaning of the video [88]. This thesis considers summarization and abstraction to be separate and unique operations and defines abstraction as the process of reducing the redundancy created in the summarization process due to repeated scenes or shots, maintaining the original structure of the video.

Most video content does not consist of a single continuous recording or filming, but rather a set of discrete recordings that have been edited together. Shot-based segmentation seeks to decompose this edited structure into the components used to construct the video. Shot boundaries can be created by various methods. The most elementary shot boundary is the hard cut. Other methods consist of production edits such as fades, cross-fades, dissolves, and wipes. Some shot boundaries occur between two frames of video, while others occur between multiple frames. In order for a shot boundary detection method to be effective, it must be as accurate as possible, with few false positives (incorrectly identified shot boundaries) and false negatives (missed shot boundaries).

Indexing methods require minimization of redundancy for effective performance. Were a complete video sequence added on a frame-by-frame basis to an image database, the search mechanism would be forced to contend with hundreds of sequential frames with nearly identical content, making differentiation of results very difficult. Ideally, an indexing method would operate on a compact summarization of the content with only salient elements subject to analysis. Additionally, given the limited performance of indexing methods and the questionable ability of humans to pose exact queries, it is essential that results be presented in a way that allows for very fast browsing by the user with a minimum of redundant results presented.

Great progress has been made on shot segmentation and summarization of digital video. However, it is common for much of this research to focus on specific classes of video or limited content corpora. Major projects have analyzed news broadcasts, music videos, and late-night comedians. Additionally, much of this work tends to focus on small sets of video content and has not been tested on a large video test suite. The current research has answered many questions about how to analyze video where the structure is known in advance. However, any general solution must work for a wide variety of content without the input of manually collected structural knowledge. This fact has been well known in the mature document analysis and biometric communities for many years [5, 61, 48, 67, 70, 117, 139, 146].

1.1 Questions addressed by this thesis

This thesis examines the problem of creating new algorithms for shot segmentation and video summarization that improve on the performance of existing methods.
The specific approach we apply is to examine the wide range of existing segmentation and summarization methods and blend representative algorithms into a composite system using results fusion. The primary goal is to build a system that can utilize the strengths of the various algorithms, where appropriate to the underlying content, while avoiding the weaknesses when a method is inappropriate to the content.

1.1.1 Shot Segmentation

There has been considerable research on video segmentation techniques [10, 28, 47, 74, 86, 95, 157, 155]. Each of these segmentation techniques is successful at determining shot boundaries for specific classes of video. Color histogram-based algorithms build models of the dynamic change in color distribution between successive frames and are successful when applied to video sequences where the color distribution changes abruptly between shots [47, 95, 102, 157, 156]. Model-based algorithms use learning algorithms to construct models of the temporal nature of transitions and are effective when a set of transitions is known and expected in the content [11, 115]. Motion-based algorithms attempt to track content motion through image sequences and are effective when object or camera locations change significantly between shots [95, 144]. The wide variety of techniques forms a toolbox for the system designer from which an appropriate segmentation algorithm can be chosen for a given class of video. When the type of video is known a priori, any type of technique can be employed with relative success. However, if the type of video is unknown before analysis, certain assumptions cannot be made about the video and the best choice of algorithms cannot be predetermined, nor can the appropriate parameterization of the algorithms be made (setting thresholds and intervals, for example). Research efforts, to date, have predominately focused on the independent implementation of individual segmentation algorithms [105, 106, 155, 157]. Limited research has attempted to combine shot boundary detection algorithms into a composite system [15, 119, 151, 158].

What constitutes good shot segmentation? In developing novel methods for video shot segmentation it is imperative to know what characteristics comprise good shot segmentation. Within a shot, a frame may differ from its neighboring frames through camera and object movement, focal length changes, or lighting changes [110]. A good shot segmentation algorithm should disregard frame changes within a shot. In addition, accuracy is an important factor in any shot segmentation method. The algorithm should attempt to maximize correct detections and minimize false and missed detections. In our view, missed detections are more costly than false detections. Missed shot boundaries can never be recovered; however, false detections can be corrected during the summarization and abstraction process, which can eliminate the possible redundancy.

This thesis addresses the question of the degree to which the various segmentation algorithms can be integrated to create a composite technique. The new composite technique should be sensitive to the characteristics of the video under analysis and, therefore, applicable to a larger set of content without specific per-video tuning. This research necessarily began with a study of the effectiveness of the various segmentation algorithms on a wide variety of video.
The classes of video that will be used for our study are commercial films, home video, news broadcasts, raw news footage, surveillance video, and Unmanned Aerial Vehicle (UAV) military video. The video includes both edited and unedited footage. The results of our algorithm analysis provide important information as to the characteristics that each segmentation technique exhibits for a given type of video. These characteristics are then utilized when developing a novel adaptive shot-based segmentation algorithm.

Our approach to tackling the problem of shot segmentation for a wide variety of video classes adopts results fusion techniques from the document retrieval and biometric communities. Results fusion can be thought of as a decision function that has multiple inputs and produces an output that is based on the characteristics of the inputs. This thesis has developed a fusion-based shot segmentation algorithm that fuses together several key features of digital video and performs shot segmentation based on the characteristics of those features. The features were chosen based on their ability to capture important information contained in the video sequence. The key features that are used to characterize a video segment are color, texture, motion, and compressed image characteristics. In our approach, shot segmentation is treated as a binary classification problem in which each frame in a video sequence is or is not considered a shot boundary. Results fusion strategies using Support Vector Machines (SVMs), Decision Trees, Rulesets, and Neural Networks are used to fuse the multiple features to determine a higher-quality, more accurate segmentation. Prior research has shown that these methods produce good performance for solving binary classification problems [19, 39, 75, 108, 111, 139, 149]. A minimal sketch of this classification view appears below.
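The following sketch illustrates treating per-frame fusion as binary classification: difference scores from several feature modules are stacked into a vector and an SVM labels the frame pair as boundary or non-boundary. The feature names, training values, and scikit-learn usage are illustrative assumptions, not the implementation evaluated in this thesis.

```python
# A minimal sketch of fusion-as-classification, not the thesis code.
# Each row holds hypothetical per-frame difference scores:
# [color_hist_diff, local_hist_diff, edge_diff, dct_diff]
import numpy as np
from sklearn.svm import SVC

X_train = np.array([
    [0.82, 0.75, 0.68, 0.71],   # clean cut: all modalities agree
    [0.05, 0.08, 0.04, 0.06],   # static scene: no change anywhere
    [0.78, 0.12, 0.09, 0.74],   # camera flash: color/DCT spike only
    [0.10, 0.15, 0.62, 0.11],   # object motion: edge change only
])
y_train = np.array([1, 0, 0, 0])  # 1 = shot boundary, 0 = no boundary

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

# A new frame pair; its scores resemble the cut example, so it should
# be classified as a boundary even though no single threshold is set.
frame_pair = np.array([[0.79, 0.70, 0.65, 0.68]])
print("boundary" if clf.predict(frame_pair)[0] == 1 else "no boundary")
```

Note how the flash and motion rows encode the failure modes discussed above: a single modality spikes, but the classifier can learn that isolated spikes do not constitute a boundary.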
1.1.2 Summarization

This thesis also demonstrates that results fusion and adaptation techniques can be applied to video summarization. One of the critical tools of any indexing and browsing environment is effective summarization. Video to be indexed must be presented to an indexing system with a minimum of redundancy so as to avoid redundant retrieval results and to maximize the disparity in the indexing space. Likewise, search results must be presented to human users as compact summaries that allow users to quickly browse through the candidate choices and choose the correct result or adapt the search as quickly as possible. Again, many different approaches for video summarization exist. This toolbox of approaches is utilized as the basis for an adaptive solution for video summarization that draws on the strengths of the different approaches in different application classes.

Video abstraction is the mapping of an entire video segment into a smaller number of representative images [157]. It has been recognized that representing a complete video shot with a single image is an important step towards representing video in a compact, meaningful form [131]. These images may be extracted frames from the actual video sequence or composite images constructed from the sequence using salient stills [134] methods or image mosaics [63, 97, 96]. Although these images are single frames, they do not represent one discrete moment in time. Rather, these images represent the aggregation of temporal changes that occur within a moving image sequence with the salient features preserved [16]. This abstraction has traditionally been done manually in film and video libraries. The huge volumes of video data accumulating today require fully automated techniques to reduce the role of human involvement as much as possible. However, in some instances this representation may not be enough to capture the dynamic action in complex shots. As a result, researchers have experimented with developing moving image abstracts of shots. There has been a considerable amount of research on automated video abstraction techniques [23, 24, 25, 55, 73, 85, 97, 96, 112, 126, 127]. Summarization involves reducing the redundancy created in the abstraction process due to repeated scenes or shots, maintaining the original structure of the video. Many techniques exist for developing video summaries. These techniques include moving image abstracts, video skims, and keyframe extraction.

What constitutes good summarization? Good summarization should seek to eliminate redundancy and present the video in a compact form that allows for maximum user retention and comprehension. It should briefly and concisely present the contents of the original video [118]. Its length should be shorter than the original video sequence, but how much shorter? It should only focus on the content that is important to the user, but what is important? One of the main problems with any summarization technique is that the answers to the previous questions vary from user to user. In some cases, users may need to view a few still images of the video sequence, and in others, users may need to view a short clip segment extracted from the longer video. There exists no optimal summary form. Additionally, it is difficult to measure the performance of any summarization technique in terms of a quantifiable result. The best we can do is to compare the summary to what a human user would consider optimal.

This thesis examines the integration of various summarization techniques to develop an effective composite method that, again, is generally applicable to a wide class of content. Determining the most effective summarization technique for a given video source is a difficult research problem. The summarization method should present the user with the most effective means of organizing the data for maximum understanding and saliency. Understanding the summarization output is not a well-defined concept and is likely a user-dependent concept. Additionally, the summarization method must be able to adapt to the underlying video content, presenting the user with the most effective interface for viewing the content. Moreover, a summarization method should present a concise summary that provides an overview of the video, reducing redundant information created by repeated scenes or alternating shots.

Figure 1.1: Proposed Results Fusion System

1.1.3 Results Fusion

This thesis focuses on the composition of existing algorithms into a single composite algorithm with considerably improved performance and greater general applicability. The approach that has been chosen is to combine the feature output of multiple algorithms using results fusion. Results fusion (also referred to as sensor fusion in the robotics community, classifier fusion in information retrieval, and multimodality fusion in biometrics) is the process of combining multiple evidence or sensors to improve the performance and reliability over utilizing a single evidence or sensor. The performance of any system is predicated upon the reliability of the sensor that is used.
If a single sensor is used, it may be subject to errors or unpredictable behavior in certain environments. The result of such a sensor could be unreliable. Results fusion attempts to address this problem by utilizing multiple sensors that capture different aspects of the data under analysis. Each sensor by itself may not provide superb performance; however, the appropriate combination of these sensors may produce a reliable, quality result.

An important feature of results fusion methods is the ability of the results fusion engine to generalize or adapt to novel patterns or input data [133]. In order to obtain a system that generalizes well, researchers have experimented with using multiple classifiers, with each classifier generalizing differently, utilizing separate features, modalities, or representations. In order to use all the information available, the results or features of the multiple classifiers must be combined to make a final decision.

Several methods exist for results fusion. As an example, considerable research on results fusion has come from the area of document retrieval [6, 44, 61, 81, 82, 83, 94, 123]. In this area, researchers attempt to combine multiple representations of queries and documents or multiple retrieval techniques. Research in this field has shown that significant improvements can be achieved by combining multiple evidence [6, 83]. Results fusion has also become popular for personal identification and authentication, where different methods (fingerprints, retinal scans, and other biometric measures) have limited individual reliability, but allow for greater reliability when used in conjunction, reducing the false acceptance of imposter cases [9, 48, 117, 139]. Research in handwriting and character recognition attempts to combine multiple line segments via results fusion to identify handwritten numerals [80, 145]. Additionally, research in the sensor fusion community has reported improved performance and reliability over using a single sensor. Researchers in this community attempt to fuse information retrieved from metal detectors, ground-penetrating radar, and thermal infrared imagers for the detection of land mines [27, 121].

There are important similarities between the composition of multiple methods in the document filtering community and in creating a fusion-based video segmentation technique. Just as some document filtering algorithms require different document representations [70] to improve retrieval performance, the various video segmentation algorithms utilize different abstract video representations (motion maps, histograms, etc.). Additionally, just as some document filtering algorithms fuse results from previous algorithms to improve performance, the results of various video segmentation algorithms can be fused to create an improved, unified result.

In general, there is no guarantee that the fusion of multiple strategies will improve performance over individual methods. For example, if an accurate sensor is fused with one that generates random results, then no improvement is realized. However, empirical evidence from the document analysis and biometric communities indicates that fusion is beneficial and improves performance. For example, if the optimal strategy is unknown, the fusion of multiple methods can be advantageous even if the fusion results are worse than the best individual strategy. One of the simplest such fusion rules is a majority vote over the individual detectors' decisions, sketched below.
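For concreteness, here is a minimal sketch (not taken from the thesis) of abstract-level fusion by majority vote: each detector contributes a binary per-frame decision, and a boundary is declared only when more than half agree, so a single erratic detector is outvoted.

```python
# A minimal sketch of majority-vote fusion over binary detector outputs.
def majority_vote(decisions):
    """decisions: list of 0/1 votes from individual shot detectors."""
    return 1 if sum(decisions) > len(decisions) / 2 else 0

# Three of four hypothetical detectors fire, so a boundary is declared.
print(majority_vote([1, 1, 0, 1]))  # -> 1
# A single spurious detection (e.g., a camera flash) is outvoted.
print(majority_vote([0, 1, 0, 0]))  # -> 0
```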
1.1.3.1 Document Analysis

Some document analysis researchers attempt to combine multiple representations of queries and documents or multiple retrieval techniques. An example application that incorporates results from multiple input sources is the metasearch engine. A metasearch engine is a system that provides access to multiple existing search engines. Its primary goal is to collect and reorganize the results of user queries by multiple search engines. There has been considerable research regarding the effectiveness and performance of metasearch engines [8, 21, 34, 49, 52, 60, 79, 99]. The result merging step of a metasearch engine combines the query results of several search engines into a single result. A metasearch engine usually associates a weight or similarity measure with each of the retrieved documents and returns a ranked list of documents based on this value [99]. These weights can be derived from an adjustment of the document rank value of the local search engines or by defining a global value based on all of the retrieved documents. This approach only focuses on the results of multiple searches and not a combination of the multiple search engines' functionalities. Metasearch engine research is complicated by the fact that differing search engines produce rank results that are heterogeneous and not easily compared.

Document retrieval methods have also shown increased performance when combining results from various document representations. Katzer, et al. [70] compared text document retrieval performance using different document representation methods. Their results demonstrated that the different document representations retrieved different sets of relevant documents and that performing information retrieval with multiple document representations improved retrieval performance over using a single method. Additionally, Bartell, et al. [6] and Shaw, et al. [123] combined multiple retrieval algorithms and obtained better performance than using a single method. Hull, et al. [61] combined probability estimates of multiple classifiers to improve document filtering performance.

As the document retrieval methods rely on various document representations, the various video segmentation algorithms rely on different characteristics of the underlying video as abstracted into some intermediate representation, such as a feature vector or representative image. Color-based methods create histograms of the frame content color distribution and compute distance metrics between these histograms to search for shot boundaries [47, 86, 95, 102, 156, 157]. Model-based methods create Hidden Markov Models (HMMs) of each possible state and transition in a video sequence that are used to locate transitions as temporal events [11, 115]. Edge-based methods utilize derived edge maps of the frame content to search for shot segmentation boundaries [86, 152]. Motion-based methods rely on derived velocity and displacement vectors to compute the amount of motion between video frames to determine shot boundaries [31].

1.1.3.2 Biometrics

Many researchers have focused on the fusion of face and voice data to improve personal identity verification [9, 17, 67, 117, 146]. Several studies have shown that using a multi-modality biometric system can improve on the incompleteness of any single-model biometric system [117]. Yacoub [9, 146] uses multi-modal biometric features to fuse face and voice information together via a supervising expert. Person identification is treated as a binary classification problem, with each user either belonging to the imposter class or the client class. Given the scores from the face and voice identification modules, the supervising expert finds the optimal function that separates the two classes to make verification decisions.

Genoud, et al. [48] combined several speaker verification methods to improve performance. The decision functions of each verification method are weighted with a confidence measure. The average confidence measure is tested against a global threshold to determine speaker authentication. Their research concluded that increased performance can be achieved by combining multiple methods. Ross, et al. [117] fused the results from face, fingerprint, and hand geometry analysis to increase the performance of a biometric system. In their research, the decision function of multiple biometric systems is consolidated via a summation rule, which takes a weighted average of the individual scores; a small sketch of such a rule follows. From their research it was concluded that increased performance could be obtained by using multiple modalities.
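The sketch below illustrates a summation rule of the kind just described; it is not Ross et al.'s code, and the score ranges and weights are invented for the example. Raw matcher scores on different scales are min-max normalized to [0, 1] and then combined as a weighted average, which is also the basic shape of the result-merging step in a metasearch engine.

```python
# A minimal sketch of weighted-sum score fusion with min-max normalization.
def minmax_normalize(score, lo, hi):
    """Map a raw score from its source's [lo, hi] range onto [0, 1]."""
    return (score - lo) / (hi - lo)

def sum_rule(scores, ranges, weights):
    """Weighted average of normalized scores from heterogeneous sources."""
    normed = [minmax_normalize(s, lo, hi) for s, (lo, hi) in zip(scores, ranges)]
    return sum(w * n for w, n in zip(weights, normed)) / sum(weights)

# e.g. a face matcher scoring in [0, 100], voice in [0, 1], fingerprint in [0, 10]
fused = sum_rule(scores=[81.0, 0.67, 7.9],
                 ranges=[(0, 100), (0, 1), (0, 10)],
                 weights=[0.5, 0.2, 0.3])
print(f"fused score: {fused:.3f}")  # compared against a global threshold
```

The normalization step matters: without it, the source with the widest raw score range silently dominates the average, which is one reason heterogeneous search-engine ranks are hard to merge directly.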
The results of this research suggest that increased shot segmentation and summarization performance can be achieved by fusing multiple algorithms over using a single method.

1.2 Contributions of this thesis

This thesis contributes to the overall research in digital video in seven distinct ways. Results obtained by the document analysis and biometric communities using fusion of multiple methods to increase performance suggest that an improvement of shot segmentation and summarization could be achieved by fusing together multiple methods. To achieve this goal, we select appropriate shot segmentation algorithms and examine a variety of fusion-based techniques, constructing a composite method for shot segmentation.

We then extend the general ideas and results developed in this work to summarization. The goal has been not only to develop an improved method for video summarization, but also to demonstrate the extensibility of the general concept of results fusion.

Some of the techniques that we have used to extract the features and determine shot boundary detection are not necessarily new; however, our implementation of them is. Additionally, we test our newly developed algorithms on a wide variety of different video classes. To date, much of the research in the field of digital video has been done using small test samples of structured content. Few researchers have attempted to test algorithms on large video suites that encompass the complexity and variety of different classes of content. The classes of video that were used for our study include commercial films, home video, news broadcasts, raw news footage, surveillance video, and Unmanned Aerial Vehicle (UAV) military video.

To summarize, the contributions in this thesis are:

• A Decision Tree and Ruleset based results fusion engine for shot segmentation and summarization that improves performance over using a single algorithm.
• Neural Network results fusion for shot segmentation and summarization that improves performance over using a single algorithm.
• Support Vector Machine-based results fusion for shot segmentation and video summarization that fuses key features from video and determines a best segmentation, with extensions to summarization that improve performance over using a single algorithm.
• Feature extraction and analysis for results fusion.
• Experimental validation on a large and varied video test suite.
• Adaptability to detect shot boundaries when receiving unreliable data from one or more modalities.
• A novel keyframe-based video summarization technique based on camera motion.

When video enters the proposed system, it is analyzed by each feature module. The feature extraction modules are used to extract low-level image features. These feature modules extract the necessary features and output a measure for each video frame representing the difference metric between successive frames. The outputs of the feature modules are combined using a results fusion engine. The output of the results fusion engine is a decision as to whether the current frame under analysis is a shot boundary. Figure 1.1 shows a graphical representation of the proposed system.

1.3 Definitions

This section describes some of the terms and symbols that appear throughout this thesis. Images in this thesis are presented in color.

• cut: An abrupt boundary that occurs where there is a change of shots between two consecutive frames. Figure 1.2 shows a graphical representation of a cut between pairs of frames in a music video sequence.
• dissolve: The simultaneous occurrence of a fade-in and fade-out, with both effects superimposed over a span of frames [28]. Figure 1.3 depicts a dissolve video edit.
• fade: A gradual transition of video content to or from black (or some other fixed color frame). A fade-out occurs when there is a gradual fade to a black screen, while a fade-in occurs when there is a gradual fade from a black screen. Figure 1.4 is a graphical representation of a fade-in video effect.
• keyframe: A still image that best represents the content of a video sequence in an abstract manner [157]. There is no clear criterion for selecting keyframes, and systems vary considerably on what is considered the best choice for keyframe selection or construction.
• scene: A logical grouping of shots focusing on certain objects of interest [109].
• shot: A sequence of frames captured as a single continuous action in time and space [109].
• wipe: A transition from one shot to another by selectively uncovering a contiguous region of the image, often rectangular. The effect is often like a virtual line passing across the image, clearing one picture while it brings in another, occurring over a span of frames. Figure 1.5 shows a graphical representation of a wipe video effect.

Figure 1.2: Cut Between Successive Frames
Figure 1.3: Dissolve Video Edit
Figure 1.4: Fade In Video Effect
Figure 1.5: Wipe Video Edit

1.4 Structure of this thesis

The remainder of the thesis is organized as follows. Chapter 2 reviews existing shot segmentation, summarization, and results fusion methods, noting strengths and weaknesses. Chapter 3 examines the shot detection methods implemented in this thesis. Chapter 4 examines results fusion-based algorithms and implementations for video segmentation. Experimental evaluation and testing is detailed in Chapter 5. Chapter 6 discusses how our novel results fusion-based shot segmentation methods can be extended for video summarization. A summary and proposed future work are detailed in Chapter 7.

Chapter 2
Related Work

The widespread distribution and storage of digital video has presented many challenges. The challenges arise because of the nature and characteristics of digital video. It is an inherently voluminous and redundant medium. In order to effectively utilize this medium it must be transformed into a form that is searchable, manageable, and structured.
Temporal video shot segmentation is the first step towards achieving this goal. Any method that attempts to index, browse, retrieve, or parse digital video must first segment it. The goal of video shot segmentation is to present a longer video sequence as a set of smaller, more manageable segments called shots. A shot can be characterized as a sequence of frames captured as a single continuous action in time and space [109]. Each shot is then mapped into a smaller number of representative images via an abstraction process. The abstraction process creates a series of still or moving images that is shorter in length than the original video and preserves the essential meaning of the video [88]. Summarization attempts to reduce the redundancy created in the abstraction process due to repeated scenes or shots, maintaining the original structure of the video.

Results fusion is the process of combining multiple sensors or classifiers. It is considered a general problem in various application domains such as face recognition, text categorization, person authentication, and optical character recognition. The premise behind results fusion techniques is that better accuracy and reliability can be obtained by fusing multiple evidence or sensors over using a single evidence or sensor. If a single sensor is used, it may be subject to errors or unpredictable behavior in certain environments. The result of such a sensor could be unreliable. Results fusion attempts to address this problem by utilizing multiple sensors that capture different aspects of the data under analysis. Each sensor by itself may not provide superb performance; however, the appropriate combination of these sensors may produce a reliable, quality result.

The purpose of this chapter is to give an overview of the existing methods for video segmentation and summarization. The performance and limitations of each algorithm are discussed and compared. Additionally, this chapter will discuss the current research and methods for results fusion.

2.1 Video Systems

There has been much research in the development of composite video systems that allow users to store, search, and browse digital video [56, 105, 106, 141, 140]. The Fischlar project at Dublin City University is a visual indexing system that allows users to store and browse television programs. The Informedia I and II projects at Carnegie Mellon University utilize speech information, image analysis, and natural language processing on over a terabyte of video data to facilitate video search, navigation, and retrieval [56, 141, 140]. Video segmentation and summarization are important aspects of these systems. Video segmentation is the first step in any digital video analysis system. Its goal is to capture the underlying structure of the video sequence and divide the video stream into logical subunits [156]. Most segmentation algorithms operate on the shot level; however, there has been some research on segmenting video on the story level. Figure 2.1 illustrates the hierarchy of a typical video. The second step in any composite digital video system is the summarization of a video sequence. The goal of summarization is to reduce the redundancy in the segmentation process caused by repeated scenes or long shots. Both segmentation and summarization work together as the foundation of any composite video system.

Figure 2.1: Typical Video Structure

2.2 Segmentation

A typical video sequence is composed of scenes, which are composed of shots, which are composed of frames at the lowest level. Figure 2.1 shows the structure of a typical video sequence. Most common segmentation algorithms rely on low-level image features at the shot level to partition the video. Attempting to partition the video at the scene or story level is difficult; there is no standard or universal definition of scenes or stories.

A shot is an unbroken sequence of frames taken from one source [74]. There are two basic types of shot transitions: abrupt and gradual. Abrupt transitions, called cuts, occur when a frame from a subsequent shot immediately follows a frame from the previous shot. Gradual transitions consist of slow change between frames from one shot to frames of a different shot. These types of transitions include cross-dissolves, fade-ins, fade-outs, and other graphical editing effects such as wipes [47]. A fade-in is the gradual increase of intensity starting from one frame to the next. A fade-out is a slow decrease in brightness from one frame to the next. A cross-dissolve is when one frame is superimposed on another, and while one frame gets dimmer, the other frame gets brighter. A dissolve can be considered an overlapping of a fade-in and a fade-out [28].
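The thesis does not formalize these edits mathematically; a common linear mixing model (an assumption here, not taken from the source) captures all three effects. With outgoing shot $S_1$, incoming shot $S_2$, and a transition of length $T$ frames:

$$F(x, y, t) = \left(1 - \frac{t}{T}\right) S_1(x, y, t) + \frac{t}{T}\, S_2(x, y, t), \qquad 0 \le t \le T$$

A cross-dissolve mixes two live shots; a fade-out is the special case where $S_2$ is a constant black frame, and a fade-in the case where $S_1$ is. Under this model a gradual transition produces a slow, roughly linear intensity ramp rather than the single-frame discontinuity of a cut, which is why detectors tuned for cuts tend to miss it.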
Figure 2.1 shows the structure of a typical video sequence. Most common segmentation algorithms rely on low-level image features at the shot level to partition the video. Attempting to partition the video at the scene or story level is difficult; there is no standard or universal definition of scenes or stories. A shot is an unbroken sequence of frames taken from one source [74]. There are two basic types of shot transitions: abrupt and gradual. Abrupt transitions, called cuts, occur when a frame from a subsequent shot immediately follows a frame from the previous shot. Gradual transitions consist of slow change between frames from one shot to frames of a different shot. These types of transitions include cross-dissolves, fade-ins, fade-outs, and other graphical editing effects such as wipes [47]. A fade-in is the gradual increase of intensity starting from one frame to the next. A fade-out is a slow decrease in brightness from one frame to the next. A cross-dissolve is when one frame is superimposed on another, and while one frame gets dimmer, the other frame gets brighter. A dissolve can be considered an overlapping of a fade-in and a 21 fade-out [28]. Gradual transitions are more difficult to detect because camera and object motion can inhibit the accurate detection of gradual transitions, causing false positives. There have been several research projects comparing and evaluating the perfor- mance of Shot detection techniques. Koprinska, et a1. [74] provides a survey of the existing approaches in the compressed and uncompressed domain. Dailianas [28] com- pared several segmentation algorithms across different types of video. Lienhart [86] evaluated the performance of various existing shot detection algorithms on a diverse set of video sequences with respect to the accuracy of each detection method, and the recognition of cuts, fades, and dissolves. Lupatini, et al. [95] compared and evaluated the performance of three classes of shot detection algorithms: histogram- based, motion-based, and contour-based. Boreczsky and Rowe [10] compared various compressed and uncompressed video shot detection algorithms. All of these tem- poral video segmentation algorithms can be categorized as either a compressed or uncompressed domain technique. 2.2.1 Uncompressed domain techniques The majority of segmentation algorithms operate in the uncompressed domain. Typi- cally, a similarity measure between successive frames is defined and compared against a predetermined threshold. A cut is determined when the distance value between two images falls below this predetermined threshold. Gradual transitions can be found by using complex thresholding techniques [156] or using a cumulative difference measure. Figure 2.2 categorizes the various uncompressed Shot segmentation techniques. The uncompressed algorithms can be organized into the following categories. 2.2.1 . 1 Pixel Differences One way to detect the possible changes between successive frames is to compare the corresponding pixel values between the two frames and count how many pixels have changed. If the number of changed pixels is above a predetermined threshold, a shot 22 8:68:88 nomawpnoawom comm vmmmmaaoonb “Nd “5&3 .032 8m 9.98 . £359.80 53» as 539 v.83» 8.8 30:9, :_ 30:23.5 , as 958 8m acaan 333500 62 .8» 228-20 8m 82w: . am 533sz 8m Esau—3V m_ox_a 635:0 as 5:8; «MMMMWMHLMXE .o .3522 SEE. 0 . :oaos. . - 3m mimxaéxv 6m sass: aeosfie :3 22:2,: , am 825 8:236 as .2859 as 533. A8 9.98 63.28 5252 3.: 3 2392 23m 30:20.20 2m: 3929.... 
2.2.1.2 Statistical Differences

Zhang, et al. use a likelihood ratio to compare successive frames based on the assumption of uniform second-order statistics over regions in each frame [156]. In this algorithm each frame is subdivided into k blocks and the corresponding blocks are compared based on the statistical characteristics of their intensity values. The likelihood ratio that two corresponding blocks come from different scenes can be expressed as [156]:

$$\lambda_k = \frac{\left[ \dfrac{S_i + S_{i+1}}{2} + \left( \dfrac{m_i - m_{i+1}}{2} \right)^2 \right]^2}{S_i \cdot S_{i+1}} \qquad (2.2)$$

where $m_i$ and $m_{i+1}$ are the mean intensity values for the two corresponding blocks and $S_i$ and $S_{i+1}$ are the respective variances in consecutive frames $i$ and $i+1$. The number of blocks whose likelihood ratio exceeds a threshold $T_1$ is counted as follows:

$$D(i, i+1, k) = \begin{cases} 1 & \text{if } \lambda_k > T_1 \\ 0 & \text{otherwise} \end{cases} \qquad (2.3)$$

If the number of changed blocks exceeds a second threshold $T_2$, a cut is declared. One advantage that this method has over the pixel difference method is that it improves the tolerance against noise associated with camera and object movement. Additionally, the likelihood ratio has a broader dynamic range than does the percentage used with the pixel difference method [156]. This broader dynamic range makes it easier to choose a threshold to distinguish between changed and unchanged areas. One problem with the likelihood ratio algorithm is that it is possible for two different corresponding blocks across a shot boundary to have the same density function, causing no cut to be detected. Another problem with this method is that, compared to the pixel difference method, its performance is slower as a result of the complexity of the statistically-based formula.

2.2.1.3 Histograms

The most commonly used structure for color-based segmentation is the histogram. Ranges of colors are divided into buckets and the number of pixels assigned to each bucket forms the color histogram of the video frame. Histogram-based techniques use a distance metric between histograms as a similarity measure. The basic assumption is that content does not change abruptly within shots, but does across shots [86]. Once an image has been represented as a histogram there are various distance metrics that can be used. Some of the most commonly used distance metrics are:

• Chi-square:

$$\chi^2 = \sum_{j=1}^{k} \frac{\left( H_i(j) - H_{i+1}(j) \right)^2}{H_{i+1}(j)} \qquad (2.4)$$

• Intersection:

$$\mathrm{Intersection}(H_i, H_{i+1}) = 1 - \frac{\sum_j \min\left( H_i(j), H_{i+1}(j) \right)}{N} \qquad (2.5)$$

• Absolute Bin Difference: ABD
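A minimal NumPy sketch of the two metrics defined in Equations 2.4 and 2.5 follows; it is illustrative rather than the thesis code, N is taken here to be the number of pixels per frame (a common normalization; the excerpt does not define it), and the histograms are invented.

```python
# A minimal sketch of the chi-square (Eq. 2.4) and intersection (Eq. 2.5)
# distances between per-frame color histograms.
import numpy as np

def chi_square(h1, h2, eps=1e-9):
    # eps is an added guard against empty bins, not part of Eq. 2.4
    return np.sum((h1 - h2) ** 2 / (h2 + eps))

def intersection(h1, h2, n_pixels):
    return 1.0 - np.sum(np.minimum(h1, h2)) / n_pixels

# Two hypothetical 8-bin grayscale histograms of a 352x240 frame;
# each histogram's bins sum to the pixel count.
n = 352 * 240
h_i = np.array([9000, 12000, 15000, 20000, 13000, 8000, 4480, 3000])
h_j = np.array([2000,  5000, 25000, 24000, 11000, 9000, 5480, 3000])
print(f"chi-square:   {chi_square(h_i, h_j):.1f}")
print(f"intersection: {intersection(h_i, h_j, n):.3f}")  # 0 = identical
```

Both metrics ignore where pixels sit in the frame, which is the source of the robustness to object motion noted above, and also of the failure mode where two different shots with similar color palettes go undetected.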