ROBUST GLOBAL MOTION COMPENSATION AND ITS APPLICATIONS

By

Seyed Morteza Safdarnejad

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical Engineering - Doctor of Philosophy

2017

ABSTRACT

ROBUST GLOBAL MOTION COMPENSATION AND ITS APPLICATIONS

By Seyed Morteza Safdarnejad

This thesis presents algorithms for robust global motion compensation (GMC). GMC algorithms are used to remove camera motion and transform the video such that in the resultant video, the background appears static and the only motion arises from foreground objects. Many computer vision algorithms are tailored for static cameras, and using GMC as a pre-processing module, it is possible to apply these algorithms to videos from moving cameras. For instance, motion-based video analysis is strongly affected by camera motion. If camera motion is not compensated, it interferes with the motion of interest, such as human motion, and makes the analysis more challenging.

Generally, in sequential schemes, GMC estimates the homography transformation between two consecutive frames by matching keypoints on the frames, and maps the second frame to the first frame. Then, by accumulating these transformations, a composite transformation is calculated which maps each frame to the global coordinate. However, existing GMC algorithms are sensitive to the existence of foreground motion and fail easily in the case of considerable foreground motion or ambiguous and low-texture background. To address the challenges in GMC, first, we propose a Robust Global Motion Compensation (RGMC) algorithm which explicitly suppresses the foreground effect and utilizes a comprehensive probabilistic verification model to find the best mappings between consecutive frames. Despite the robustness offered by RGMC, we further identify the problem of temporal drift of the estimation, due to accumulation of errors in the estimation of mappings between consecutive frames. To address the issues of sequential GMC, we then propose a Temporally Robust Global Motion Compensation (TRGMC) algorithm which, by joint alignment of the input frames, estimates accurate and temporally consistent transformations to the global coordinate. Joint alignment not only leads to the temporal consistency of GMC, but also improves GMC stability by exploiting the redundancy of information.

Many applications can benefit from a reliable and accurate GMC algorithm. We first briefly look into these applications. Then, among the many applications, we investigate the problem of sequence alignment, and propose an alignment algorithm for non-overlapping sequences, enabled by the performance of TRGMC. Given the transformation to a global coordinate offered by TRGMC, and the capability of background reconstruction using TRGMC results, we are able to align sequences even if the spatial overlap between the sequences is minimal or nonexistent. To this end, we first spatially align the sequences such that extrapolated backgrounds are aligned well and trajectories of moving objects are spatially smooth in the global coordinate. Next, we temporally align the sequences based on the smoothness of the spatio-temporal trajectories of moving objects across the fields of view of different cameras.

Copyright by
SEYED MORTEZA SAFDARNEJAD
2017

This thesis is dedicated to Samira, my beautiful, considerate wife, without whose help and encouragement I would have achieved far less in my life.
ACKNOWLEDGMENTS

Firstly, I would like to express my gratitude to my thesis advisor, Dr. Xiaoming Liu, whose knowledge, understanding, and dedication added considerably to my Ph.D. experience. Throughout my Ph.D., all the time he has spent on dedicated one-on-one discussions, thoughtful and helpful comments, and careful review of papers has prepared me better for this dissertation and a successful career in the future. Also, I am extremely thankful to my advisor Dr. Lalita Udpa for her support and help during every single step of my Ph.D., and for all that I had the chance to learn from her. Without her support, my graduate studies path would have been much more difficult. I am very grateful that I had the opportunity to learn from two advisors, by closely working with Dr. Liu and Dr. Udpa. Dr. Liu and Dr. Udpa definitely provided me with the skills that I needed to successfully complete my thesis.

I would also like to thank the remainder of my committee members, Dr. Satish Udpa and Dr. Arun Ross, for their valuable comments and contributions along the way. I thank my fellow labmates, both in the computer vision lab and the non-destructive evaluation lab, for all the fun we had during the past years, and everything I learned from them during discussions and meetings.

Last but not least, I would like to thank my family: my parents and my brothers and sister, for supporting me spiritually throughout writing this thesis and my life in general. I do not know how to thank my awesome wife, Samira, without whom nothing in my life would have been as exciting and meaningful. Thanks for always being there for me.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

Chapter 1  Introduction and Contributions
  1.1 Organization
  1.2 Contributions

Chapter 2  Background
  2.1 Homography transformation
  2.2 RANSAC
  2.3 Congealing

Chapter 3  Robust Global Motion Compensation
  3.1 Previous Work
  3.2 Proposed Method
    3.2.1 Foreground Suppression
    3.2.2 Homography Verification Model
  3.3 Experimental Results
    3.3.1 Dataset
    3.3.2 Parameters
    3.3.3 Evaluation metric
    3.3.4 Accuracy assessment
    3.3.5 Computational cost
    3.3.6 Qualitative evaluation
  3.4 Conclusions
Chapter 4  Temporally Robust Global Motion Compensation
  4.1 Previous Work
  4.2 Proposed TRGMC Algorithm
    4.2.1 Formulation of keypoint-based congealing
    4.2.2 Optimization solution
    4.2.3 Weight assignment
    4.2.4 Initialization
    4.2.5 Outlier handling
    4.2.6 Alignment of non-keyframes
  4.3 Experimental Results
    4.3.1 Baselines and details
    4.3.2 Datasets and metric
    4.3.3 Quantitative evaluation
    4.3.4 Qualitative evaluation
    4.3.5 Computational efficiency
    4.3.6 Accuracy vs. efficiency trade-off
  4.4 Conclusions

Chapter 5  Global Motion Compensation Applications
  5.1 Motion panorama
  5.2 Raster scan of scenes/Image mosaic
  5.3 Background reconstruction
  5.4 Foreground segmentation
  5.5 Human action recognition
  5.6 Multi-object tracking (MOT)
  5.7 Spatio-temporal alignment of non-overlapping sequences

Chapter 6  Spatio-Temporal Alignment of Non-Overlapping Sequences from Independently Panning Cameras
  6.1 Introduction
  6.2 Previous Work
    6.2.1 Jointly moving cameras
    6.2.2 Cameras following similar trajectories
    6.2.3 Stationary cameras at different views
    6.2.4 Time synchronization
  6.3 Proposed Method
    6.3.1 Notations
    6.3.2 Spatial alignment
      6.3.2.1 Global motion compensation
      6.3.2.2 Spatial alignment of overlapping backgrounds
      6.3.2.3 Spatial alignment of non-overlapping sequences
    6.3.3 Temporal alignment
      6.3.3.1 Estimation of temporal offset
      6.3.3.2 Motion trajectory filtering
  6.4 Experimental results
    6.4.1 Dataset
    6.4.2 Qualitative results
    6.4.3 Quantitative results
    6.4.4 Computational cost
    6.4.5 Limitations
  6.5 Conclusions

Chapter 7  Conclusion and Future Work
  7.1 Conclusions and discussions
  7.2 Future work
    7.2.1 Speed-up TRGMC
    7.2.2 Joint alignment and outlier rejection

LIST OF TABLES

Table 3.1: Impact of different settings on average BRE for each algorithm. DT and LT denote the default (τs = 1000) and lowered (τs = 100) detection threshold in the SURF algorithm, respectively. For RGMC, τs = 100 is used and 3 different settings of (TC, TM) are reported. D-M and D-E denote the default setting of (TC, TM) = (50, 100) with motion history and error handling turned off, respectively.

Table 4.1: Comparison of GMC algorithms on the quantitative dataset (*GT: Ground truth, BF: Backward-Forward, B: Backward).

Table 4.2: Comparison of GMC algorithms on the qualitative dataset.

Table 6.1: Temporal and spatial alignment error in seconds and pixels, respectively, for real (R) and synthetic (S) sequences.

Table A.1: Comparison of multiple datasets for action recognition (AR), scene understanding (SU), and genre categorization (GC).

Table A.2: Performances (genre categorization accuracy) of different baseline algorithms on SVW.

Table A.3: Performance of different combinations of trajectory descriptors on SVW.

LIST OF FIGURES

Figure 3.1: RGMC algorithm flowchart: (a) color indicates various motion vector clusters, (b) the merged cluster of background, (c) the motion history, and (d) the motion compensated video.

Figure 3.2: (a) Motion history Mt, (b) mask M = I(Mt > τ), (c) edge matching for an accurate θt that matches the background, (d) edge matching for an inaccurate θt that matches the foreground.

Figure 3.3: (a-b) Two consecutive frames and the matched quadruplet by the labeler, (c) the absolute difference of the two frames matched via the quadruplet in (a,b), (d) manually labeled foreground mask.
Figure 3.4: Empirical and fitted distributions for (a) E ∼ N(0.52, 0.04), (b) ∆s ∼ N(0, 2 × 10−5), (c) ∆α ∼ N(0, 2 × 10−3), (d) tx ∼ Laplace(0, 1.50), (e) ty ∼ Laplace(0, 0.95), and (f) H ∼ N(2.1, 0.25).

Figure 3.5: Sample frames of the test videos in (a) SVW, (b) HMDB51, and (c) Hollywood2 datasets.

Figure 3.6: Each row shows GMC results of two consecutive frames from video IDs S17, S19, and S9 by (a) manual labeling, (b) MLESAC, (c) HEASK, and (d) RGMC. In (a), colorful pixels show the pixels that are different between overlaid frames. In (b-d), the pixel brightness indicates the difference.

Figure 3.7: Per-video BRE using the optimal setting for each algorithm compared with ground truth (GT) matching BRE.

Figure 3.8: A 40-frame sequence of gymnastics backflips in a textureless background stitched using (a) MLESAC, (b) HEASK, and (c) RGMC. Consistency of the background shows the superiority of RGMC. For HEASK, stitching up to frame #10 is shown, after which the stitching drastically fails.

Figure 4.1: Schematic diagrams of the proposed TRGMC and existing sequential GMC algorithms, and the resultant motion panorama for a video shot by panning the camera up and down. Background continuity breaks easily in the case of the sequential GMC [88].

Figure 4.2: Flowchart of the TRGMC algorithm.

Figure 4.3: The notation used in TRGMC.

Figure 4.4: Comparison of the ratios of background-foreground matches for (a) sequential GMC and (b) TRGMC.

Figure 4.5: (a) The input frame, (b) the reliability map, with the red color showing higher reliability.

Figure 4.6: Average BRE of frame pairs versus the time difference between the two frames.

Figure 4.7: Top view of the frames and links (a) before and (b) after TRGMC. The parallel links in (b) show successful spatial alignment of keypoints. Average of frames (c) before and (d) after TRGMC. For better visibility, we show up to 15 links emanated per frame.

Figure 4.8: Composite image formed by overlaying frame n on frame 1 for several videos after TRGMC. Left to right, top to bottom, n is equal to 144, 489, 912, 93, respectively. In the overlap region the difference between the frames is shown.

Figure 4.9: Error and efficiency vs. the keyframe selection step, ∆f.

Figure 5.1: Temporal overlay of frames from different videos processed by TRGMC. The trajectory of the center of the image plane over time is overlaid on each plot to show the camera motion pattern, where color changes from blue to red with the progression of time.

Figure 5.2: Background reconstruction results.

Figure 5.3: Foreground segmentation: (a) Input frame, (b) reconstructed background, (c) difference of (a,b) on (a).
Figure 5.4: Dense trajectories of the (a) original video, and (b) TRGMC-processed video.

Figure 5.5: Multi-player tracking using [3] for a football video with camera panning to the right, before (top) and after processing by TRGMC (bottom).

Figure 6.1: (a) Top view of the spatio-temporal FOV of two moving cameras capturing sequences S1 and S2; non-overlapping sequences (NOS) may not even cover some common spatial region over the progression of time, i.e., no overall spatial overlap will exist. (b) Spatio-temporal alignment of NOS results in displaying sequences from multiple freely panning cameras in a common coordinate and at the correct time shift.

Figure 6.2: Various scenarios in spatio-temporal alignment of sequences: (a) jointly moving cameras, (b) independently moving cameras at different times following similar trajectories, (c) stationary cameras with different viewpoints, (d) the proposed independently panning cameras with non-overlapping sequences.

Figure 6.3: Flowchart of our spatio-temporal alignment algorithm. First, spatial alignment is performed by background reconstruction for each sequence (a) and aligning the backgrounds (b). Second, given the spatial alignment parameters, keypoint trajectories (c) are mapped to the world coordinate and the best temporal alignment in terms of continuity of moving object trajectories is found (d). Finally, spatio-temporal alignment parameters are used for displaying the sequence in a world coordinate system and at the correct time shift (e).

Figure 6.4: Spatial alignment of non-overlapping sequences using background extrapolation and smoothness of object trajectories.

Figure 6.5: Trajectories, tracks, and the fitted space-time curve to the tracks from 3 videos.

Figure 6.6: Spatial alignment of non-overlapping sequences. Top to bottom: reconstructed backgrounds of two sequences with negligible overlap, extrapolated backgrounds, and aligned background with the trajectory of moving objects overlaid on the background.

Figure 6.7: Each row shows spatio-temporal alignment results on a set of real NOS, with some overall spatial overlap. For each sequence, input frames at the estimated time shift and trajectories of moving objects in the world coordinate are shown. The input frames are transformed to the world coordinate to make a composite image via alpha blending.

Figure 6.8: Results for two synthetic NOS from an accident footage (S2). (a) Trajectories of moving objects, (b) aligned input frames, (c) original frame from which the synthetic frames are cropped.

Figure 6.9: Each row shows spatio-temporal alignment results on a set of real NOS, with no overall spatial overlap. For each sequence, input frames at the estimated time shift and trajectories of moving objects in the world coordinate are shown. The input frames are transformed to the world coordinate to make a composite image via alpha blending.

Figure 7.1: Comparison of the links made by (a) TRGMC with (b) the links which might be made via the proposed speed-up scheme. Chosen links by minimum spanning tree are shown in red and links between adjacent frames are shown in blue.
Figure A.1: Sample frames from all 30 sports categories of SVW.

Figure A.2: SVW challenges: (a) Related equipment does not exist, (b) Background is cluttered and uncorrelated with the sport, (c) Uncommon camera angles increase the intra-class variations, (d) Multiple sports co-exist (1: Hurdling, 2: Long jump, 3: Cycling).

Figure A.3: Annotated action categories ([343, 359, Forearm], [380, 400, Set], [438, 454, Spike]) within a video from the Volleyball genre category. Since distinct actions from the same sport genre may share a common field, visual appearance alone is not enough for action recognition in SVW.

Figure A.4: Distribution of (a) number of participants in videos, (b) aspects of the action field, and (c) camera view angles, in 30 categories. An irrelevant field is a field from whose appearance the sports category cannot be deduced (e.g., practicing in the backyard). A shared field refers to the condition in which, from just the field appearance, more than one sports category might be inferred (e.g., track and field sports). A unique field is one for which, just from the field context, the corresponding sports category can be conjectured (e.g., bowling tracks).

Figure A.5: Confusion matrices of (a) context-based and (b) motion-based categorization algorithms.

LIST OF ALGORITHMS

Algorithm 1: Robust Global Motion Compensation
Algorithm 2: TRGMC Algorithm

Chapter 1
Introduction and Contributions

Due to the boom of smartphones and the ever increasing amount of videos, video analysis has received much attention in computer vision. A variety of problems are defined for video analysis, including activity recognition [45, 51, 52, 71, 81, 122], event/action detection [25, 47, 125], video categorization [29, 73, 118, 126, 129, 130], video saliency detection [21, 56, 84, 91, 94, 123, 131, 132], etc. Effective motion analysis is the gist of many vision problems, e.g., action recognition, video annotation, and video surveillance. On the other hand, as video analysis research is maturing, the era of designing algorithms based on staged videos has passed and datasets of unconstrained real-world videos are emerging. However, unconstrained videos bring in new challenges in video analysis.

For instance, motion-based video analysis is highly affected by camera motion. Thus, global motion compensation (GMC) algorithms are used to remove intentional (due to camera pan/tilt/zoom) and unwanted (e.g., due to hand shaking) camera motion. GMC is utilized in applications such as video stitching, or as pre-processing for motion-based video analysis. Due to its importance, this dissertation focuses on GMC and how it might be used in different applications. The term “global motion compensation” is also used in the video coding literature, where background motion is estimated roughly to enhance the video compression performance [40, 98], in some compression formats such as MPEG-4.

Normally, GMC estimates the homography transformation between two consecutive frames by matching keypoints on the frames, and maps the second frame to a global coordinate.
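As a concrete illustration of this sequential bookkeeping, the following minimal sketch (assuming OpenCV and NumPy; ORB is used only as a stand-in for the SURF detector employed later in this thesis, and the helper names are illustrative) estimates a pairwise homography from keypoint matches and chains the results into composite transformations toward the coordinate of the first frame.

import cv2
import numpy as np

def pairwise_homography(frame_a, frame_b):
    """Estimate the homography that maps frame_b onto frame_a from keypoint matches."""
    detector = cv2.ORB_create(4000)            # stand-in detector; the thesis uses SURF
    kp_a, des_a = detector.detectAndCompute(frame_a, None)
    kp_b, des_b = detector.detectAndCompute(frame_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_b, des_a)
    src = np.float32([kp_b[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_a[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H

def sequential_gmc(frames):
    """Chain pairwise homographies so that every frame maps to the coordinate of frame 0."""
    composites = [np.eye(3)]
    for prev, curr in zip(frames[:-1], frames[1:]):
        H = pairwise_homography(prev, curr)     # maps curr -> prev
        composites.append(composites[-1] @ H)   # maps curr -> frame 0 (global coordinate)
    return composites

The chaining in the last line is precisely why a single poor pairwise estimate corrupts every subsequent frame, which motivates the robustness and joint-alignment ideas developed in Chapters 3 and 4.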
To remedy outliers in keypoint matches, robust techniques are proposed for homography estimation, e.g., RANSAC [35], by assuming that the number of outliers to the correct homography is less than that of inliers. However, in the presence of predominant foreground, i.e., moving objects and people, a larger proportion of the putative matches are mismatches. Predominant foreground may result from a higher percentage of coverage by foreground pixels, or occlusion, textureless and non-informative background, blurred background (e.g., camera following the foreground motion), or a combination of these reasons. In the presence of predominant foreground, the common variations of RANSAC have little chance of selecting a minimal set of background keypoints by random sub-sampling in a limited number of iterations. Despite its importance, the predominant foreground problem has been overlooked in both video stabilization and GMC algorithms. Since GMC estimates the homography between consecutive frames and then uses a cascade of homographies to map the current frame to the global motion-compensated coordinate, failure in GMC at a single frame affects all the subsequent frames. This renders the predominant foreground problem very common and significant. Thus, GMC robustness is highly desirable. The GMC problem is also aggravated as the speed of foreground motion increases, e.g., in sports videos.

To address the predominant foreground problem, we propose a robust GMC (RGMC) method for suppressing foreground keypoint matches and mismatches, enabling reliable homography estimation in the presence of predominant foreground and/or textureless background. Also, we propose a novel and efficient probabilistic model for homography verification that considers keypoint matching error and consistency of the image edges after warping, and benefits from motion history gleaned from prior matched frames. We demonstrate the superiority of RGMC on challenging videos from three video datasets, when compared with state-of-the-art methods.

Our further investigations reveal that the sequential processing scheme causes frequent GMC failures for multiple reasons: 1) Sequential GMC is only as strong as the weakest pair of consecutive frames. A single frame with high blur or dominant foreground motion can cause the rest of the video to fail. 2) Generally, multiple planes exist in the scene. The common assumption of a single homography will accumulate residual errors into substantial errors. 3) Even if the error of consecutive frames is on a sub-pixel scale, due to the multiplication of several homography matrices, the error can be significant over time [74]. These problems are especially severe when processing long videos and/or when the camera motion becomes more complicated. For instance, when the camera pans left and right repeatedly, or severe camera vibration exists, the GMC error becomes obvious as discontinuity in the background. Although RGMC introduces robustness to the failures, it still suffers from accumulation of error.

To address the issues of sequential GMC, we propose a temporally robust global motion compensation (TRGMC) algorithm which, by joint alignment of the input frames, estimates accurate and temporally consistent transformations to the global motion-compensated coordinate. TRGMC densely connects pairs of frames by matching local keypoints. Joint alignment (a.k.a.
congealing) of these frames is formulated as an optimization problem where the transformation of each frame is updated iteratively, such that for each link interconnecting a keypoint pair, the spatial coordinates of the two end points are identical. This novel keypoint-based congealing, built upon succinct keypoint coordinates instead of high-dimensional appearance features, is the core of TRGMC. Joint alignment not only leads to the temporal consistency of GMC, but also improves GMC stability by exploiting the redundancy of information. The improved stability is crucial for GMC, especially in the presence of considerable foreground motion, motion blur, non-rigid motion like water, or low-texture background. The joint alignment scheme also provides capabilities such as coarse-to-fine alignment, i.e., alignment of the keyframes followed by the non-keyframes, and appropriate weighting of keypoint matches, which cannot be naturally integrated into sequential GMC. Our quantitative experiments reveal that TRGMC pushes the alignment error close to human performance.

Many applications may benefit from an accurate and robust global motion compensation algorithm. We briefly review these applications, namely human action recognition, motion panorama creation, multi-object tracking for moving cameras and when visual cues are insufficient for reliable tracking, and spatio-temporal alignment of video sequences.

Furthermore, among many potential applications, we deeply investigate the problem of spatio-temporal alignment of multiple video sequences, captured by freely panning handheld cameras. We identify and tackle a novel scenario of this problem referred to as Non-Overlapping Sequences (NOS). NOS are captured by multiple freely panning handheld cameras whose fields of view might even have no direct spatial overlap. However, over the progression of time, there are nearby regions in the scene that are observed by the cameras independently and probably at distinct time instants. This assumption is less restrictive than requiring a common region to be observed by the fields of view of different cameras over the progression of time, and obviously much less restrictive than the common requirement of direct spatial overlap between frames from different cameras. With the popularity of mobile sensors, NOS arise when multiple cooperative users capture a public event to create a panoramic video, or when consolidating multiple recordings of an incident or crime scene into a single video. This enables reconstruction of events or crime scenes captured by amateur users.

To tackle this novel scenario, we first spatially align the sequences by reconstructing the background of each sequence using the TRGMC algorithm and then registering these backgrounds, even if the backgrounds are not overlapping. To do this, first, the reconstructed background images are extrapolated. Then, a cost function is defined and minimized such that, while the extrapolated backgrounds are aligned well, the trajectories of moving objects leaving the field of view of one camera and entering the field of view of another are spatially smooth. Given the spatial alignment, we temporally synchronize the sequences, such that the trajectories of moving objects (e.g., cars or pedestrians) across sequences are consistent with the prediction of when a moving object leaving the field of view of one camera would appear in the field of view of another.

Finally, to develop algorithms for analyzing user-generated videos, unconstrained and representative datasets are of great significance.
For this purpose, we collected a dataset of Sports Videos in the Wild (SVW), consisting of videos captured by users of a leading sports training smartphone app (Coach’s Eye®) while practicing a sport or watching a game. The dataset contains 4000+ videos selected by reviewing ∼85,000 videos, and consists of 30 sports categories and 44 actions. Videos of sports practice, which frequently happens outside the typical sports field, have huge intra-class variations due to background clutter, unrepresentative environment, existence of different training equipment and, most importantly, imperfect actions. On the other hand, using smartphones for video capturing by ordinary people, in comparison to videos captured by a professional crew for broadcasting, leads to challenges due to camera vibration and motion, occlusion, view point variation, and poor illumination. Given various manual labels, this dataset can be used for a wide range of computer vision applications, such as action recognition, action detection, genre categorization, and spatio-temporal alignment. On the sport genre categorization problem, we also design the evaluation protocol and evaluate three different methods to provide baselines for future works.

1.1 Organization

The remainder of this thesis is outlined as follows. In Chapter 2, we present the related background and theory for global motion compensation. Chapter 3 presents robust global motion compensation (RGMC). The further refinements to GMC by the TRGMC algorithm are discussed in Chapter 4. In Chapter 5, we briefly review potential applications of GMC, and in Chapter 6, we focus on one of these applications and propose an algorithm for spatio-temporal alignment of non-overlapping sequences. In Chapter 7, we present the conclusion and the proposed future work. Details on the collected amateur sports video dataset, SVW, are presented in Appendix A.

1.2 Contributions

In this thesis, the challenging problem of global motion compensation for real-world videos is addressed. There are many factors rendering this problem challenging. The following contributions are made in consideration of these factors.

• To address the important challenge of foreground occlusion in global motion compensation, a novel sequential global motion compensation algorithm is proposed. Namely, Robust Global Motion Compensation (RGMC) explicitly suppresses the foreground effect on estimation of the homography between consecutive frames. Further, to evaluate each candidate homography, a novel probabilistic verification model is proposed which integrates motion history, edge matching, and point matching scores for homography evaluation.

• A novel joint alignment algorithm named Temporally Robust Global Motion Compensation (TRGMC) is proposed. Benefiting from the joint alignment, TRGMC further avoids the temporal drift problem. Also, further robustness is achieved since, unlike sequential schemes, failure in alignment of a single pair of consecutive frames will not affect all the upcoming frames.

• Among many potential applications of RGMC and TRGMC, we further investigate the problem of sequence alignment. Based on the capability of transferring the frames to a global coordinate and also background reconstruction, we identify a novel scenario in sequence alignment and propose an algorithm for it. Namely, the proposed algorithm is capable of performing spatio-temporal alignment of non-overlapping sequences from freely panning cameras.
• A dataset of Sports Videos in the Wild (SVW) is collected, which consists of videos captured by users of a leading sports training smartphone app while practicing a sport or watching a game. SVW is more unconstrained than existing human action datasets, especially sports datasets. On the sport genre categorization problem, we also design the evaluation protocol and evaluate three different methods to provide baselines for future works.

Chapter 2
Background

In this chapter, an overview of the related theory and mathematics for global motion compensation is presented. Our proposed algorithms rely on estimating the homography transformation, so we first cover this transformation. A robust algorithm for estimating the homography transformation from noisy keypoint matches is random sample consensus (RANSAC) [35]; we also review the RANSAC algorithm in this chapter. Finally, congealing, as a technique for alignment of a stack of images, is closely related to our proposed temporally robust GMC. The basics of congealing are also presented in this chapter.

2.1 Homography transformation

Homography estimation is a key step in many computer vision applications. Assuming a pinhole camera model, two images taken from different viewpoints of the same planar scene are related by a homography transformation. Under homography H, the 3D point coordinates of camera 1, i.e., X1, are related to the 3D point coordinates of camera 2, i.e., X2, via

X_2 = H X_1.   (2.1)

Similarly, if a camera has pure rotation around its optical center, the images are related to each other through a homography transformation. In the homogeneous image coordinates x1 and x2, there is scale ambiguity, and thus we have

x_2 \sim H x_1.   (2.2)

There is a wide range of techniques for homography estimation; a good survey on these techniques can be found in [1]. H is a 3 × 3 matrix, but as it is defined only up to a scale, the total number of degrees of freedom is 8. As each point correspondence between two images provides two constraints, four point correspondences are enough to find the homography describing the transformation between two given images.

A simple homography estimation algorithm is the Direct Linear Transform (DLT). First, the relationship between two corresponding points is written as

c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix},   (2.3)

where c is a non-zero constant, H = \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{bmatrix}, and (u, v, 1)^T and (x, y, 1)^T represent x_2 and x_1, respectively. After some manipulation, we get the following equations,

-h_1 x - h_2 y - h_3 + (h_7 x + h_8 y + h_9) u = 0,   (2.4)
-h_4 x - h_5 y - h_6 + (h_7 x + h_8 y + h_9) v = 0.   (2.5)

Rewriting in matrix form, we have

A h = 0,   (2.6)

where A = \begin{bmatrix} -x & -y & -1 & 0 & 0 & 0 & ux & uy & u \\ 0 & 0 & 0 & -x & -y & -1 & vx & vy & v \end{bmatrix} and h = (h_1, h_2, h_3, h_4, h_5, h_6, h_7, h_8, h_9)^T. Thus, by solving Eqn. 2.6, the DLT algorithm finds the homography. Each point correspondence makes up two rows of A, so if at least four corresponding points are available, the resultant 8 × 9 matrix A may be used, and the 1D null space of A is the solution space for h.
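A minimal NumPy sketch of the DLT construction above, stacking the two rows of Eqn. 2.6 for each correspondence and reading h off the null space of A via the SVD, is given below; the function name is illustrative, and for more than four correspondences the smallest-right-singular-vector solution is the usual least-squares surrogate for an exact null space.

import numpy as np

def dlt_homography(pts_src, pts_dst):
    """Direct Linear Transform: estimate H such that pts_dst ~ H * pts_src.

    pts_src, pts_dst: (N, 2) arrays of correspondences (x, y) <-> (u, v), N >= 4.
    """
    rows = []
    for (x, y), (u, v) in zip(pts_src, pts_dst):
        # Each correspondence contributes the two rows of Eqn. 2.6.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows, dtype=float)
    # h spans the (approximate) null space of A: the right singular vector
    # associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    h = Vt[-1]
    return h.reshape(3, 3) / h[-1]   # resolve the scale ambiguity (assumes h9 != 0)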
2.2 RANSAC

Random Sample Consensus (RANSAC), which was first introduced by Fischler and Bolles [35], is an algorithm for fitting a mathematical model to experimental data. Specifically, when the data contain outliers, RANSAC fits the model by detecting the outliers and fitting to the inliers. A data item is considered an outlier if it does not fit the true model reflecting the true set of parameters. Interestingly, the percentage of outliers for which RANSAC can find a proper model can be larger than 50%, which is the breakdown point for many other techniques. In the context of homography estimation from keypoint correspondences, RANSAC is very suitable, as false matches due to appearance ambiguities in keypoint matching are outliers to the homography model describing the relationship between the points from the two images.

RANSAC is an iterative algorithm with two steps:

1. Hypothesize: A sample subset containing minimal data items, e.g., 4 keypoint correspondences in the case of homography estimation, is randomly selected from the input dataset. Using only this subset, the model is estimated. This is in contrast to methods such as least squares robust estimators, which use all the available data, possibly with different weights, to estimate the model.

2. Test: RANSAC identifies the elements in the input dataset which are consistent with the model, as the inliers to the model, or the consensus set.

The iterative procedure is repeated until the probability of finding a better consensus set drops below a certain threshold.

Over time, many different variations of RANSAC have been proposed. RANSAC can be sensitive to the choice of the correct noise threshold that defines which data points fit a model instantiated with a certain set of parameters. If the threshold is too large, then all the hypotheses may be ranked equally good. In contrast, if the noise threshold is too small, the estimated parameters tend to be unstable, i.e., by adding or removing a single data item from the set of inliers, the estimate of the parameters may change considerably. For instance, to partially compensate for this undesirable effect, Torr et al. proposed MLESAC (Maximum Likelihood Estimation SAmple and Consensus) [107]. Instead of ranking each consensus set based on its cardinality, MLESAC evaluates the quality of the consensus set by calculating its likelihood.
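The hypothesize-and-test loop can be sketched as follows for homography fitting. This bare-bones illustration reuses the hypothetical dlt_homography helper from the previous sketch and omits degenerate-sample checks and adaptive stopping.

import numpy as np

def ransac_homography(pts_src, pts_dst, n_iters=1000, inlier_thresh=3.0):
    """Fit a homography to the largest consensus set of putative matches."""
    n = len(pts_src)
    src_h = np.hstack([pts_src, np.ones((n, 1))])        # homogeneous source points
    best_inliers = np.zeros(n, dtype=bool)
    for _ in range(n_iters):
        # Hypothesize: a minimal sample of four correspondences.
        sample = np.random.choice(n, 4, replace=False)
        H = dlt_homography(pts_src[sample], pts_dst[sample])
        # Test: count the matches consistent with the hypothesized model.
        proj = src_h @ H.T
        proj = proj[:, :2] / proj[:, 2:3]
        err = np.linalg.norm(proj - pts_dst, axis=1)
        inliers = err < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the final consensus set.
    return dlt_homography(pts_src[best_inliers], pts_dst[best_inliers]), best_inliers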
2.3 Congealing

Congealing refers to the problem of unsupervised alignment of an ensemble of images. Generally, the parametric nature of the misalignment (translation, similarity, affine, etc.) should be known in advance, and the images should have similar content and appearance. The seminal work of Learned-Miller [53] utilizes the sum of the entropies of the ensemble of images as the cost function. To mitigate the sensitivity issue of this method, Cox et al. [18] propose an SSD (sum of squared differences) cost function, optimized via the Gauss-Newton method. The misalignment function ξ is defined over a stack of N images,

\arg\min_{\Phi} \xi(\Phi),   (2.7)

where Φ = {θ1, θ2, ..., θN−1} is the set of N − 1 warp parameter vectors corresponding to the images in the stack. The parametric warp function for the pixel coordinate x is denoted by W(x; θ). In the least-squares congealing method of Cox et al. [18], the misalignment of image i, Ii, relative to the rest of the images in the stack is defined as

\xi_i(\theta) = \sum_{j=1; j \neq i}^{N} [I_j - I_i(\theta)]^2.   (2.8)

The nonlinear Eqn. 2.8 is difficult to minimize, so it is linearized by taking the first-order Taylor expansion around Ii(θ), and the increment ∆θ is estimated using

\arg\min_{\Delta\theta} \sum_{j=1; j \neq i}^{N} \left[ I_j + \frac{\partial I_j(\theta)^T}{\partial\theta} \Delta\theta - I_i \right]^2,   (2.9)

where \frac{\partial I_j(\theta)^T}{\partial\theta} are the steepest descent images calculated by \frac{\partial I_j(\theta)^T}{\partial\theta} = \frac{\partial W}{\partial\theta} \nabla I_j(\theta). The solution to Eqn. 2.9 is given by

\Delta\theta = H^{-1} \sum_{j=1; j \neq i}^{N} \frac{\partial I_j(\theta)}{\partial\theta} \left( I_j(\theta) - I_i \right),   (2.10)

where

H = \sum_{j=1; j \neq i}^{N} \frac{\partial I_j(\theta)}{\partial\theta} \frac{\partial I_j(\theta)^T}{\partial\theta}   (2.11)

is referred to as the pseudo-Hessian. Iteratively solving for ∆θ and updating θ until convergence leads to the set of aligning warp parameters.

Chapter 3
Robust Global Motion Compensation

The objective of global motion compensation (GMC) is to remove intentional (due to camera pan/tilt/zoom) and unwanted (e.g., due to hand shaking) camera motion. GMC is utilized in applications such as video stitching, or as pre-processing for motion-based video analysis. Effective motion analysis is the gist of many vision problems, e.g., action recognition, video annotation, and video surveillance. For instance, in action recognition as an important computer vision problem, motion analysis via dense trajectories has shown superior performance [87, 114, 116]. However, the moving camera often interferes with the human motion, thus it is desired to compensate for camera motion. Note that a related problem is video stabilization, which aims to remove unwanted camera motion, while GMC removes both intentional and unwanted camera motion [24].

Normally, GMC estimates the homography transformation between two consecutive frames by matching keypoints on the frames, and maps the second frame to a global coordinate. To remedy outliers in keypoint matches, robust techniques are proposed for homography estimation, e.g., RANSAC [35], by assuming that the number of outliers to the correct homography is much less than that of inliers. However, in the presence of predominant foreground, i.e., moving objects and people, a larger proportion of the putative matches are mismatches. Predominant foreground may result from a higher percentage of coverage by foreground pixels, or occlusion, textureless and non-informative background, blurred background (e.g., camera following the foreground motion), or a combination of these reasons. In the presence of predominant foreground, the common variations of RANSAC have little chance of selecting a minimal set of background keypoints by random sub-sampling in a limited number of iterations. Despite its importance, the predominant foreground problem has been overlooked in both video stabilization and GMC algorithms. Even for algorithms designed explicitly for robustness to foreground motion [24, 30, 63], predominant foreground is reported to cause failure. Since GMC estimates the homography between consecutive frames and then uses a cascade of homographies to map the current frame to the global motion-compensated coordinate, failure in GMC at a single frame affects all the subsequent frames. This renders the predominant foreground problem very common and significant. Thus, GMC robustness is highly desirable. The GMC problem is also aggravated as the speed of foreground motion increases, e.g., in sports videos.

We qualitatively investigate 500 videos from the Sports Videos in the Wild (SVW) dataset [89], and observe 35% failure, i.e., background instability, by the baseline method of MLESAC [107], in contrast to 5.1% failure for the proposed method. This demonstrates that the robustness problem is very common and severe for real-world videos. The main contribution of this chapter is a robust GMC (RGMC) method for suppressing foreground keypoint matches and mismatches, enabling reliable homography estimation in the presence of predominant foreground and textureless background.
Also, we propose a novel and efficient probabilistic model for homography verification that considers keypoint matching error and consistency of the image edges after warping, and benefits from motion history gleaned from prior matched frames. We demonstrate the superiority of RGMC on challenging videos from three video datasets, when compared with state-of-the-art methods.

3.1 Previous Work

Due to the existence of outliers, robust techniques are widely used for homography estimation, e.g., RANSAC [35] and its variants such as Locally-Optimized RANSAC [17], MLESAC [107], and Guided-MLESAC [106]. While RANSAC aims to maximize the number of inliers, MLESAC searches for the best hypothesis that maximizes the likelihood via RANSAC, assuming that the inliers are Gaussian distributed and the outliers are distributed randomly. To handle the same outlier issue, [57] directly rejects unreliable keypoint matches. However, in the case of predominant foreground, problematic matches from the foreground are not unreliable in terms of appearance.

Recent works focus on estimating the best or multiple homographies in the case of a multi-plane background [6, 69, 105, 109, 133]. For instance, Uemura et al. [109] segment each frame, using the color MeanShift algorithm, into multiple regions denoting different planes in the background and find the dominant plane for homography estimation. Using RANSAC and based on the number of inliers for the estimated homography of each region, the dominant background plane is found and used for the final homography estimation. In contrast, we segment the frame into foreground and background regions by analyzing motion vector clusters, and remove the foreground for robust GMC.

Many works concentrate on rejecting mismatches from point correspondences, relying on the assumption that the similarity of the mismatched keypoints is not sufficient. For instance, in [57], keypoint mapping functions from frame I to I′ and in reverse, denoted as f and f′, are learned. If a point mapped according to f and then f′ is actually mapped back to its original coordinates, then the associated keypoint is considered a good match. However, in the case of a moving foreground, matches can be reliable in terms of appearance similarity, but considered mismatches due to inconsistency with the background transformation. Yan et al. [124] propose a probabilistic framework to combine keypoint matching and appearance similarity to enhance estimation robustness. To model the latter, the correlation coefficient between pixels is used. Despite the improved estimation accuracy, for textureless background the performance deteriorates. For large foreground, [124] tends to remove the foreground, instead of the background, motion. In contrast, we use edge matching as an appearance similarity measure with a higher sensitivity and lower computational cost. Our motion history-based foreground suppression minimizes the interference of the foreground with homography estimation. Also, we use the motion history to reduce the tendency of compensating the foreground motion.

If camera motion is modeled as 2D translation, simpler methods can be used for GMC. In [14], video stabilization is conducted using the cross-correlation between horizontal and vertical projections of the consecutive frames, by assuming that the largest variation between frames is due to 2D translation. [24] uses the same idea to estimate 2D translation. To improve the robustness to moving foreground, a RANSAC-like approach on projections of bands of the image is utilized.
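To make the projection-profile idea concrete, the sketch below estimates an integer 2D translation by cross-correlating the row and column intensity sums of two consecutive grayscale frames. It is an illustrative stand-in rather than the implementation of [14] or [24], and the band-wise RANSAC-like voting of [24] is omitted.

import numpy as np

def translation_from_projections(frame_a, frame_b, max_shift=40):
    """Estimate integer (tx, ty) between two same-size grayscale frames from the
    cross-correlation of their horizontal and vertical intensity projections."""
    def best_shift(p, q):
        shifts = np.arange(-max_shift, max_shift + 1)
        scores = []
        for s in shifts:
            a = p[max(0, s): len(p) + min(0, s)]
            b = q[max(0, -s): len(q) + min(0, -s)]
            scores.append(np.dot(a, b) / len(a))   # normalize by overlap length
        return int(shifts[int(np.argmax(scores))])

    col_a, col_b = frame_a.sum(axis=0).astype(float), frame_b.sum(axis=0).astype(float)
    row_a, row_b = frame_a.sum(axis=1).astype(float), frame_b.sum(axis=1).astype(float)
    tx = best_shift(col_a, col_b)   # column profiles capture the horizontal shift
    ty = best_shift(row_a, row_b)   # row profiles capture the vertical shift
    return tx, ty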
However, [24] fails if the foreground object is too large or the background is textureless, and the simplistic model of 2D translation is easily violated in real-world videos. Thus, we design our RGMC algorithm to minimize the effect of textureless background and large foreground on homography estimation.

3.2 Proposed Method

The main objective of the Robust Global Motion Compensation (RGMC) algorithm is to be robust to the presence of predominant foreground. Thus, it is critical to suppress the foreground and rely on keypoint matches of the background for global motion estimation. We perform foreground suppression by clustering motion vectors computed from keypoint matches and identifying potential clusters corresponding to the background, which are merged to provide a set of background keypoints for the final homography estimation. As a key enabler for RGMC, a novel and reliable homography verification model is presented to consider keypoint matching error and consistency of the edges of the images after transformation, and to benefit from motion history gleaned from previous frames. Fig. 3.1 shows the flowchart of the RGMC algorithm, with details presented in the following two subsections.

Figure 3.1: RGMC algorithm flowchart: (a) color indicates various motion vector clusters, (b) the merged cluster of background, (c) the motion history, and (d) the motion compensated video.

3.2.1 Foreground Suppression

We use the SURF [10] algorithm for keypoint detection and description. To detect sufficient background keypoints, the Fast-Hessian keypoint detection threshold, τs, is decreased drastically. This helps in the cases of nearly uniform and textureless background, or blurred background due to rapid camera motion (e.g., videos shot by smartphones). However, this also implies that more keypoints will reside on the foreground, which calls for an effective foreground suppression.

Cluster analysis. For foreground suppression, the motion vectors resulting from keypoint matches between consecutive frames are clustered. Since motion vectors on the background result from camera motion and are more consistent than foreground motion vectors, clustering will likely lead to some candidate regions from the background (see Fig. 3.1 (a)). Each cluster is analyzed separately by random subsampling of matches in that cluster and evaluating the resultant homography against the cost function, discussed in Sec. 3.2.2.
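The clustering step can be sketched as follows: each keypoint match yields a motion vector, and the vectors are grouped into K clusters. k-means (via scikit-learn) is used here purely as an illustrative stand-in for the cluster analysis, and the function name is hypothetical.

import numpy as np
from sklearn.cluster import KMeans

def cluster_motion_vectors(pts_prev, pts_curr, k=10):
    """Group keypoint matches by their motion vectors.

    pts_prev, pts_curr: (N, 2) matched keypoint coordinates in frames t-1 and t.
    Returns one cluster label per match; background matches, driven by camera
    motion, tend to concentrate in a few consistent clusters, while foreground
    matches and mismatches scatter across the remaining ones.
    """
    motion = (pts_curr - pts_prev).astype(float)       # one motion vector per match
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(motion)
    return labels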
Merging background clusters. Due to the zooming or motion corresponding to different planes of the background, and not knowing the optimal number of clusters a priori, we allow an over-clustering of K clusters. Thus, background motion vectors may be assigned to multiple clusters. To merge background clusters, based on the estimated homography and cost function value (CFV) of each cluster, a subset of the best clusters is selected to be merged in a greedy algorithm (Fig. 3.1(b)). Prior to merging, the set of keypoints belonging to each cluster is regularized by randomly selecting a maximum of C pairs for each cluster. Given that the keypoint matches in a background cluster are similar, the regularization has negligible impact on the RGMC accuracy, but remedies the case when part of the foreground (generally with a higher number of matches) is mistakenly merged into the background clusters.

Error handling. For GMC applications such as video stitching or pre-processing for motion analysis, failed compensation and homography estimation for a single frame deteriorates the overall performance drastically. Since the context in consecutive frames is similar, we utilize the historical values of the cost function to assist the error handling. If the minimum CFV of homography estimation at the current frame pair is significantly higher than those of previous pairs, we repeat the estimation process with the hope that the randomness in the algorithm will recover the error.

Note that the significance of foreground suppression would be more obvious when plenty of keypoints belong to the foreground, while a few belong to the background. For instance, if the foreground has 200 keypoints and the background has 10, a RANSAC-like algorithm needs to run 450,000 iterations to ensure a 90% probability of selecting a quadruplet of keypoints from the background. However, by analyzing each cluster separately, RGMC efficiently focuses on background matches. Algorithm 1 summarizes the proposed RGMC algorithm. Details of the homography verification model used in the algorithm will be presented next.

Algorithm 1: Robust Global Motion Compensation
Data: Frames It and It−1 and keypoint matches D, prior homography θt−1, and CFVs f(θt−1) and f(θt−2)
Result: Estimated homography θt and motion history Mt
1: Compute the set of motion vectors V from D;
2: repeat
3:   Cluster D into Di (i ∈ {1, ..., K}) based on V; set fi = ∞;
4:   for i = 1 to K do
5:     while Number of iterations < TC do
6:       Randomly select four matching keypoints Q from Di;
7:       if H(Q) > pH,0.9 then
8:         Find homography θ̂t;
9:         if at least λ% of the keypoints in Di are inliers for θ̂t then
10:          Calculate the cost function f̂ via Eqn. 3.10;
11:          fi ← min(f̂, fi).
12:  Regularize Di to D̄i by randomly selecting a maximum of C matches for each cluster;
13:  Sort the fi’s in ascending order and find the sorting index j(i); set mi = ∞ (i ∈ {0, ..., K}), i = 0;
14:  repeat
15:    i ← i + 1 and merge the top i clusters: Mi = ∪_{k=j(1)}^{j(i)} D̄k;
16:    while Number of iterations < TM do
17:      Randomly select four matching keypoints Q from Mi;
18:      if H(Q) > pH,0.9 then
19:        Find homography θ̂ and calculate the cost function f̂ via Eqn. 3.10;
20:        if f̂ < mi, then θi ← θ̂ and mi ← f̂.
21:  until mi > mi−1 ∧ i < K;
22:  θt = θi−1, f(θt) = mi−1;
23: until f(θt) < η(f(θt−1) + f(θt−2))/2 ∨ Number of iterations < TE;
24: Update the motion history via Eqn. 3.6 and output θt and f(θt).
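The 450,000-iteration figure quoted above follows from the standard RANSAC iteration bound; the following few lines verify it under the stated 10-background/200-foreground split (a worked check, not part of the RGMC implementation).

import math

# Probability that one random quadruplet is drawn entirely from the background
# when only 10 of the 210 keypoints lie on the background.
p_bg_quad = (10 / 210) ** 4

# Iterations N so that at least one all-background quadruplet is drawn with
# 90% confidence: 1 - (1 - p)^N >= 0.9  =>  N >= log(0.1) / log(1 - p).
n_iters = math.log(1 - 0.9) / math.log(1 - p_bg_quad)
print(round(n_iters))   # ~448,000, i.e., on the order of the 450,000 quoted above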
3.2.2 Homography Verification Model

To evaluate the estimated homography from a quadruplet of keypoint matches, we derive a cost function that unifies the keypoint matching score, the edge matching score, and the information from compensating previous frames. Denote the matching frames as It−1 and It, their candidate homography as θt, and the set of keypoint matches under study as D. In a Bayesian framework, similar to [124], θt can be estimated by maximizing

p(\theta_t | D, I_t, I_{t-1}, \theta_{t-1}) = \frac{p(D, I_t, I_{t-1} | \theta_t, \theta_{t-1}) \, p(\theta_t | \theta_{t-1})}{p(D, I_t, I_{t-1} | \theta_{t-1})},   (3.1)

where θt−1 is the obtained prior homography of frames It−1 and It−2. The term p(θt|θt−1) is the conditional probability of θt given the prior homography θt−1. The denominator of Eqn. 3.1 is constant w.r.t. θt. By expanding the likelihood term, the homography can be verified using

p(\theta_t | D, I_t, I_{t-1}, \theta_{t-1}) \propto p(D | I_t, I_{t-1}, \theta_t, \theta_{t-1}) \, p(I_t, I_{t-1} | \theta_t, \theta_{t-1}) \, p(\theta_t | \theta_{t-1}).   (3.2)

The term p(D|It, It−1, θt, θt−1) = p(D|It, It−1, θt) represents how well the keypoint matches D extracted from It and It−1 are matched by θt. Knowing that It is independent of θt−1, the term p(It, It−1|θt, θt−1) = p(It, It−1|θt) reflects how well the frame It transformed under θt, denoted as It|θt, matches It−1. Thus, the homography is estimated by minimizing

\theta_t^* = \arg\min_{\theta_t} \left[ -\ln p(D | I_t, I_{t-1}, \theta_t) - \ln p(I_t, I_{t-1} | \theta_t) - \ln p(\theta_t | \theta_{t-1}) \right].   (3.3)

Keypoint matching error. Based on the analysis of Yan et al. [124], the keypoint matching error for inliers, p(Din|It, It−1, θt), is better represented by a Laplacian model than the conventional Gaussian model. Denote (x_R^i, y_R^i) and (x_T^i, y_T^i) as the i-th matching keypoint coordinates of It and It−1, respectively, the transformation of (x_R^i, y_R^i) under θt as (x_RT^i, y_RT^i), the transformation of (x_T^i, y_T^i) under θt^{-1} as (x_TR^i, y_TR^i), and d_i ← |x_TR^i − x_R^i| + |y_TR^i − y_R^i| + |x_RT^i − x_T^i| + |y_RT^i − y_T^i|. We use the same method as [124] to compute the keypoint matching error,

p(D_{in} | I_t, I_{t-1}, \theta_t) = \prod_{i=1}^{N_{in}} \frac{1}{16 b^4} e^{-\frac{d_i}{b}},   (3.4)

where Nin is the number of inliers and the scale b is the Laplacian distribution parameter. Denoting γi as an indicator variable for inlier/outlier and considering that an outlier has a uniform distribution over the entire area of the frame, which is denoted as S, we have

p(D | I_t, I_{t-1}, \theta_t) = \prod_{i=1}^{|D|} \left[ \gamma_i \frac{1}{16 b^4} e^{-\frac{d_i}{b}} + (1 - \gamma_i) \frac{1}{S^2} \right].   (3.5)

Figure 3.2: (a) Motion history Mt, (b) mask M = I(Mt > τ), (c) edge matching for an accurate θt that matches the background, (d) edge matching for an inaccurate θt that matches the foreground.

Appearance consistency. The appearance consistency under the θt transformation, p(It, It−1|θt), is normally computed via pixel-based correlation [124]. We propose edge-based matching for multiple reasons. First, the pixel-based matching score is not sensitive enough for textureless background, e.g., a homography with an error of a few pixels of displacement leads to scores similar to a perfect match. In contrast, the tolerance for error is much lower when matching the edges, which results in more accurate homography models. Although low-texture images produce few and generally noisy edge pixels, our experiments show that edge matching outperforms pixel-based correlation, even in low-texture conditions, similar to the results reported in [119]. Second, when stitching video frames based on global motion compensation, errors typically occur as mismatched edges at the boundary of the frames. These errors are very distracting for viewers’ visual perception, and they are more likely to be remedied by edge-based appearance matching. Finally, in pixel matching, time-consuming image warping is needed for computing It|θt. Edge matching only needs to warp the edge pixels in It, leading to a typical 10× speed-up over pixel matching.

To assure that the edge matching score reflects how well the backgrounds, not the foregrounds, of the two frames match, we iteratively update a motion history Mt (see Fig. 3.1 (c)) as

M_t \leftarrow \alpha M_{t-1} + (1 - \alpha) |I_{t-1} - I_{t|\theta_t}|,   (3.6)

where α is a weighting scalar between 0 and 1, and |·| denotes the element-wise absolute value operator.
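A small NumPy/OpenCV sketch of this bookkeeping, warping the current frame under a candidate homography, applying the update of Eqn. 3.6, and thresholding the result into the mask M used below, is given here; the normalization of the frame difference to [0, 1] and the function name are illustrative assumptions.

import cv2
import numpy as np

def update_motion_history(M_prev, frame_prev, frame_curr, H_cand, alpha=0.5, tau=0.5):
    """One update of Eqn. 3.6 plus the mask M = I(M_t > tau).

    M_prev: previous motion history (float array, assumed scaled to [0, 1]).
    H_cand: candidate homography mapping frame_curr into frame_prev's coordinates.
    Returns the updated motion history and the binary mask of recently moving pixels.
    """
    h, w = frame_prev.shape[:2]
    warped = cv2.warpPerspective(frame_curr, H_cand, (w, h))              # I_{t|theta_t}
    diff = np.abs(frame_prev.astype(np.float32) - warped.astype(np.float32)) / 255.0
    M_curr = alpha * M_prev + (1.0 - alpha) * diff                        # Eqn. 3.6
    mask = M_curr > tau                                                   # M = I(M_t > tau)
    return M_curr, mask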
We define the edge matching score (EMS) as, E(I1 , I2 , R) = where Φ is edge detection operator, 2 Φ(I1 ) Φ(I2 ) R 1 , Φ(I1 ) R 1 + Φ(I2 ) R 1 + c (3.7) is element-wise multiplication, R denotes the mask specify- ing the region of interest for EMS calculation, · 1 computes the L1 matrix norm, and c(= 0.001) is a constant to avoid division by zero. E(I1 , I2 , R) ranges between 0 and 1 with 1 representing a perfect match. In Eqn. 3.3, we use E(It−1 , It|θt , M), where M = I(Mt > τ) is obtained by thresholding the motion history and I(·) is an indicator function. Fig. 3.2 shows a motion history and edge matching results for two candidate θt ’s. We will later discuss how the probability model for E is obtained. Conditional homography distribution Based on our experiments, and also prior work [24] on YouTube Action Dataset [62], the largest variation between consecutive video frames is due to 2D translation. Thus, to utilize the prior information of θt−1 for a stable homography estimation, we 22 decompose the homography model into translation, scale, and rotation models [112]. Denote the absolute difference in components of θt and θt−1 after decomposition as tx and ty for translation, ∆s for scale and ∆α for rotation angle. Assuming independence among components, we define p(θt |θt−1 ) = p(tx )p(ty )p(∆s)p(∆α). (3.8) Quadruplet filtering RGMC evaluates a large number of quadruplets of keypoint matches, and computes their EMS. To improve the efficiency, we filter the candidate quadruplets before the optimization of Eqn. 3.3. Intuitively, if the keypoint in the quadruplet are spatially close to each other, it is less likely to have an accurate estimate of θt , because homography estimation is more sensitive to the accuracy of keypoint locations. Also, background keypoints have generally a higher spatial dispersion than the foreground keypoints. Thus, only if the entropy (or dispersion) of a candidate quadruplet is above a threshold, we fully evaluate the cost function. Specifically, we use m-spacing estimate of entropy [54], similar to [24], as H= 1 n−m n ∑ ln( m (xi+m − xi)), n i=1 (3.9) where m is the spacing parameter (set to 1) and n is number of points. We first sort the x values prior to using them in Eqn. 3.9. Entropy estimates of x and y coordinates of the quadruplet are calculated separately and the minimum of them is the entropy of the quadruplet. Model training Having presented the Bayesian framework, we now introduce our empirical approach to learn the various probability models. For this learning, we manually stitch 250 pairs of consecutive frames to find the best homography estimate. The labeler uses our developed GUI to match four background keypoints in two consecutive frames and fine tune the matches to visually minimize the background stitching error. The labeler also specifies a foreground mask, represent23 (a) (b) (c) (d) Figure 3.3: (a-b) Two consecutive frames and the matched quadruplet by the labeler, (c) the absolute difference of two frames matched via the quadruplet in (a,b), (d) manually labeled foreground mask. ing the region resulted from foreground movement. Fig. 3.3 shows two consecutive video frames and the manually matched quadruplets. From the manually labeled sequences, we find the empirical distribution of E, tx , ty , ∆s, ∆α, and H. As shown in Fig. 3.4, E, ∆s, ∆α, and H are well approximated by a normal distribution. For H distribution, 10% percentile (pH,0.9 ), reflecting the value that 90% of observed point entropies are larger than, is also shown. 
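Two of the quantities introduced above can also be stated compactly in code: the edge matching score of Eqn. 3.7 and the m-spacing entropy estimate of Eqn. 3.9 used for quadruplet filtering. The Python/NumPy sketch below is illustrative only; in particular, the choice of the Canny detector for the edge operator Φ, its thresholds, and all variable names are assumptions rather than the reference implementation.

import cv2
import numpy as np

def edge_matching_score(I1, I2, R, c=0.001):
    # EMS of Eqn. 3.7, in [0, 1]; I1, I2 are grayscale frames in [0, 1], R is a binary mask
    E1 = (cv2.Canny((I1 * 255).astype(np.uint8), 50, 150) > 0).astype(np.float64)
    E2 = (cv2.Canny((I2 * 255).astype(np.uint8), 50, 150) > 0).astype(np.float64)
    return 2.0 * np.sum(E1 * E2 * R) / (np.sum(E1 * R) + np.sum(E2 * R) + c)

def m_spacing_entropy(values, m=1):
    # H = (1/n) * sum_{i=1}^{n-m} ln((n/m) * (x_{i+m} - x_i)), with the values sorted first
    x = np.sort(np.asarray(values, dtype=np.float64))
    n = len(x)
    spacings = (n / m) * (x[m:] - x[:-m])
    return np.sum(np.log(spacings + 1e-12)) / n   # small constant guards against log(0)

def quadruplet_entropy(xs, ys):
    # Entropy of a candidate quadruplet: the minimum over its x and y coordinates
    return min(m_spacing_entropy(xs), m_spacing_entropy(ys))

A candidate quadruplet is fully evaluated only if quadruplet_entropy exceeds the threshold pH,0.9 read off the fitted distribution of H.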
For tx and ty , Laplacian distribution is more appropriate. By plugging the probability models to Eqn. 3.3 and ignoring the constants, the final cost function is, N f (θt ) = in ∑i=1 2 out (E(It−1 , It|θ , M) − µE )2 + ∑N i=1 ln(S ) + + Nin + Nout 2σE2 di b (∆s − µ∆s )2 2 2σ∆s (∆α − µ∆α )2 T+ 2 2σ∆α 24 T + |∆tx − µ∆tx | btx T + |∆ty − µ∆ty | bty T, (3.10) 2.5 1.5 1 8 60 6 40 0 0 0.5 E 1 −0.02 0 ∆s 0 0.02 0.3 −0.2 0 ∆α 0.2 0.8 0.4 0.2 p(H) p(ty) p(tx) 4 2 20 0.5 0 80 p(∆ α) p(∆s) p(E) 2 0.6 0.4 0.2 0.1 0.2 0 −10 0 t 10 0 −5 0 t y x 5 0 1 2 3 4 H Figure 3.4: Empirical and fitted distributions for (a) E ∼ N(0.52, 0.04), (b) ∆s ∼ N(0, 2×10−5 ), (c) ∆α ∼ N(0, 2×10−3 ), (d) tx ∼ Laplace(0, 1.50), (e) ty ∼ Laplace(0, 0.95), and (f) H ∼ N(2.1, 0.25) . where Nout is number of outliers and x T = min(x, T ) restricts the impact of prior information. Since keypoint matching error is dependent on the number of keypoints, we normalize it with the total number of keypoints. The homography θt is estimated by θt∗ = arg min( f (θt )). 3.3 (3.11) Experimental Results This section presents the experimental results of RGMC, and its comparison with our implementations of the RANSAC variation called MLESAC [107] and the HEASK method [124]. 3.3.1 Dataset We select 50 videos from SVW dataset [89], where 24 videos are used for model learning in Sec. 3.2.2, and the rest for testing. SVW contains videos of amateurs practicing a sport, shot 25 S5 S1 S2 S3 S4 S6 S14 S7 S8 S15 S13 S16 S18 S20 M1 S10 S11 S22 S21 S23 S24 M2 S12 M5 M6 S25 M9 M10 M13 M14 S26 (a) M3 M7 M11 M4 M8 M12 M15 (b) H1 H2 H3 H4 H5 H6 H7 H8 H9 H10 (c) Figure 3.5: Sample frames of the test videos in (a) SVW, (b) HMDB51, and (c) Holleywood2 datasets. using smartphone by ordinary people. Thus, highly unconstrained SVW is an excellent example of user-generated videos with predominant foreground of humans. We also use 10 videos from Holleywood2 [72] and 15 videos from HMDB51 [49] datasets1 . In total, 51 videos are used for quantitative evaluation with sample frames shown in Fig. 3.5. 3.3.2 2 Parameters In all the experiments, we have the same fixed parameter setting, i.e., τ = 0.5, C = 50, TC = 50, TM = 100, TE = 2, K = 10, η = 1.5, α = 0.5, λ = 70%, and T = 100. Our experiments show that RGMC is robust to variation of parameters. The most important parameter is K. Large values of K increase the computational cost. On the other hand, K should be large enough so that foreground, background, and erroneous matches are mapped to different clusters. As a trade-off, we use K = 10. 1 For these two datasets, videos are temporally trimmed around the signature motion in the video, practically disabling effect of our motion history module. In HMDB51, similar to many existing datasets such as UCF101 [100], the video resolution is only 320 × 240, thus GMC suffers from both video content and the low resolution. 2 In HMDB51 [49] and UCF101 [100] datasets, only very low resolution videos (320 ∗ 240) are publicly available for which GMC basically suffers considerably from the resolution and in UCF-Sports [102], camera is static. Thus, we limit our qualitative evaluation to SVW and available motion-compensated videos in HMDB51 dataset. 26 Algorithm Ground Truth MLESAC HEASK RGMC Setting – DT LT DT LT (20, 50) (50, 100) (100, 200) D-M D-E BRE (×10−3 ) 7.59 15.65 18.59 17.33 14.24 11.77 10.11 10.02 11.60 11.25 Table 3.1: Impact of different settings on average BRE for each algorithm. 
DT and LT denote default (τs = 1000) and lowered (τs = 100) detection threshold in SURF algorithm, respectively. For RGMC τs = 100 is used and 3 different setting of (TC , TM ) are reported. D-M and D-E denote default setting of (TC , TM ) = (50, 100) with motion history and error handling turned off, respectively. 3.3.3 Evaluation metric For accuracy evaluation, we have manually matched a quadruplet of keypoints and found the ground truth homography θ0 for a total of 350 pairs of consecutive frames in challenging periods in 51 test videos. The same GUI described in Sec. 3.2.2 is used to obtain θ0 and the foreground mask. We denote the intersection of the complement of this mask, i.e., the background mask, and the region covered by It|θ0 , as B. We quantify the consistency of frames It and It−1|θ (grayscale frames with pixels ranging between 0 and 1) using the background region error (BRE), ε= 1 B 3.3.4 1 |(It−1 − It|θt )| B 1. Accuracy assessment Table 4.1 represents the average BRE on test videos for different algorithms. Due to random nature of algorithms, we repeat each experiment 5 times and report the average performance. To ensure that comparisons are fair, we decrease the keypoint detection thresholds also for baseline methods. HEASK has better performance with lowered threshold and thus we use this setting for the experiments. We also report results for different iteration numbers TC and TM for RGMC and as a trade-off between accuracy and efficiency, select (TC , TM ) = (50, 100) as default values for RGMC. In addition, we turn off the modules of Motion History and Error Handling in RGMC alternatively, to verify that their existence is helpful. Fig. 3.6 shows two consecutive frames of 27 three sample videos matched by different algorithms, along with the ground truth matching. As shown, RGMC produces very accurate background matching. Fig. 3.7 represents the average pervideo BRE, sorted by the BRE of ground truth matching. As shown, RGMC performance is very robust and in most videos RGMC matching error is very close to the ground truth value. Finally, Fig. 3.8 compares stitching results on a sample video using different algorithms. It is worth noting that since a cascade of homographies are used for GMC and stitching of video frames, propagation of errors of matching consecutive frames, gives rise to inaccuracy as the length of the input video increases. Also, coexistence of textureless background and large foreground (in terms of the total number of pixels covered by the foreground), may cause failure in the RGMC algorithm, especially if the foreground motion exists starting the initial frames. 3.3.5 Computational cost For the comparison with baseline methods, we test Matlab implementations of algorithms on a PC with Intel i5-3470@2GHz CPU. The average time for matching frame pair of size 720 × 1, 280 (480 × 854) by MLESAC, HEASK, and RGMC is 2.0 (0.3), 53.1 (21.3), and 4.3 (2.3) seconds, respectively. We also have a C++ implementation of RGMC using the OpenCV libraries, which takes 1.4 (0.7) seconds for matching frame pair of size 720 × 1, 280 (480 × 854)3 . 3.3.6 Qualitative evaluation In addition to the aforementioned quantitative study, we also perform the qualitative evaluation on unlabeled videos to demonstrate the severity of the predominant foreground issue in real-world videos, and the superiority of RGMC on a large scale dataset. 
For each video, we run a GMC algorithm, visually observe the motion-compensated videos, and claim a f ailure if an instable background is observed (e.g., Fig. 3.8 (a,b)). We observe a failure rate of 32% by the MLESAC method 3 Source code is available at http://www.cse.msu.edu/∼liuxm/RGMC 28 (a) (b) (c) (d) Figure 3.6: Each row shows GMC results of two consecutive frames from video ID S17, S19, and S9 by (a) manual labeling, (b) MLESAC, (c) HEASK, and (d) RGMC. In (a), colorful pixels show the pixels that are different between overlaid frames. In (b-d), the pixel brightness indicates the difference. 29 0.06 ε 0.04 GT MLESAC HEASK RGMC 0.02 S26 S25 M12 S24 S22 M11 H2 S23 M4 S21 S20 S17 S19 S16 M13 S18 S14 M2 S12 S13 S9 S11 S10 M3 M10 S7 M1 S8 S15 S6 H3 M14 H6 M5 H9 M15 S5 H5 S4 M8 M7 S2 H10 S3 M9 H1 M6 H8 S1 H7 H4 0 Video sequence ID (ordered by increasing GT error) Figure 3.7: Per-video BRE using the optimal setting for each algorithm compared with ground truth (GT) matching BRE. (a) (b) (c) Figure 3.8: A 40-frame sequence of gymnastics backflips in textureless background stitched using (a) MLESAC, (b) HEASK, and (c) RGMC. Consistency of the background shows the superiority of RGMC. For HEASK, stitching up to frame #10 is shown, after which the stitching drastically fails. among 225 videos from three categories of cartwheel, dive and dribble in HMDB51 dataset. Further, a 35% failure rate by MLESAC is observed from 500 videos of SVW dataset; in contrast on the same data our RGMC has merely a 5% failure rate. 30 3.4 Conclusions We presented a robust global motion compensation (RGMC) algorithm that delivers reliable results in the presence of predominant foreground and textureless or blurry background, enabling its application to real-world unconstrained videos. By foreground suppression, RGMC is able to tolerate large foreground and occlusion. Also, the proposed method successfully handles keypoint matching with a very low matching threshold, required for GMC in low texture background. This is achieved by clustering motion vectors, and analyzing each cluster to identify matches pertaining to the background. A novel homography verification model is proposed to support the RGMC. Extensive experiments and comparison with manually matched ground truth and baseline methods demonstrate the superiority of RGMC. 31 Chapter 4 Temporally Robust Global Motion Compensation As discussed in Chapter 3, Global motion compensation (GMC) removes the impact of intentional and unwanted camera motion in the video, transforming the video to have static background with the only motion coming from foreground objects. Video stabilization is a closely related problem where unwanted camera motion, such as vibration, is removed, leaving a smooth camera motion in the output video. It is important to note that the final product of GMC is a video with static background throughout the entire video. This sets a high bar on accuracy requirement for estimation of transformations to the global coordinate, despite foreground motion and appearance ambiguities. GMC can be re-purposed for video stabilization (VS) and mosaicing, but not vice versa given the accuracy requirement. GMC is an essential module for processing videos from nonstationary cameras, which are abundant due to emerging mobile sensors, e.g., wearable cameras, smartphones, and camera drones. First, the resultant motion panorama [8], as if virtually generated by a static camera, is by itself appealing for visual perception. 
More importantly, many vision tasks benefit from GMC. For instance, dense trajectories [114] are shown to be superior when camera motion is compensated [117]. Otherwise, camera motion interferes with human motion, rendering the analysis problem very challenging. Accurate and consistent GMC allows reconstruction of a “stitched" background [74], and subsequently segmentation of foreground [103, 113]. This helps 32 Sequential y GMC t TRGMC y t Figure 4.1: Schematic diagrams of proposed TRGMC and existing sequential GMC algorithms, and resultant motion panorama for a video shot by panning the camera up and down. Background continuity breaks easily in the case of the sequential GMC [88]. multi-object tracking by mitigating the unconstrained problem of tracking multiple in-the-wild objects, to tracking objects with a static background [99]. In existing GMC work [9, 24, 88], frames are transformed to a global motion-compensated coordinate (GMCC), by sequentially processing input frames. For a pair of consecutive frames, the mapping transformation is estimated, and by accumulating the transformations, a composite global transformation of each frame to the GMCC is obtained. However, the sequential processing scheme causes frequent GMC failures for multiple reasons: 1) Sequential GMC is only as strong as the weakest pair of consecutive frames. A single frame with high blur or dominant foreground motion can cause the rest of the video to fail. 2) Generally, multiple planes exist in the scene. The common assumption of a single homography will accumulate residual errors into remarkable errors. 3) Even if the error of consecutive frames is in a sub-pixel scale, due to the multiplication of several homography matrices, the error can be significant over time [74]. These problems are especially severe when processing long videos and/or the camera motion becomes more complicated. E.g., 33 when the camera pans to left and right repeatedly, or severe camera vibration exists, the GMC error is obvious by exhibiting discontinuity on the background (see Fig. 4.1 for an example). Although RGMC algorithm discussed in Chapter 3 improves GMC robustness and considerably decreases rate of drastic failures, still accumulation of error degrades RGMC performance. This degradation is more obvious when video length increases and camera motion is more complicated. To address the issues of sequential GMC, we propose a temporally robust global motion compensation (TRGMC) algorithm which by joint alignment of input frames, estimates accurate and temporally consistent transformations to GMCC. The result can be rendered as a motion panorama that maintains perceptual realism despite complicated camera motion (Fig. 4.1). TRGMC densely connects pairs of frames, by matching local keypoints. Joint alignment (a.k.a. congealing) of these frames is formulated as an optimization problem where the transformation of each frame is updated iteratively, such that for each link interconnecting a keypoint pair, the spatial coordinates of two end points are identical. This novel keypoint-based congealing, built upon succinct keypoint coordinates instead of high-dimensional appearance features, is the core of TRGMC. Joint alignment not only leads to the temporal consistency of GMC, but also improves GMC stability by using redundancy of the information. The improved stability is crucial for GMC, especially in the presence of considerable foreground motion, motion blur, non-rigid motion like water, or lowtexture background. 
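For reference, the sequential scheme described above amounts to composing the per-pair homographies into a composite transformation for every frame, so any error in a single pair enters all later composites. A minimal sketch (hypothetical Python/NumPy code, not the implementation of any of the cited methods):

import numpy as np

def compose_to_global(pairwise_H):
    # pairwise_H[t] is a 3x3 homography mapping frame t+1 onto frame t (hypothetical input)
    G = [np.eye(3)]                      # frame 0 defines the global coordinate
    for H in pairwise_H:
        G.append(G[-1] @ H)              # G_{t+1} = G_t * H_t: errors accumulate multiplicatively
    return [M / M[2, 2] for M in G]      # normalize the scale of each composite homography

Even sub-pixel errors in the individual factors compound through this product, which is exactly the drift that the joint alignment of TRGMC is designed to avoid.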
The joint alignment scheme also provides capabilities such as coarse-to-fine alignment, i.e., alignment of the keyframes followed by non-keyframes, and appropriate weighting of keypoints matches, which cannot be naturally integrated in sequential GMC. Our quantitative experiments reveal that TRGMC pushes the alignment error close to human performance. In summary, this chapter makes the following contributions: • An algorithm for joint alignment of video frames is proposed to produce a globally motion compensated video where, despite the complicated camera movement and considerable 34 foreground motion, the background appears to be static over the progression of time. • A keypoint-based congealing algorithm aligns the spatial coordinates of keypoints for an image stack. It extends congealing applications from spatially cropped objects (faces and letters) to complex motion-rich video frames. • The capabilities and applications of TGRMC are demonstrated. Our collected video dataset, manual labels, and the code will be publicly available. 4.1 Previous Work TRGMC is related to many techniques in different aspects. We first review them and then compare our work with existing GMC algorithms. Firstly, homography estimation from keypoint matches is crucial to many vision tasks, e.g., image stitching, registration, and GMC. A main challenge of homography estimation from keypoint matches is the false matches due to appearance ambiguities. Robust methods are proposed to handle the outliers, such as RANSAC [35] and its variants [17, 106, 107]. Some methods also directly reject false matches [57, 70]. The hybrid methods [88, 124] combines appearance similarity and keypoint matches in a probabilistic framework. All methods estimate a homography for a frame pair. In contrast, in TRGMC, instead of direct calculation of homography transformation for each pair of frame, we jointly optimize the set of homographies which map the set of input frames into a global coordinate, such that the keypoints over a wide range of temporal distance are aligned well. Thus, TRGMC leverages the redundant background matches over time to better handle outliers. Image stitching (IS) and panoramic image mosaicing share similarity with GMC. IS aims to minimize the distortions and ghosting artifact in the overlap region. Many works utilize multiple homographies, instead of a single homography, due to existence of multiple scene planes [36, 35 69, 104, 105, 109, 133]. Some recent works focus also on the parallax issue, by using a hybrid model that uses homograhy for non-overlapping parallax-free regions and allow some local non-projective deviation to account for parallax and avoid stitching artifacts [59, 60, 127]. Li et allet@tokeneonedot [58] generate panoramas from motion-blurred videos. In these works, input images have much less overlap than GMC. On the other hand, video mosaicing takes in a video which raster scans a wide angle static scene, and produces a single static panoramic image [90, 92, 97]. When the camera path forms a 2D scan [92] or a 360◦ rotation [90], global refinement is performed via bundle adjustment (BA) [108], which ensures an artifact-free panoramic image, assuming a static scene. Although a byproduct of TRGMC is a similar static reconstruction of the scene, TRGMC focuses on efficient generation of an appealing video, where background consistently appears static for visual perception (in contrast to an image), for a highly dynamic scene. 
The important feature of such a video is that the only apparent motion in the video will rise from foreground motion. While one may use BA to estimate camera pose and then transformation between frames, our experiments reveal that BA is not reliable for videos with foreground motion and is less efficient than TRGMC. Further, BA estimates 3D location of keypoints while TRGMC needs 2D registration. Thus, by using BA, a harder problem needs to be solved which is unnecessary for the purpose of global motion compensation. Hence, image/video mosaicing and GMC have different application scenarios and challenges. Another related topic is the panoramic video [31, 43, 44, 77, 128]. For instance, Perazzi et allet@tokeneonedot [77] create a panoramic video from an array of stationary cameras by generalizing parallax-tolerant image stitching to video stitching. The fast video stitching in [128] can handle proper stitching of objects at varying depths. Jiang and Gu [44] propose an algorithm for stitching multiple video streams into a single panoramic video with spatial-temporal contentpreserving warping. In this work, for alignment of video frames, a spatial-temporal local warping 36 is proposed, which locally aligns frames from different videos while maintaining the temporal consistency. While these works focus on stitching multiple synchronized videos, GMC creates a motion panorama from a single non-stationary camera. Unlike GMC, video panoramas do not require the resultant video to have a stationary background. Video stabilization (VS) is a closely related but different problem. TRGMC can be re-purposed for VS, but not vice versa, due to the accuracy requirement. Given the accurate mapping to a global coordinate using TRGMC, VS would mainly amount to cropping out a smooth sequence of frames and handling rendering issues such as parallax. Among different categories of VS, 2D VS methods calculate consecutive warping between the frames and have similarities with sequential GMC, but any estimation error will not cause severe degradation in VS as long as it is smoothed. While TRGMC targets long-term staticness of the background, VS mainly cares about smoothing of camera motion, not removing it. In other words, TRGMC imposes a stronger constraint on the result, which is background staticness by complete camera motion removal in comparison to VS which deals with camera motion smoothing. This strict requirement differentiates TRGMC also from Re-Cinematography [38]. Also, large occlusion by the foreground may result in VS failure, however TRGMC handles this challenge by utilizing redundancy of background information in the joint alignment scheme. Congealing aims to jointly align a stack of images from one object class, e.g., faces and letters [53, 64]. Congealing iteratively updates the transformations of all images such that the entropy [53] or Sum of Squared Differences (SSD) [18] of the images, is minimized. However, despite many extensions of congealing [19, 41, 50, 68, 96], almost all prior work define the energy based on the appearance features of two images. Since congealing is based on image-based processing, it requires moderate initial alignment and is sensitive to intra-class variation and background clutter [50]. In [41], by incorporating deep learning into the congealing alignment frame37 work, a combination of unsupervised joint alignment with unsupervised feature learning is proposed. 
Through deep learning, authors obtain features that can represent the image at differing resolutions based on network depth, and that are tuned to the statistics of the specific data being aligned. In [50], a heuristic local feature based algorithm is proposed to rigidly align object class images to a seed image. Best matching local features are selected as object landmark. To overcome the problem of false matches, iteratively a minimal subset of matches are selected, homography is estimated, and image points are transformed to the seed coordinate. Using a spatial scoring algorithm, scores of the features matching are accumulated within a preset distance limit, resulting to refined landmarks. Finally, the other images are aligned to the seed using only the best landmarks. Note that [50] uses a heuristic local feature based algorithm to rigidly align object class images. In contrast we formulate the joint alignment of keypoints as an optimization problem and solve it in a principal way. Cox et allet@tokeneonedot [18] employ a sum of squared differences (SSD) cost solved by Gauss-Newton gradient decent. Unlike entropy-based congealing, the details of each image in the stack are used for alignment, rather than relying on the average image of the stack. Our experiments on GMC show that appearance-based congealing is inefficient and sensitive to initialization and foreground motion. Therefore, we propose a novel keypoint-based congealing algorithm minimizing the SSD of corresponding keypoint coordinates, instead of appearance features, to gain considerable efficiency enhancement and robustness to initialization.Further, most prior works apply to a spatially cropped object such as faces, while we deal with complex video frames with dynamic foreground and moving background, at a higher spatial-temporal resolution. There are a few existing sequential GMC works, where the main problem is to accurately estimate a homography transformation between consecutive frames, given challenges such as appearance ambiguities, multi-plane scene, and dominant foreground [8,24,88]. Bartoli et allet@tokeneonedot [9] first estimate an approximate 4-degree-of-freedom homography, and then refine it. Sakamoto et 38 . . .  t Input Frames Dense Interconnection Initialization Sec. 3.4 Pruning the Links Sec. 3.5 Iterative Updates Sec. 3.2 . . .  Average of frames Independent Alignment of Non-keyframes Average of frames Sec. 3.6 Aligned Keyframes Figure 4.2: Flowchart of the TRGMC algorithm. allet@tokeneonedot [90] generate a 360◦ panorama from an image sequence. Assuming that multiplication of all consecutive homographies results in the identity mapping, and homography has only 5 degrees of freedom, the camera rotation matrix has 3 degrees of freedom, to which are added the focal lengths before and after the camera rotation, all the homographies are optimized jointly to prevent error accumulation. In contrast, TRGMC employs an 8-degree-of-freedom homography. Although using homography in the case of considerable camera translation and large depth variation results in parallax artifacts, using a higher degrees-of-freedom homography than prior works allows TRGMC to better handle camera panning, zooming, and translation. Safdarnejad et allet@tokeneonedot [88] incorporate edge matching into a probabilistic framework that scores candidate homographies. Although [24, 88] improve the robustness to foreground, error accumulation and failure in a single frame pair still deteriorate the overall performance. 
Thus, TRGMC targets robustness of the GMC in terms of both the presence of foreground and long-term consistency by joint alignment of frames. 4.2 Proposed TRGMC Algorithm The core of TRGMC is the novel keypoint-based congealing algorithm. Our method relies on densely interconnecting the input frames, regardless of their temporal offset, by matching the detected SURF [10] keypoints at each frame. We refer to these connections, shown in Fig. 4.2, as links. Frames are initialized to their approximate spatial location by only 2D translation (Sec. 4.2.4). 39 We rectify the keypoints such that majority of the links have end points on the background region. Then the congealing applies appropriate transformation to each frame and the links connected to it, such that the spatial coordinates of the end-points of each link are as similar as possible. In Fig. 4.2, this translates to having the links as parallel to the t−axis as possible. For efficiency and robustness, TRGMC processes an input video in two stages. Stage one selects and jointly aligns a set of keyframes. The keyframes are frozen, and then stage two aligns each remaining frame to its two encompassing keyframes. The remainder of this section presents the details of the algorithm. 4.2.1 Formulation of keypoint-based congealing Given a stack of N frames {I(i) }, with indices i ∈ K = {k1 , ..., kN }, the keypoint-based congealing is formulated as an optimization problem, min ε = {θi } ∑ [ei(θi)] Ω(i)[ei(θi)], (4.1) i∈K where θi is the transformation parameter from frame i to GMCC, ei (θi ) collects the pair-wise alignment errors of frame i relative to all the other frames in the stack, and Ω(i) is a weight matrix. We define the alignment error of frame i as the SSD between the spatial coordinates of the endpoints of all links connecting frame i to the other frames, instead of the SSD of appearance [18]. Specifically, as shown in Fig. 4.3, we denote coordinates of the start and the end point of each (i) (i) (i) (i) (i) link k connecting frame i to the frame dk ∈ K\{i} as (xk , yk ) and (uk , vk ), respectively. For simplicity, we omit the frame index i in θi . Thus, the error ei (θ ) is defined as, ei (θ ) = [∆Xi (θ ) , ∆Yi (θ ) ] , 40 (4.2) Figure 4.3: The notation used in TRGMC. where (x) (y) ˜ i − v(i) , ∆Yi (θ ) = w ˜ i − u(i) , ∆Xi (θ ) = w (x) (i) (i) (4.3) (y) (i) (i) ˜ i = [Wx (xk , yk ; θ )] and w ˜ i = [Wy (xk , yk ; θ )] are the errors in x− and y− axes. The vectors w (i) (i) denote the x and y− coordinates of (xk , yk ) warped by the parameter θ , respectively. The vec(i) (i) tors u(i) = [uk ] and v(i) = [vk ] denote the coordinates of the end points. Similarly, the vectors (i) (i) x(i) = [xk ] and y(i) = [yk ] denote the coordinates of the start points. If Ni links emanate from frame i, ei is a 2Ni −dim vector. Ω(i) is a diagonal matrix of size 2Ni × 2Ni which assigns a weight to each element in ei . The parameter θ has 2, 6, or 8 elements for the cases of 2D translation, affine transformation, or homography, respectively. In this chapter, we focus on homography transformation which is a projective warp model, parameterized as, θ      (i) (i) (i) Wx (xk , yk ; θ )  p1 p2 p3  xk            (i) (i) Wy (xk , yk ; θ ) =  p4 p5 p6  y(i) . 
    k       1 p7 p8 1 1 (4.4) Although the homography model assumes the planar scene and this assumption may be violated in real world [127], we identify the problem of temporal robustness to be more fundamental for GMC than the inaccuracies due to a single homography. Also, videos for GMC are generally 41 swiped through the scene with high overlap, thus the discontinuity resulted from this assumption is minor. 4.2.2 Optimization solution Equation 6.1 is a non-linear optimization problem and difficult to minimize. Following [18], we linearize this equation by taking the first-order Taylor expansion around θ . Starting from an initial θ , the goal is to estimate ∆θ by, arg min [ei (θ ) + ∆θ ∂ ei (θ ) ∂ ei (θ ) ∆θ ] Ω(i) [ei (θ ) + ∆θ ] + γ∆θ I ∆θ , ∂θ ∂θ (4.5) where ∆θ I ∆θ is a regularization term, with a positive constant γ setting the trade-off. We observe that without this regularization, parameter estimation may lead to distortion of the frames. The indicator matrix I is a diagonal matrix specifying which elements of ∆θ need a constraint. We use I = diag([1, 1, 0, 1, 1, 0, 1, 1]) to specify that there is no constraint on the translation parameters of the homography, but the rest of parameters should remain small. By setting the first-order derivative of Eqn. 6.4 to zero, the solution for ∆θ is, ∆θ = H−1 R HR = Using the chain rule, we have ∂ ei (θ ) (i) Ω ei (θ ), ∂θ ∂ ei (θ ) (i) ∂ ei (θ ) Ω + γI . ∂θ ∂θ ∂ ei (θ ) ∂θ = ∂ ei (θ ) ∂ W ∂W ∂θ (4.6) (4.7) . Knowing that the mapping has two com- ponents as W = (Wx , Wy ), and the first half of ei only contains x components and the rest only y 42 components, we have, ∂ ei (θ ) = ∂W 1Ni 0Ni 0Ni 1Ni , (4.8) where 1Ni and 0Ni are Ni −dim vectors with all element being 1 and 0, respectively. For homography transformation, ∂W ∂θ = ∂ (Wx ,Wy ) ∂ (p1 ,p2 ,p3 ,p4 ,p5 ,p6 ,p7 ,p8 ) is given by,   (y) (x) (y) (x) (i) (i) ˜i  ˜i ˜i ˜i −u w 1Ni 0Ni 0Ni 0Ni −u w w ∂W w = . ∂θ (y) (x) (y) (x) (i) (i) ˜i ˜i ˜i ˜i −v w 1Ni −v w w 0Ni 0Ni 0Ni w (4.9) At each iteration, and for each frame i, ∆θ is calculated and the start points of all the links emanating from frame i are updated accordingly. Similarly, for all links with end points on frame i, the end point coordinates are updated. 1 We use the SURF [10] algorithm for keypoint detection with a low detection threshold, τs = 200, to ensure sufficient keypoints are detected even for low-texture backgrounds. We use the nearest-neighbor ratio method [65] to match keypoints and form links between each pair of keyframes. Keyframe selection We select keyframes at a constant step of ∆ f , i.e., from every ∆ f frames, only one is selected. Based on the experimental results, as a trade-off between accuracy and efficiency, we use ∆ f = 10 in TRGMC. 4.2.3 Weight assignment We have defined all parameters in the problem formulation, except the weights of links, Ω(i) . We consider two factors in setting Ω(i) . Firstly, the keypoints detected at larger scales are more likely to be from background matches, since they cover coarser information and larger image patches. Thus, to be robust to foreground, the early iterations should emphasize links from larger-scale 1 In algorithm implementation, it is important to store the original coordinates of the detected keypoints and apply the composite transformations accumulated in all the iterations to update the coordinates of the start and end points of the links. Otherwise, accumulation of numerical errors will harm the performance. 43 (a) . . . . . 
▪ ▪ ▪ 1-0 1-1 1-1 1-0 =4-2 (b) . . . . . ▪ ▪ ▪ 10-3 Figure 4.4: Comparison of the ratios of background-foreground matches for (a) sequential GMC and (b) TRGMC. keypoints, which forms a coarse-to-fine alignment. We normalize the scales of all keypoints such that the maximum is 1, and denote the minimum of the normalized scales of the two keypoints (i) comprising the link k as sk . Then, Ωk,k is set proportional to sk . Secondly, for each frame i, the links may be made either to all the previous frames, denoted as backward scheme, or both the previous and upcoming frames, denoted as backward-forward scheme. The former is for potential real time application, whereas the latter for offline video processing. These schemes are implemented by assigning different weights to backward and forward links, (i) Ωk,k =    q   (β .sk )r ; if dk(i) < i (Backward links) (4.10)   q   (α.sk )r ; if dk(i) > i (Forward links) where 0 < α, β < 1, q is the iteration index, and 0 < r < 1 is the rate of change of the weights. Note (i) (i) that the alignment errors in x and y−axes have the same weights, i.e., Ωk+Ni ,k+Ni = Ωk,k . After a few iterations, the weights of all the links will be restored to 1. In the backward scheme, we set α = 0. 4.2.4 Initialization Initialization speeds up the alignment and decreases the false keypoint matches. The objective is to roughly place each frame at the appropriate coordinates in the GMCC. For initialization, we 44 (a) (b) Figure 4.5: (a) The input frame, (b) the reliability map, with the red color showing higher reliability. align the frames based only on rough estimation of translation without considering rotation, skew, or scale. We use the average of the motion vectors in matching two consecutive frames as the translation. Using this simple initialization, even if the camera has in-plane rotation, estimated 2D translations are zero, which is indeed correct and does not cause any problem for TRGMC. Given the estimated translation, approximate overlap area of each pair of frames is calculated, and only the keypoints inside the overlap area are matched, reducing number of false matches due to appearance ambiguities. 4.2.5 Outlier handling Links may become outliers for two reasons: (i) the keypoints reside on foreground objects not consistent with camera motion; (ii) false links between different physical locations are caused by the low detection threshold and similar appearances. In order to prune the outliers, we assume that the motion vectors of background matches, i.e., background links, have consistent and smooth patterns, caused by camera motion such as pan, zoom, tilt, whereas, the outlier links will exhibit arbitrary pattern, inconsistent with the background pattern. Specifically, we use Ma et allet@tokeneonedot [70] method to prune outlier links by imposing a smoothness constraint on the motion vector field2 . This method outperforms RANSAC 2 We use the implementation provided by the authors and default parameters. 45 if the set of keypoint matches contains a large proportion of outliers. Since keyframes have larger relative time difference than consecutive frames, the foreground motion is accentuated and more distinguishable from camera motion. This helps with better pruning of the foreground links. At each stage that the keypoints from a pair of frames are matched to form the links, we perform the pruning. Congealing of an image stack also increases the proportion of background matches over the outliers - another way to suppress outliers. 
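To make the per-frame update of Sec. 4.2.2 concrete, the following Python/NumPy sketch computes one regularized Gauss-Newton step for a single frame in the spirit of Eqn. 4.6. It is only an illustration under stated assumptions: the standard analytic Jacobian of the projective warp is used in place of the exact form of Eqn. 4.9, the minus sign comes from setting the derivative of Eqn. 4.5 to zero, and the variable names are hypothetical.

import numpy as np

def warp_points(theta, x, y):
    # Projective warp of Eqn. 4.4; theta = (p1, ..., p8), the (3,3) entry is fixed to 1
    p1, p2, p3, p4, p5, p6, p7, p8 = theta
    d = p7 * x + p8 * y + 1.0
    return (p1 * x + p2 * y + p3) / d, (p4 * x + p5 * y + p6) / d

def gauss_newton_step(theta, x, y, u, v, w, gamma):
    # x, y: start points of the N_i links of frame i; u, v: end points (already in the GMCC)
    # w: per-link weights (the diagonal of Omega, one value per link); gamma: regularization
    p7, p8 = theta[6], theta[7]
    d = p7 * x + p8 * y + 1.0
    wx, wy = warp_points(theta, x, y)
    e = np.concatenate([wx - u, wy - v])               # residuals of Eqns. 4.2-4.3
    z = np.zeros_like(x)
    Jx = np.stack([x / d, y / d, 1.0 / d, z, z, z, -wx * x / d, -wx * y / d], axis=1)
    Jy = np.stack([z, z, z, x / d, y / d, 1.0 / d, -wy * x / d, -wy * y / d], axis=1)
    J = np.vstack([Jx, Jy])                            # d e / d theta
    Omega = np.diag(np.concatenate([w, w]))            # same weight for the x and y errors
    I_reg = np.diag([1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0])   # translation left unconstrained
    H = J.T @ Omega @ J + gamma * I_reg
    delta = -np.linalg.solve(H, J.T @ Omega @ e)
    return theta + delta

In the full algorithm this step is applied to every frame in turn, the coordinates of the link endpoints on that frame are updated accordingly, and the sweep is repeated for up to T1 iterations or until the average parameter change becomes small (Algorithm 2).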
The keypoints on background are more likely to form longer range matches than the foreground ones, due to non-rigid foreground motion. Hence, when N 2 combinatorial pairs of frames are interconnected, there are a lot more background matches (Fig. 4.4). 4.2.6 Alignment of non-keyframes The keyframes alignment provides a set of temporally consistent motion compensated frames, which are the basis for aligning non-keyframes. We refer to keyframes and non-keyframes with superscripts i and j, respectively. For a non-keyframe j between the keyframes ki and ki+1 , its alignment is a special case of Eqn. 6.1, with indices K = { j}, and the destination of the links ( j) dk ∈ {ki , ki+1 }, i.e., only θ j of frame j is updated while the keyframes remain fixed. Each nonkeyframe between keyframes ki and ki+1 is aligned independently. ( j) However, given the small time offset between j and dk , the observed foreground motion may be hard to discern. Also, frame j is linked only to two keyframes, thus there is no redundancy of background information to improve robustness to foreground motion. Therefore, we need a different means of outlier handling. We handle this issue by assigning higher weights to links that are more likely to be connected to the background. For each keyframe i, we quantify how well the links emanating from frame i are aligned with 46 (i) (i) (i) (i) (i) (i) other keyframes. If the alignment error is small, i.e., εk = Wx (xk , yk ; θ )−uk + Wy (xk , yk ; θ )− (i) < τ, the link k is more likely on the background of frame i and thus, more reliable for aligning vk non-keyframes. We create a reliability map for each keyframe i, denoted as R(i) (Fig. 4.5). For (i) (i) (i) each link k with εk < τ, a Gaussian function with µk = (xk , yk ) and σk = csk is superposed on R(i) , where the constant c is 20. We define, (i) Rm,n = − ∑e (i) 2 (i) 2 m−x + n−y k k 2σ 2 k , (4.11) 1 η k∈Bi (i) where Bi = {k|εk < τ}, η > 0 is a small constant (set to 0.1), x η = max(x, η) and x 1 = ( j) min(1, η). Now, we assign the weight of the links connecting frame j to the keyframe dk at the ( j) ( j) ( j) (a) coordinate (uk , vk ), as the reliability map of the keyframe at the endpoint, Ωk,k = R ( j) ( j) uk ,vk rq , ( j) where a = dk . We summarize the TRGMC algorithm in Algorithm 2. 4.3 Experimental Results We now present qualitative and quantitative results of the TRGMC algorithm and discuss how different computer vision applications will benefit from TRGMC. 4.3.1 Baselines and details We choose three sequential GMC algorithms as the baselines for comparison: MLESAC [107] and HEASK [124] both based on our own implementation, and RGMC [88] based on the authors’ Matlab code available online. We implement TRGMC in Matlab, and will publish the code. Denoting the video frames of w × h pixels, we set the parameters as γ = 0.1wh, T1 = 300, τ1 = 5 × 10−4 , T2 = 50, τ2 = 10−4 , r = 0.7, τ = 1, ∆ f = 10, and β = 1. For the backward-forward scheme we set 47 Algorithm 2: TRGMC Algorithm 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Data: A set of input frames {I(m) }M m=1 Result: A set of homography matrices {θm }M m=1 /* Align keyframes (Sec. 4.2.2) Specify K = {k1 , ..., kN } and initialize (Sec. 4.2.4); Match keypoints of all frames i ∈ K densely; Prune links (Sec. 4.2.5) and set weights (Eqn. 4.10); Store links’ start and end coordinates in (xi , yi ) and (ui , vi ); repeat forall the i ∈ K do Compute ∆θi (Eqn. 4.6), update θi , xi and yi ; Update (um , vm ) according to θi for m ∈ K\{i}; Update weights (Eqn. 
4.10); q ← q + 1; until q < T1 or N1 ∑i∈K ||∆θi ||2 > τ1 ; /* Align non-keyframes (Sec. 4.2.6) Compute reliability map R(i) for i ∈ K ; for i = 1 : N − 1 do forall the j ∈ {ki + 1, ..., ki+1 − 1} do Match keypoints in j with d ( j) ∈ {ki , ki+1 } ; 16 17 18 19 20 21 */ */ ( j) Prune links (Sec. 4.2.5) and set weights Ωk,k ; Store links’ coordinates in (x j , y j ) and (u j , v j ); repeat Compute ∆θ j (Eqn. 4.6), update θ j , x j and y j ; Update weights (Eqn. 4.10), q ← q + 1; until q < T2 or ||∆θ j ||2 > τ2 ; α = 1 and for the backward scheme α = 0. 4.3.2 Datasets and metric Given there is no public dataset for quantitative GMC evaluation, we form a dataset composed of 40 challenging videos from SVW [89] and 15 videos from UCF101 [100], termed “quantitative dataset”. SVW is an extremely unconstrained dataset including videos of amateurs practicing sports, and is also captured by amateurs via smartphone. The min. and max. spatial size of videos are 240 and 480 pixels, respectively. The average, min., and max. length of the videos are 14, 3, 48 Algorithm MLESAC HEASK RGMC TRGMC GT* Setting – – – BF* B* – Avg. BRE 0.116 0.110 0.097 0.058 0.060 0.038 Efficiency (s/f) 0.17 7.47 3.47 0.64 0.41 – Table 4.1: Comparison of GMC algorithms on quantitative dataset (*GT: Ground truth, BF: Backward-Forward, B: Backward). and 45 seconds, captured at 25 or 30 FPS. In addition, we form another “qualitative dataset” with 200 unlabeled videos from SVW, in challenging categories of boxing, diving, and hockey. To compare GMC over different temporal distances of frames, for each video of length M in the quantitative dataset, we manually align all 10 possible pairs from the 5-frame set, F = {1, 0.25M, 0.5M, 0.75M, M}, as long as they are overlapping, and specify the background regions. For this, a GUI is developed for a labeler to match 4 points on each frame pair, and fine tune them up to a half-pixel accuracy, until the background difference is minimized. Then, the labeler selects the foreground regions which subsequently identify the background region. Similar to [88], we quantify the consistency of two warped frames I(i) (θi ) and I( j) (θ j ) (0 to 1 grayscale pixels) via the background region error (BRE), BRE(i, j) = where 1 MB |(I(i) (θi ) − I( j) (θ j ))| MB 1 , (4.12) 1 is element-wise multiplication and MB is the background mask for the intersection of two warped frames. 4.3.3 Quantitative evaluation Average of BRE over all the temporal frames pairs is shown in Table 4.1. TRGMC outperform all the baseline methods with considerable margin. The backward-forward (BF) scheme has a slightly better accuracy than the backward (B) scheme, and is also more stable based on our visual 49 Figure 4.6: Average BRE of frame pairs versus the time difference between the two frames. 150 150 t 200 t 200 100 100 50 50 0 (a) 200 400 600 0 x (b) (c) 200 400 600 x (d) Figure 4.7: Top view of the frames and links (a) before and (b) after TRGMC. The parallel links in (b) show successful spatial alignment of keypoints. Average of frames (c) before and (d) after TRGMC. For better visibility, we show up to 15 links emanated per frame. observation. Thus, we use BF as the default scheme for TRGMC. To illustrate how the accumulation of errors over time affects the final error, Fig. 4.6 summarizes the average error versus the time difference between the frames in F. This shows that TRGMC error is almost constant over a wide temporal distance between the frames. 
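For completeness, the BRE of Eqn. 4.12 underlying these comparisons is straightforward to evaluate; a minimal sketch (grayscale frames in [0, 1], a binary background mask for the overlap region, and hypothetical variable names):

import numpy as np

def background_region_error(I_i, I_j, M_B):
    # BRE(i, j) of Eqn. 4.12: mean absolute background difference of two warped frames
    # I_i, I_j: frames already warped by theta_i and theta_j; M_B: binary background mask
    diff = np.abs(I_i - I_j) * M_B
    return diff.sum() / M_B.sum()

The same measure, restricted to the manually labeled background regions, is what Table 4.1 and Fig. 4.6 report.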
Thus, even if a frame is not aligned accurately, the error is not propagating to all the frames after that. However, in sequential GMC, the error increases as the time difference increases. 50 Figure 4.8: Composite image formed by overlaying the frame n on frame 1 for several videos after TRGMC. Left to right, top to bottom, n is equal to 144, 489, 912, 93, respectively. In the overlap region the difference between the frames is shown. 4.3.4 Qualitative evaluation While quantitative results are comprehensive, the number of videos is limited by the labeling cost. Thus, we further compare TRGMC and the best performing baseline, RGMC, on the larger qualitative dataset. The resultant motion panoramas were visually investigated and categorized into three cases: good, shaking, and failed (i.e., considerable background discontinuity). The comparison in Tab. 4.2 again shows the superiority of TRGMC. Figure 4.7 shows the links of a sample video processed by TRGMC, and the average frames, before and after processing. Initialization module is disable for generating this figure to better illustrate how well the spatial coordinate of the keypoints are aligned, resulting in links parallel to the t− axis. This video also shows how GMC might be utilized for video stabilization. Figure 4.8 shows a composite image formed by overlaying the last frame (or a far apart frame with enough overlap) on frame 1 for several videos, after TRGMC. In the overlap region, difference between the two frames is shown, to demonstrate how well the background region matches for the frames 51 with large temporal distance. 4.3.5 Computational efficiency Table 4.1 also presents the average time for processing each frame for each method, on a PC with an Intel i5-3470@3.2GHz CPU, and 8GB RAM. While obtaining considerably better accuracy than HEASK or RGMC, TRGMC is on average 15 times faster than HEASK and 7 times faster than RGMC. MLESAC is ∼3 times faster than TRGMC, but with twice the error. For TRGMC, the backward scheme is 50% faster than forward-backward, since it has approximately half the links of BF. 4.3.6 Accuracy vs. efficiency trade-off Fig. 4.9 presents the error and efficiency results for a set of 5 videos versus the keyframe selection step, ∆ f . For this set, the ground truth error is 0.049. As a sweet spot in the error and efficiency trade-off, we use ∆ f = 10 for TRGMC. This figure also justifies the two stage processing scheme in TRGMC, as processing frames at a low selection step ∆ f , is costly in terms of efficiency, but only improves the accuracy slightly. Alg. \Perfromance RGMC TRGMC Good 64% 93% Shaking 33% 5% Failed 3% 2% Table 4.2: Comparison of GMC algorithms on qualitative dataset. 4.4 Conclusions We proposed a temporally robust global motion compensation (TRGMC) algorithm by joint alignment (congealing) of frames, in contrast to the common sequential scheme. Despite complicated 52 0.058 0.057 1 0.056 0.055 0.5 0.054 2 4 6 8 10 12 14 16 18 Avg. BRE Efficiency(s/f) 1.5 20 ∆f Figure 4.9: Error and efficiency vs. the keyframe selection step, ∆ f . camera motions, TRGMC can remove the intentional camera motion, such as pan, as well as unwanted motion due to vibration on handheld cameras. Experiments demonstrate that TRGMC outperforms existing GMC methods, and applications of TRGMC. The enabling assumption of TRGMC is that the camera motion in the direction of the optical axis is negligible. 
For instance, TRGMC will not work properly on a video from a wearable camera of a pedestrian, since in the global coordinate the upcoming frames grow in size and cause computational and rendering problems. Similar to panorama images, the best results are achieved if the optical center of the camera has negligible movement during the capturing, making a homographybased approximation of camera motion appropriate. However, if the optical center moves in the perpendicular direction to the optical axis (e.g., a camera following a swimmer), TRGMC still works well, but rendering the results in the form of motion panorama will be degraded by parallax effect. 53 Chapter 5 Global Motion Compensation Applications There are a wide range of applications which can benefit from a robust and accurate global motion compensation (GMC) algorithm. Basically, many algorithms which are tailored to work only with static cameras benefit from GMC, as GMC transforms the video from a freely moving camera to a video from a pseudo-static camera in which background pixels are static over progression of time. Besides, some byproducts of GMC such as background reconstructions and motion panoramas themselves provide an interesting visualization of the captured video. In these chapter, we briefly investigate these applications and then in Chapter 6 propose an algorithms for spatio-temporal alignment of non-overlapping sequences which is enabled by TRGMC. 5.1 Motion panorama By sequentially reading input frames, applying the transformation found by TRGMC, and overlaying the warped frames on a sufficiently large canvas, a motion panorama is generated. Furthermore, it is possible to reconstruct the background using the warped frames first (as will be discussed later), and overlay the frames on that, to create a more impressive panorama. The last frame on the video generated such, can be referred to as a panoramic mosaic [101]. Figure 5.1 shows a few exemplar panoramas along with the camera motion pattern. For all the input videos of length M, we apply ( 21 (θ1 + θM ))−1 to the transformations found by TRGMC to normalize the result and have a better view of the scene in a smaller spatial area. 54 5.2 Raster scan of scenes/Image mosaic We may swipe the camera through a large scene in a raster scan fashion and use TRGMC to reconstruct a big image mosaic. Note that this scenario is non-trivial since the accumulated error can be obvious when the raster scan comes back to the original camera position. The long term robustness presented by TRGMC is crucial in this scenario. 5.3 Background reconstruction Background reconstruction is important for removing occlusions, or detecting foreground [74]. To reconstruct the background, a weighted average scheme is used to weight each frame by the reliability map, R(i) , which assigns higher weights to background. Since the minimum value of R(i) is a positive constant η, if no reliable keyframe exists at a coordinate, all the frames will have equal weights. Specifically, the background is reconstructed by B = ∑i∈K R(i) (θi )I(i) (θi ) , ∑i∈K R(i) (θi ) where R(i) (θi ) and Ii (θi ) are the reliability map and the input frame warped using the transformation θi . Using our scheme, reconstructed background in Fig. 
5.2 is sharper and less impacted by the Frame = 759 300 Frame = 94 200 100 0 −100 −200 −300 −400 200 400 600 800 1000 1200 1400 200 100 0 −100 Frame = 165 −200 200 400 600 800 1000 1200 1400 Frame = 115 1600 500 400 300 300 250 200 200 100 150 0 100 −100 50 0 −200 −50 100 150 200 250 300 350 400 450 500 Figure 5.1: Temporal overlay of frames from different videos processed by TRGMC. Trajectory of the center of image plane over time is overlaid on each plot to show the camera motion pattern, where color changes from blue to red with progression of time. −300 400 600 800 1000 55 1200 1400 Figure 5.2: Background reconstruction results. foreground. 5.4 Foreground segmentation The reliable background reconstruction result B as calculated in Section 5.3 along with the GMC result of frame I(i) , i.e., θi , can be easily used to segment the foreground by thresholding the difference image, |B − I(i) (θi )| (See Fig. 5.3). 5.5 Human action recognition State of the art human action recognition heavily relies on analysis of human motion. For instance, the dense trajectories algorithm [114] for motion analysis reveals its power when camera motion is compensated in the input video, either as pre-processing step, or internally [117]. Otherwise, camera motion interferes with human motion, making the analysis problem very challenging. In [117], camera motion is compensated by detecting human and removing motion vectors due to human motion, and relying on RANSAC algorithm for outlier rejection. However, this internal GMC 56 (a) (b) (c) Figure 5.3: Foreground segmentation: (a) Input frame, (b) reconstructed background, (c) difference of (a,b) on (a). (b) (a) Figure 5.4: Dense trajectories of the (a) original video, and (b) TRGMC-processed video. requires accurate human detection, which has a high failure rate in videos in the wilds, specially for highly articulated human body in sports videos, and loses performance when number of false matches increases. Fig. 5.4 illustrates the difference of dense trajectories calculated on an input video with and without application of TRGMC. As shown in this figure, utilizing TRGMC before extraction of dense trajectories effectively suppresses the camera motion effect. 57 5.6 Multi-object tracking (MOT) When appearance cues for tracking are ambiguous, e.g., tracking players in team sports like football, motion cues gain extra significance [26, 55]. MOT is comprised of two tasks, data association by assigning each detection a label, and trajectory estimation – both highly affected by camera motion. TRGMC can be applied to remove camera motion and thus, revive the power of tracking algorithms relying on motion cues. To verify the impact of TRGMC, we manually label the locations of all players in 566 frames of a football video (the one in Fig. 5.3) and use this ground truth detection results to study how MOT using [3] benefits from TRGMC. Fig. 5.5 compares the trajectories of players over time with and without applying TRGMC. Comparing number of label switches, this qualitatively demonstrates improvement of a challenging MOT scenario using TRGMC. Also, the Multi-Object Tracking Accuracy (MOTA) [12] achieved for the original video and the video processed by TRGMC are 63.79% and 84.23%, respectively. 5.7 Spatio-temporal alignment of non-overlapping sequences Spatio-temporal alignment of multiple videos is an important computer vision problem with a wide range of applications. Previous works study different aspects and scenarios of the spatio-temporal alignment. 
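Before detailing the non-overlapping scenario, the background reconstruction of Sec. 5.3 and the foreground segmentation of Sec. 5.4 can be recapped as a short sketch. The Python/NumPy code below is illustrative only: the frames and reliability maps are assumed to be already warped to the global coordinate and stored as arrays of equal size, issues such as the varying support of the warped frames are ignored, and the segmentation threshold is a hypothetical parameter.

import numpy as np

def reconstruct_background(warped_frames, warped_reliability, eta=0.1):
    # Weighted average of Sec. 5.3: B = sum_i R_i(theta_i) I_i(theta_i) / sum_i R_i(theta_i)
    num = np.zeros_like(warped_frames[0], dtype=np.float64)
    den = np.full(warped_frames[0].shape, 1e-12)          # guard against division by zero
    for I, R in zip(warped_frames, warped_reliability):
        R = np.maximum(R, eta)                            # reliability is lower-bounded by eta
        num += R * I
        den += R
    return num / den

def segment_foreground(B, warped_frame, thresh=0.1):
    # Sec. 5.4: threshold the difference image |B - I_i(theta_i)| (thresh is illustrative)
    return (np.abs(B - warped_frame) > thresh).astype(np.uint8)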
Given the capabilities of the accurate GMC provided by TRGMC, it is possible to design algorithms for spatio-temporal alignment of sequences in new scenarios. Specifically, since it is possible to transform each given sequence to a global coordinate and also reconstruct the background for each video sequence, it is possible to register the background images, and subsequently register each frame of each sequence with other sequences. To this end, we propose a new algorithm for spatio-temporal alignment of sequences, for non-overlapping sequences (NOS), which is presented in details in Chapter 6. Targeted NOS are captured by freely and independently panning 58 500 t 400 300 200 100 0 0 200 400 600 x 500 t 400 300 200 100 0 0 200 400 600 800 1000 1200 1400 1600 1800 2000 2200 x Figure 5.5: Multi-player tracking using [3] for a football video with camera panning to the right, before (top) and after processing by TRGMC (bottom). cameras from nearby viewpoints. In NOS, sequences might not have any pair of frames that have spatial overlap and belong to the same world time instant. More interestingly, sequences might even not cover some common regions of the same scene over the progression of time. Our algorithm uses TRGMC to map each frame to a camera-motion-removed video and reconstruct the background for each sequence, independently. These potentially non-overlapping backgrounds are aligned via appearance cues and also the prediction that where a moving object leaving field of view of a camera will appear in field of view of another camera. Given the spatial alignment, we predict when a moving object leaving field of view of one camera will appear in field of view of another to come up with the temporal. We mathematically formulate this prediction and estimate the temporal synchronization. 59 Chapter 6 Spatio-Temporal Alignment of Non-Overlapping Sequences from Independently Panning Cameras 6.1 Introduction Spatio-temporal alignment of multiple videos [16, 27, 28, 33, 39, 76, 80, 95, 120] is a well-studied vision problem with a wide range of applications, e.g., human action recognition [82, 110], video editing [120], markerless motion capture [39], video mosaicing, change detection [27], and abandoned object detection [48]. Previous works study different aspects and scenarios of the spatiotemporal alignment. Some works target sequences from the same scene but different viewpoints [39, 76]. Some can handle sequences recorded at different times by independent moving cameras that follow a similar trajectory [28, 33,120]. The seminal work of Caspi and Irani [16] studies spatially non-overlapping sequences when two fixed cameras move jointly in space. Our work covers a novel unexplored aspect of spatio-temporal alignment of sequences, for nonoverlapping sequences (NOS). Targeted NOS are captured by freely and independently panning cameras, from nearby viewpoints, with limited translation, especially in optical axis direction. In NOS, sequences might not have any pair of frames that have spatial overlap and belong to the same world time instant. More interestingly, sequences might even not cover some common regions 60 S1 S2 a Temporal overlap t (a) f = 540 S1 b S2 Spatiotemporal overlap x No/minimal overall spatial overlap f = 571 (b) Figure 6.1: (a) Top view of spatio-temporal FOV of two moving cameras capturing sequences S1 and S2 ; Non-overlapping sequences (NOS) may not even cover some common spatial region over the progression of time, i.e, no overall spatial overlap will exist. 
(b) Spatio-temporal alignment of NOS results in displaying sequences from multiple freely panning cameras in a common coordinate and at the correct time shift.
In other words, if we reconstruct the background observed by these sequences, the backgrounds may be non-overlapping, i.e., in Fig. 6.1 (a), overall spatial overlap does not exist. Given the ubiquity of smartphones and wearcams, NOS are increasingly common. When amateur users shoot unsynchronized videos of an event, aligning these videos leads to a single comprehensive video with greater spatial and temporal spans (Fig. 6.1 (b)). The resultant video is essentially a panoramic video, shot by smartphones, without the need to fix the cameras to each other or use tripods, with the best visual presentation achieved if there exists even a tiny overlap. Further, in cases of crimes or violations where many witnesses capture videos of the incident, each sequence may cover part of the story. Aligning these videos into a unified large-scale 3D volume provides a better grasp of the full picture.
The existing spatio-temporal alignment algorithms fail in the case of NOS, since even if there is some overall spatial overlap, spatial alignment of apparently overlapping frames, as Fig. 6.1 (a) shows for frames a and b, obviously violates the temporal alignment. However, by decomposing the task into spatial alignment first and then temporal alignment based on scene dynamics, the problem can be solved. In general, our proposed algorithm assumes NOS satisfy the following two assumptions. 1) Although the sequences do not need to have any corresponding frames that share a common scene at the same world time stamp, and may have no overall overlap as in Fig. 6.1 (a), they cover nearby parts of a scene from similar view angles. 2) There are moving objects in the scene which move from the field of view (FOV) of one camera to the FOV of other cameras, or, if the sequences happen to have overlap, have motion in the overlap region.
Our algorithm uses global motion compensation to map each frame to a camera-motion-removed video and reconstruct the background for each sequence independently. With the two assumptions, these potentially non-overlapping backgrounds are aligned via appearance cues and the prediction of where a moving object leaving the FOV of one camera will appear in the FOV of another camera. The collection of the former mappings and the latter background alignment puts each frame of each sequence in correct spatial alignment w.r.t. frames from other sequences. Given the spatial alignment and assumption 2, we predict when a moving object leaving the FOV of one camera will appear in the FOV of another. We mathematically formulate this prediction and estimate the temporal synchronization.
In summary, this chapter makes these contributions: A new scenario in spatio-temporal alignment of sequences is identified and targeted. A spatial alignment algorithm for NOS via alignment of reconstructed backgrounds and consistency of object movement is proposed. The trajectories of moving objects with smooth paths are used as a clue for temporal alignment of NOS.
6.2 Previous Work
The prior works in spatio-temporal alignment of sequences mostly differ in their assumptions and scenarios, e.g., the camera movement (static, jointly moving, or moving), camera viewpoint (similar or distinct), extent of overlap between sequences, and extent of similarity of camera motion paths.
The work of [34] presents an excellent taxonomy of these assumptions, one of which is that, to align sequences from the same event captured by freely moving cameras, coherent scene appearance is assumed. We lift this assumption by handling non-overlapping sequences, although we do assume negligible camera movement in the optical axis direction. We now review key scenarios in prior work.
6.2.1 Jointly moving cameras
Caspi and Irani align spatially non-overlapping sequences when two closely attached cameras move jointly in space (Fig. 6.2 (a)) [16]. Assuming the cameras share the same projection center, their relationship is modeled as a fixed homography H, estimated based on the idea that the apparent motion in camera 1 is related to that in camera 2 by H. Esquivel et al. [32] relax the projection center assumption and calibrate a multi-camera rig from non-overlapping views, assuming synchronized sequences. In contrast, the cameras in NOS can pan freely with no overlap, which is substantially more challenging.
6.2.2 Cameras following similar trajectories
The authors of [28, 33, 34, 120] align sequences recorded at different times by independent moving cameras that follow a similar trajectory (Fig. 6.2 (b)). [28] takes one sequence as the reference and assumes the other sequences are entirely contained (temporally) within the reference. The alignment is formulated as an energy minimization problem alternately solved for temporal and spatial alignment parameters, and is evaluated on four sets of real videos. In [120], an interactive method for nonlinear temporal video alignment is proposed for video editing. All these methods require coherent scene appearance and are not capable of handling sequences from moving cameras with no overlap in FOV, the targeted scenario of NOS.
Figure 6.2: Various scenarios in spatio-temporal alignment of sequences: (a) jointly moving cameras, (b) independently moving cameras at different times following similar trajectories, (c) stationary cameras with different viewpoints, (d) the proposed independently panning cameras with non-overlapping sequences.
6.2.3 Stationary cameras at different views
Padua et al. [76] target n ≥ 2 sequences from the same scene but different viewpoints (Fig. 6.2 (c)). The stationary cameras allow the estimated epipolar geometry to remain fixed. Motion trajectories are used as cues for both spatial and temporal alignment. Experimental results are provided for 5 sequences; since the proposed method is not dependent on a specific tracker, for each sequence the optimal tracker is chosen based on the application at hand.
6.2.4 Time synchronization
Many prior works have focused on time synchronization of sequences. Assuming known 3D object locations and calibrated stationary cameras, [15] synchronizes non-overlapping sequences from these cameras. Gaspar et al. [37] propose a synchronization algorithm for the case where two cameras move independently, even if different features are tracked in the two sequences.
It assumes known intrinsic camera parameters and two visible rigid objects in both sequences, whose relative motion is used for synchronization. Lu and Mandal [67] model video temporal alignment as a spatio-temporal discrete trajectory alignment problem. The method is evaluated on synthetic trajectories and 10 pairs of real videos. Our method also relies on the existence of at least one moving object for temporal synchronization. In fact, without spatial overlap between FOVs, any temporal alignment algorithm has to track moving objects or egomotion [34]. However, we can work with non-overlapping sequences where the same moving object is not visible at the same time in all sequences, without relying on camera calibration or known moving object locations.
Figure 6.3: Flowchart of our spatio-temporal alignment algorithm. First, spatial alignment is performed by background reconstruction for each sequence (a) and aligning the backgrounds (b). Second, given the spatial alignment parameters, keypoint trajectories (c) are mapped to the world coordinate and the best temporal alignment in terms of continuity of moving object trajectories is found (d). Finally, the spatio-temporal alignment parameters are used for displaying the sequences in a world coordinate system and at the correct time shift (e).
6.3 Proposed Method
We discussed the assumptions for the proposed spatio-temporal alignment of NOS in Section 6.1. The intrinsic and extrinsic camera parameters are not required. Also, the cameras might be unsynchronized, i.e., the capture starts at different times, with possibly distinct frame rates, and the cameras are panned freely and independently. However, the best results are achieved with a small camera baseline and limited translation of the cameras, especially in the optical axis direction. The proposed algorithm has two stages: (1) spatial alignment (Fig. 6.3(b)), which relies on the reconstructed backgrounds' appearance and the consistency of movement of objects across the sequences, and (2) temporal alignment (Fig. 6.3(d)), which uses the continuity of objects' trajectories to synchronize the videos. Our method is feature-based, relying on keypoint correspondences for the first stage and keypoint trajectories for both stages.
6.3.1 Notations
As shown in Fig. 6.3, the frame coordinate refers to the pixel coordinate in the input video, the sequence coordinate to the global coordinate of the reconstructed background of one video, and the world coordinate to the global coordinate of all input videos where the final aligned video is rendered. We denote coordinates and time stamps in the frame coordinate with plain letters, in the sequence coordinate with a single tilde, e.g., $\tilde{x}$, and in the world coordinate with a double tilde, e.g., $\tilde{\tilde{x}}$. Accordingly, a transformation from the frame to the sequence coordinate has a single tilde, and a transformation from the sequence to the world coordinate has a double tilde. We use superscripts for the sequence number and subscripts for either the frame number or the trajectory number. E.g., $\tilde{\theta}_i^s$ is the transformation of frame i in sequence s from the frame coordinate to the sequence coordinate.
6.3.2 Spatial alignment
We break down the spatial alignment into two phases. First, for each sequence, we map all the frames to the sequence coordinate via global motion compensation (GMC), which also produces a reconstructed background mosaic (Fig. 6.3(a)). A crucial assumption for successful GMC is that the camera has small motion in the optical axis direction. Second, image alignment is conducted on the reconstructed backgrounds and maps them to the world coordinate (Fig. 6.3(b)).
However, if the backgrounds are non-overlapping, common image alignment cannot be used. Thus, a new alignment scheme is proposed in Sec. 6.3.2.3.
6.3.2.1 Global motion compensation
GMC removes any intentional or unwanted camera motion in a sequence, creating a video with a static background [86, 88]. Essentially, GMC estimates a per-frame transformation to the sequence coordinate. This work utilizes the TRGMC algorithm [86], discussed in Chapter 5, which handles dynamic scenes and estimates the transformations by joint alignment of the input frames. TRGMC first detects SURF [11] keypoints in each frame and performs keypoint matching to densely interconnect all frames, regardless of their temporal offset. These connections are referred to as links. Then keypoint-based congealing applies an appropriate transformation to each frame and its links, such that the spatial coordinates of the endpoints of each link are as similar as possible.
For the convenience of readers, we briefly introduce the keypoint-based congealing. Given a stack of N frames $\{I_i\}$, with indices $i \in K = \{k_1, ..., k_N\}$, keypoint-based congealing is formulated as an optimization problem,
$$\min_{\{\tilde{\theta}_i^s\}} \varepsilon^s = \sum_{i \in K} [\mathbf{e}_i(\tilde{\theta}_i^s)]^\top \Omega_i^s\, [\mathbf{e}_i(\tilde{\theta}_i^s)], \quad (6.1)$$
where $\tilde{\theta}_i^s$ is an 8-dim homography transformation parameter from frame i of sequence s to the sequence coordinate, $\mathbf{e}_i(\tilde{\theta}_i^s)$ collects the pair-wise alignment errors of frame i relative to all other frames, and $\Omega_i^s$ is a weight matrix. Since TRGMC uses a homography transformation, it works best with nodal camera motion. In the case of camera translation, TRGMC still works by matching the dominant background, although the result may degrade with parallax.
The alignment error of frame i relative to all other frames is the sum of squared differences (SSD) between the coordinates of the endpoints of all links connecting keypoints of frame i to keypoints of other frames. The coordinates of the start and end points of each link k starting from frame i are denoted as $(x_{i,k}, y_{i,k})$ and $(u_{i,k}, v_{i,k})$, respectively. The error $\mathbf{e}_i(\tilde{\theta}_i^s)$ is,
$$\mathbf{e}_i(\tilde{\theta}_i^s) = [\boldsymbol{\epsilon}_i^x(\tilde{\theta}_i^s)^\top, \boldsymbol{\epsilon}_i^y(\tilde{\theta}_i^s)^\top]^\top, \quad (6.2)$$
where
$$\boldsymbol{\epsilon}_i^x(\tilde{\theta}_i^s) = \tilde{\mathbf{w}}_i^{(x)} - \mathbf{u}_i, \qquad \boldsymbol{\epsilon}_i^y(\tilde{\theta}_i^s) = \tilde{\mathbf{w}}_i^{(y)} - \mathbf{v}_i \quad (6.3)$$
are the errors along the x and y axes. The vectors $\tilde{\mathbf{w}}_i^{(x)} = [W_x(x_{i,k}, y_{i,k}; \tilde{\theta}_i^s)]$ and $\tilde{\mathbf{w}}_i^{(y)} = [W_y(x_{i,k}, y_{i,k}; \tilde{\theta}_i^s)]$ denote the x and y coordinates of $(x_{i,k}, y_{i,k})$ warped by the parameter $\tilde{\theta}_i^s$, respectively. The vectors $\mathbf{u}_i = [u_{i,k}]$ and $\mathbf{v}_i = [v_{i,k}]$ denote the coordinates of the end points. Equation 6.1 is solved by taking the Taylor expansion around $\tilde{\theta}_i^s$ and finding the increment $\Delta\tilde{\theta}_i^s$ that minimizes,
$$\arg\min_{\Delta\tilde{\theta}_i^s} \Big[\mathbf{e}_i(\tilde{\theta}_i^s) + \frac{\partial \mathbf{e}_i(\tilde{\theta}_i^s)}{\partial \tilde{\theta}_i^s}\Delta\tilde{\theta}_i^s\Big]^\top \Omega_i^s \Big[\mathbf{e}_i(\tilde{\theta}_i^s) + \frac{\partial \mathbf{e}_i(\tilde{\theta}_i^s)}{\partial \tilde{\theta}_i^s}\Delta\tilde{\theta}_i^s\Big] + \gamma\, (\Delta\tilde{\theta}_i^s)^\top \mathbf{I}\, \Delta\tilde{\theta}_i^s, \quad (6.4)$$
where $(\Delta\tilde{\theta}_i^s)^\top \mathbf{I}\, \Delta\tilde{\theta}_i^s$ is a regularization term which stabilizes the changes to the transformation, with a positive constant $\gamma$ setting the trade-off. The indicator matrix $\mathbf{I}$ is a diagonal matrix specifying which elements of $\Delta\tilde{\theta}_i^s$ need a constraint. By setting the first-order derivative of Eqn. 6.4 to zero, a closed-form solution for $\Delta\tilde{\theta}_i^s$ is obtained. After enough iterations, $\tilde{\theta}_i^s$ will be the transformation mapping frame i of the input video to the sequence coordinate.
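To make the update in Eqn. 6.4 concrete, the following is a minimal single-frame sketch of one regularized Gauss-Newton step. For simplicity it uses a finite-difference Jacobian, a diagonal per-link weight in place of the full weight matrix, and a full identity regularizer in place of the indicator matrix; the parameterization and all names are illustrative assumptions rather than the TRGMC implementation.

# Minimal sketch of one step of Eqn. 6.4 for a single frame. `theta` is a
# length-8 numpy float array, `pts` the link start points (K x 2), `targets`
# the linked end points in the other frames (K x 2), `w` per-link weights (K,).
import numpy as np

def warp(pts, theta):
    # 8-parameter homography: H = [[1+p1, p2, p3], [p4, 1+p5, p6], [p7, p8, 1]].
    p1, p2, p3, p4, p5, p6, p7, p8 = theta
    x, y = pts[:, 0], pts[:, 1]
    d = p7 * x + p8 * y + 1.0
    return np.stack([((1 + p1) * x + p2 * y + p3) / d,
                     (p4 * x + (1 + p5) * y + p6) / d], axis=1)

def congealing_step(theta, pts, targets, w, gamma=1e-3, eps=1e-6):
    # e_i(theta): stacked x/y errors between warped link start points and the
    # linked end points.
    e = (warp(pts, theta) - targets).ravel()
    J = np.zeros((e.size, 8))
    for k in range(8):                       # finite-difference Jacobian of e_i
        t = theta.copy()
        t[k] += eps
        J[:, k] = ((warp(pts, t) - targets).ravel() - e) / eps
    W = np.repeat(w, 2)                      # one weight per (x, y) residual pair
    A = J.T @ (W[:, None] * J) + gamma * np.eye(8)
    delta = -np.linalg.solve(A, J.T @ (W * e))
    return theta + delta                     # one regularized Gauss-Newton update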
6.3.2.2 Spatial alignment of overlapping backgrounds
Given the $\tilde{\theta}_i^s$ for all the input videos, we follow [86] to reconstruct the backgrounds $B^s$ for them. If there exists enough overlap between the backgrounds, common image alignment algorithms may be used. Specifically, we estimate the transformation $\tilde{\tilde{\theta}}^s$ that maps the background of sequence s to the world coordinate, by matching SURF keypoints on the background images via the vector field consensus algorithm [70]. In summary, a point with the homogeneous coordinate (x, y, 1) in frame i of sequence s is mapped to the sequence coordinate of sequence s, denoted as $(\tilde{x}, \tilde{y}, 1)$, and to the world coordinate of all sequences, denoted as $(\tilde{\tilde{x}}, \tilde{\tilde{y}}, 1)$,
$$\begin{bmatrix} \tilde{\tilde{x}} \\ \tilde{\tilde{y}} \\ 1 \end{bmatrix} = \tilde{\tilde{\theta}}^s \begin{bmatrix} \tilde{x} \\ \tilde{y} \\ 1 \end{bmatrix} = \tilde{\tilde{\theta}}^s\, \tilde{\theta}_i^s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}. \quad (6.5)$$
Thus, the transformation $\tilde{\tilde{\theta}}^s \tilde{\theta}_i^s$ conducts spatial alignment for frame i in sequence s. Given the homography transformation $\tilde{\tilde{\theta}}^s$, as the cameras' baseline increases, the dominant background plane is aligned, and the foreground may be affected by parallax in the final composite video.
6.3.2.3 Spatial alignment of non-overlapping sequences
With freely panning cameras, it is likely that the backgrounds of the sequences have no overlap, or that the overall overlap is too small to reliably estimate the spatial alignment, which is flagged by noisy keypoint matches from vector field consensus [70]. One potential solution is to extrapolate the background images and align the extrapolated images, similar to [79]. However, our experiments reveal that this is not reliable. First, extrapolation introduces many artifacts [7] or blurred areas [2, 79], leading to poor keypoint matching. Second, extrapolation in the horizontal direction helps with alignment in the vertical direction, but leaves a lot of ambiguity in the horizontal alignment. Third, a rigid Euclidean transformation, as in [79], does not suffice for a proper background alignment.
Figure 6.4: Spatial alignment of non-overlapping sequences using background extrapolation and smoothness of object trajectories.
On the other hand, how objects move across the sequences in the spatial world coordinate, irrespective of temporal synchronization, provides hints for spatial alignment (Fig. 6.4). There is ambiguity in the exact spatial alignment; however, as more objects move across the sequences and in more diverse directions, the ambiguity decreases. To enable spatial alignment for non-overlapping sequences, we propose a spatial alignment algorithm that combines both aforementioned ideas. We first extrapolate the background images of all sequences. Then, we perform motion tracking to obtain trajectories of all keypoints in each sequence. By transforming the trajectories to the sequence coordinate using $\tilde{\theta}_i^s$ and filtering out static trajectories, we collect moving object trajectories. We create motion tracks by matching moving object trajectories across sequences. Finally, we incrementally update the transformation applied to the background images to increase the motion track smoothness in the world coordinate, while ensuring that this update does not violate the appearance consistency of the extrapolated backgrounds in the overlap region.
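Both the frame-level alignment of Eqn. 6.5 and the mapping of trajectories to the sequence coordinate described above are plain homography applications. The sketch below composes the frame-to-sequence and sequence-to-world homographies and applies the result to frame points; the function and variable names are illustrative assumptions.

# Minimal sketch of Eqn. 6.5: compose the frame-to-sequence homography
# `theta_i_s` with the sequence-to-world homography `theta_s` and map frame
# points to the world coordinate. Variable names are illustrative.
import numpy as np

def to_world(points_xy, theta_i_s, theta_s):
    """points_xy: (N, 2) frame coordinates; both homographies are 3x3 arrays."""
    composite = theta_s @ theta_i_s                 # frame -> sequence -> world
    pts_h = np.hstack([points_xy, np.ones((len(points_xy), 1))])  # homogeneous
    mapped = (composite @ pts_h.T).T
    return mapped[:, :2] / mapped[:, 2:3]           # back to inhomogeneous coords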
Motion tracking We perform tracking in consecutive frames to form the trajectories. Among various schemes, we prefer keypoint-based tracking for two reasons. 1) Object-based tracking requires detecting generic objects in each frame, which could be error-prone and inefficient. 2) Our experiments, and also the analysis in [61], reveal that optical flow-based tracking such as dense trajectories [115] leads to spurious motion trajectories close to motion boundaries. We use the SURF [11] keypoint detector due to its superior performance on blurry images in comparison to SIFT [65]. To detect newly emerging objects, we start tracking all the keypoints on frame i that have no corresponding matches from frame i − 1.
Denote the jth trajectory in sequence s as $P_j^s = [\mathbf{x}_j^s, \mathbf{y}_j^s, \tilde{\mathbf{t}}_j^s]$, where $\mathbf{x}_j^s$ and $\mathbf{y}_j^s$ are the frame coordinates and $\tilde{\mathbf{t}}_j^s$ is the time stamp. To handle sequences at different frame rates, $\tilde{\mathbf{t}}_j^s$ should be in an absolute time unit such as milliseconds, not the frame number. We then compute the trajectory $\tilde{P}_j^s$ in the sequence coordinate via $\tilde{\theta}_i^s$. In this coordinate, trajectories of moving and stationary keypoints are easily distinguishable, as the sequence coordinates of static objects remain constant over time (Fig. 6.3(d), bold vs. dashed lines). Denoting the trajectory length as $l_j^s$ and the width and height of the sequence as $w^s$ and $h^s$, we omit stationary trajectories if,
$$\frac{1}{l_j^s} \sum_{k=1}^{l_j^s - 1} \left( \frac{|\tilde{x}_{j,k}^s - \tilde{x}_{j,k+1}^s|}{w^s} + \frac{|\tilde{y}_{j,k}^s - \tilde{y}_{j,k+1}^s|}{h^s} \right) < \tau_1, \quad (6.6)$$
where $\tau_1$ is a threshold for the total displacement of the tracked object, and $\tilde{x}_{j,k}^s$ and $\tilde{y}_{j,k}^s$ denote the kth elements of the vectors $\tilde{\mathbf{x}}_j^s$ and $\tilde{\mathbf{y}}_j^s$, respectively.
Creating motion tracks We describe each moving object trajectory j of sequence s with two SURF descriptors, one for the keypoint starting the trajectory, $S_j^s$, and one for the keypoint ending it, $E_j^s$. To match two trajectories j and k from sequences $s_1$ and $s_2$, a classical keypoint matching algorithm [65] is used to match all 4 combinations of keypoints, i.e., $(S_j^{s_1}, S_k^{s_2})$, $(E_j^{s_1}, E_k^{s_2})$, $(S_j^{s_1}, E_k^{s_2})$, and $(E_j^{s_1}, S_k^{s_2})$, and the minimum distance decides a match. This way, more robustness against viewpoint variation is achieved, as the nearby keypoints of the trajectories (in the world coordinate) will be the deciding factor in trajectory matching. We call each set of matched trajectories a track, denoted by $\Pi_k$. For simplicity of notation, we assume that the trajectories contributing to a track have been re-indexed such that $\Pi_k = \{\tilde{P}_k^s; s \in \{1, ..., S\}\}$. For a certain sequence s, $\tilde{P}_k^s$ might be empty, i.e., no trajectory from this sequence is part of the track $\Pi_k$. Note that not all trajectories should be matched to form tracks, as some might be due to noise or objects with a non-smooth motion path. Sec. 6.3.3.2 presents a method to remove non-smooth trajectories.
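A minimal sketch of the two steps just described, the stationary-trajectory test of Eqn. 6.6 and the start/end descriptor comparison used when matching trajectories into tracks, is given below; the names, the use of Euclidean distance, and the default threshold are illustrative assumptions.

# Minimal sketch, with illustrative names: (1) the stationary-trajectory test
# of Eqn. 6.6, and (2) the minimum distance over the four start/end SURF
# descriptor pairings used to compare two trajectories.
import numpy as np

def is_moving(traj_x, traj_y, w, h, tau1=0.03):
    """Eqn. 6.6: normalized mean frame-to-frame displacement in the sequence coordinate."""
    l = len(traj_x)
    if l < 2:
        return False
    dx = np.abs(np.diff(np.asarray(traj_x, dtype=float))) / w
    dy = np.abs(np.diff(np.asarray(traj_y, dtype=float))) / h
    return (dx + dy).sum() / l >= tau1      # below tau1 -> stationary, discarded

def trajectory_distance(S1, E1, S2, E2):
    """S*, E*: SURF descriptors (1-D arrays) of a trajectory's start/end keypoints."""
    pairs = [(S1, S2), (E1, E2), (S1, E2), (E1, S2)]
    return min(np.linalg.norm(a - b) for a, b in pairs)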
Spatial alignment formulation For simplicity, we discuss the alignment of 2 sequences, as more sequences may be aligned in the same manner, sequentially. Also, we set $\tilde{\tilde{\theta}}^1 = I_{3\times 3}$ and use $\theta$ for $\tilde{\tilde{\theta}}^2$ to avoid cluttered equations. Given N tracks indexed by i, and extrapolated backgrounds $B_1$ and $B_2$, the goal is to find a transformation $\theta$ which maps $B_2$ to $B_1$, such that the pixel contents of the extrapolated backgrounds are consistent in the overlap region $O(\theta)$ and the trajectories of sequence 2 reside on the extension of the trajectories in sequence 1. For image extrapolation, we use the PatchMatch algorithm [7]. To further improve the extrapolation results, we use the contents of both background images when extrapolating each background. Then, we formulate the problem as an optimization problem (Fig. 6.4),
$$\min_\theta \sum_{x \in O(\theta)} \big[ B_2(W(x;\theta)) - B_1(x) \big]^2 + \beta \sum_i \mathbf{e}_i(\theta)^\top \mathbf{e}_i(\theta), \quad (6.7)$$
where $W(x;\theta)$ warps x by the transformation $\theta$, and $\mathbf{e}_i(\theta)$ represents how far trajectory i of sequence 2 is from the spatial extension of the matching trajectory in sequence 1. The first term in Eqn. 6.7 is similar to the Lucas-Kanade algorithm [5], operated only in the overlapping area. To define $\mathbf{e}_i(\theta)$, we fit a line, which based on our experiments works better than fitting polynomials, to the ith trajectory in sequence 1 (in the sequence coordinate), denoted by $f_i(x)$. The vector $\mathbf{e}_i(\theta)$ collects the y-distance between each point on the ith trajectory in sequence 2, after being warped by $\theta$, and the fitted curve,
$$\mathbf{e}_i(\theta) = \tilde{\mathbf{w}}_i^{(y)} - f_i(\tilde{\mathbf{w}}_i^{(x)}), \quad (6.8)$$
where $\tilde{\mathbf{w}}_i^{(x)} = [W_x(x_{i,2}, y_{i,2}; \theta)]$ and $\tilde{\mathbf{w}}_i^{(y)} = [W_y(x_{i,2}, y_{i,2}; \theta)]$ are the warped $\tilde{x}$ and $\tilde{y}$ coordinates of the ith trajectory in sequence 2 in the sequence coordinate. The optimization problem is solved by taking the Taylor expansion around $\theta$ and finding the increment $\Delta\theta$ by,
$$\arg\min_{\Delta\theta} \sum_{x\in O(\theta)} \Big[ B_2(W(x;\theta)) + \nabla B_2 \frac{\partial W}{\partial\theta}\Delta\theta - B_1(x) \Big]^2 + \beta\sum_i \Big[\mathbf{e}_i(\theta) + \frac{\partial \mathbf{e}_i(\theta)}{\partial\theta}\Delta\theta\Big]^\top \Big[\mathbf{e}_i(\theta) + \frac{\partial \mathbf{e}_i(\theta)}{\partial\theta}\Delta\theta\Big] + \alpha\,\Delta\theta^\top \mathbf{I}\,\Delta\theta, \quad (6.9)$$
where $\Delta\theta^\top \mathbf{I}\,\Delta\theta$ is a regularization term penalizing some specific changes of $\Delta\theta$, controlled by $\mathbf{I}$ and a positive constant $\alpha$. By setting $\mathbf{I} = \mathrm{diag}([0, 0, 1, 0, 0, 1, 0, 0])$, we penalize large changes of the translation elements of $\Delta\theta$, so that the frames are first aligned by warping them rather than translating them. Based on our experiments, this leads to a more stable result. We initialize the algorithm by setting the sequences side by side (spatially) with the two possible layouts, and use the alignment result of the layout with the lower final cost. The solution to Eqn. 6.9 is,
$$\Delta\theta = H^{-1}\Big( \sum_{x\in O(\theta)} \Big[\nabla B_2 \frac{\partial W}{\partial\theta}\Big]^\top \big[B_1(x) - B_2(W(x;\theta))\big] - \beta\sum_i \Big[\frac{\partial \mathbf{e}_i(\theta)}{\partial\theta}\Big]^\top \mathbf{e}_i(\theta) \Big), \quad (6.10)$$
in which
$$H = \sum_{x\in O(\theta)} \Big[\nabla B_2\frac{\partial W}{\partial\theta}\Big]^\top \Big[\nabla B_2\frac{\partial W}{\partial\theta}\Big] + \beta\sum_i \Big[\frac{\partial \mathbf{e}_i(\theta)}{\partial\theta}\Big]^\top \frac{\partial \mathbf{e}_i(\theta)}{\partial\theta} + \alpha\,\mathbf{I}. \quad (6.11)$$
Here $\frac{\partial W}{\partial\theta}$ is the Jacobian evaluated at x. By the chain rule,
$$\frac{\partial \mathbf{e}_i(\theta)}{\partial \theta} = \frac{\partial \mathbf{e}_i(\theta)}{\partial W}\,\frac{\partial W}{\partial \theta}, \quad (6.12)$$
in which
$$\frac{\partial \mathbf{e}_i(\theta)}{\partial W} = \frac{\partial \mathbf{e}_i(\theta)}{\partial [W_x, W_y]} = [-f_i'(\tilde{\mathbf{w}}_i^{(x)}),\ \mathbf{1}]. \quad (6.13)$$
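A minimal sketch of the trajectory term in Eqns. 6.8 and 6.13, fitting a line to the matched trajectory of sequence 1 and measuring the y-distance of the warped trajectory of sequence 2 from it, is given below; the names are illustrative and the image term of Eqn. 6.7 is omitted.

# Minimal sketch of e_i(theta) in Eqn. 6.8: fit a line to trajectory i of
# sequence 1 (in the sequence coordinate) and measure the y-distance of the
# warped trajectory of sequence 2 from that line. Names are illustrative;
# `warp_points` maps (N, 2) points by a 3x3 homography.
import numpy as np

def warp_points(pts, theta):
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    m = (theta @ pts_h.T).T
    return m[:, :2] / m[:, 2:3]

def trajectory_residual(traj1, traj2, theta):
    """traj1, traj2: (N, 2) and (M, 2) matched trajectories; returns e_i(theta)."""
    a, b = np.polyfit(traj1[:, 0], traj1[:, 1], 1)   # line f_i(x) = a*x + b
    w = warp_points(traj2, theta)
    return w[:, 1] - (a * w[:, 0] + b)               # y-distance to the fitted line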
Figure 6.5: Trajectories, tracks, and the fitted space-time curves for the tracks from 3 videos.
6.3.3 Temporal alignment
NOS are assumed to have moving objects, without which the temporal alignment is neither necessary nor possible. Given moving objects and the spatial alignment results, the temporal alignment of NOS amounts to estimating when an object will appear in the FOV of another camera after it moves out of the current FOV. If both cameras observe the object's motion at the same time, the problem is easier. For this purpose, we create motion tracks as discussed in Sec. 6.3.2. Then, we estimate the temporal offset between the sequences such that trajectories from the identical object follow a continuous path in the $\tilde{x}-\tilde{t}$ and $\tilde{y}-\tilde{t}$ coordinates, i.e., the motion tracks are smooth. Since not all trajectories are due to moving objects, we filter out motion trajectories with non-smooth paths before matching trajectories.
6.3.3.1 Estimation of temporal offset
Given the collection of tracks, the objective is to make each track a smooth curve by shifting the temporal coordinates of the contributing trajectories appropriately (Fig. 6.5). For S sequences, the $\tilde{x}$-coordinates of the trajectories forming the kth track form the vector $[\tilde{\mathbf{x}}_k^1, \tilde{\mathbf{x}}_k^2, ..., \tilde{\mathbf{x}}_k^S]^\top$; the $\tilde{y}$-coordinate vector of each track is defined similarly. We assume that by temporally shifting each sequence s by $\Delta t_s$, the sequences are temporally aligned. To estimate $\Delta t_s$, we fit a polynomial curve of degree m to the time stamps versus the $\tilde{x}$- and $\tilde{y}$-coordinates of each track independently and estimate the time shifts that achieve the lowest curve fitting error. Here, we discuss only the $\tilde{t}-\tilde{x}$ curve; the $\tilde{t}-\tilde{y}$ curve is similar. We denote the trajectory coordinates of sequence s and all the power terms of the polynomial space-time curve as
$$\tilde{\mathbf{x}}_k^{s(m)} = [\mathbf{1}_k^s, \tilde{\mathbf{x}}_k^s, [\tilde{\mathbf{x}}_k^s]^2, \cdots, [\tilde{\mathbf{x}}_k^s]^m], \quad (6.14)$$
where $\mathbf{1}_k^s$ is an $l_k^s$-dim vector of all ones, and $[.]^m$ denotes an element-wise power operation. For the track k, all the required terms of the polynomial space-time curve are collected in a matrix $\tilde{X}_k$ of size $\sum_s l_k^s \times (m+1)$, and all the time stamps in a vector $\tilde{T}_k(\Delta \mathbf{t})$ of length $\sum_s l_k^s$,
$$\tilde{X}_k = \begin{bmatrix} \tilde{\mathbf{x}}_k^{1(m)} \\ \tilde{\mathbf{x}}_k^{2(m)} \\ \vdots \\ \tilde{\mathbf{x}}_k^{S(m)} \end{bmatrix}, \qquad \tilde{T}_k(\Delta \mathbf{t}) = \begin{bmatrix} \tilde{\mathbf{t}}_k^1 + \Delta t_1 \\ \tilde{\mathbf{t}}_k^2 + \Delta t_2 \\ \vdots \\ \tilde{\mathbf{t}}_k^S + \Delta t_S \end{bmatrix}. \quad (6.15)$$
We denote the coefficients of the polynomial curve fitted to the kth track as $\mathbf{c}_k = [c_q];\ q \in \{0, ..., m\}$. We can estimate the coefficients by solving the linear system $\arg\min_{\mathbf{c}_k} \|\tilde{T}_k(\Delta \mathbf{t}) - \tilde{X}_k \mathbf{c}_k\|$. Since all tracks share the same $\Delta \mathbf{t}$, we can efficiently solve for all tracks jointly,
$$\mathbf{c}^*, \Delta \mathbf{t}^* = \arg\min_{\mathbf{c}, \Delta \mathbf{t}} \left\| \tilde{T}(\Delta \mathbf{t}) - \tilde{X}\mathbf{c} \right\|, \quad (6.16)$$
in which
$$\tilde{X} = \begin{bmatrix} \tilde{X}_1 & 0 & \cdots & 0 \\ 0 & \tilde{X}_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \tilde{X}_K \end{bmatrix}, \quad \tilde{T}(\Delta \mathbf{t}) = \begin{bmatrix} \tilde{T}_1(\Delta \mathbf{t}) \\ \tilde{T}_2(\Delta \mathbf{t}) \\ \vdots \\ \tilde{T}_K(\Delta \mathbf{t}) \end{bmatrix}, \quad \mathbf{c} = \begin{bmatrix} \mathbf{c}_1 \\ \mathbf{c}_2 \\ \vdots \\ \mathbf{c}_K \end{bmatrix}. \quad (6.17)$$
Here, $\tilde{X}$ is an $N_K \times K(m+1)$ matrix, where $N_K = \sum_k \sum_s l_k^s$ is the count of keypoints in all K tracks. We alternately estimate $\mathbf{c}$ and $\Delta \mathbf{t}$ until the change in $\Delta \mathbf{t}$ is negligible. We first estimate $\mathbf{c}$ with fixed $\Delta \mathbf{t}$. Since $N_K \gg K(m+1)$, this linear system is over-constrained for $\mathbf{c}$. We solve for $\mathbf{c}$ by orthogonal-triangular (QR) decomposition, which is numerically more accurate than using the pseudo-inverse of $\tilde{X}$. Then, for a given $\mathbf{c}^*$, we set $\Delta t_s$ as the average of the residuals from the keypoints in trajectories belonging to sequence s,
$$\Delta t_s = -\frac{1}{N_s} (\tilde{T} - \tilde{X}\mathbf{c}^*)^\top \mathbf{I}_s, \quad (6.18)$$
where $\mathbf{I}_s$ is a binary indicator vector with an element equal to 1 if the corresponding row in $\tilde{T}$ comes from a trajectory in sequence s, and $N_s = \|\mathbf{I}_s\|_1$ is the count of such rows.
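The alternating estimation of Eqns. 6.16-6.18 can be sketched as follows for the t-x curve; the data layout, convergence test, and use of a generic least-squares solve (rather than an explicit QR factorization) are illustrative assumptions.

# Minimal sketch of the alternating estimation in Eqns. 6.16-6.18, for the
# t - x curve only. `tracks` is a list of tracks; each track is a list of
# (s, x, t) triples, where s is the sequence index and x, t are 1-D arrays of
# x-coordinates and time stamps. All names are illustrative assumptions.
import numpy as np

def estimate_offsets(tracks, num_seq, m=3, iters=50, tol=1e-3):
    dt = np.zeros(num_seq)
    for _ in range(iters):
        residuals = [[] for _ in range(num_seq)]
        for track in tracks:
            # Stack all points of this track, shifting each sequence by dt[s].
            X = np.vstack([np.vander(np.asarray(x, dtype=float), m + 1)
                           for s, x, t in track])
            T = np.concatenate([np.asarray(t, dtype=float) + dt[s]
                                for s, x, t in track])
            c, *_ = np.linalg.lstsq(X, T, rcond=None)    # per-track polynomial fit
            r = T - X @ c
            i = 0
            for s, x, t in track:
                residuals[s].extend(r[i:i + len(x)])     # collect residuals per sequence
                i += len(x)
        # Eqn. 6.18: update each sequence's shift by the mean residual.
        new_dt = np.array([dt[s] - (np.mean(residuals[s]) if residuals[s] else 0.0)
                           for s in range(num_seq)])
        if np.max(np.abs(new_dt - dt)) < tol:
            return new_dt
        dt = new_dt
    return dt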
Table 6.1: Temporal and spatial alignment error in seconds and pixels, respectively, for real (R) and synthetic (S) sequences.
Sequence ID | Camera baseline (m) | Temporal error (s) | Spatial error (pixel)
R1  | 1  | 0.13 | -
R2  | 3  | 0.07 | -
R3  | 1  | 0.07 | -
R4  | 1  | 0.13 | -
R5  | 1  | 0.07 | -
R6  | 1  | 0.07 | -
R7  | 1  | 0.10 | -
R8  | 1  | 0.03 | -
R9  | 5  | 0.07 | -
R10 | 10 | 0.07 | -
S1  | 0  | 0    | 2
S2  | 0  | 0.03 | 3
S3  | 0  | 0.07 | 7
S4  | 0  | 0.03 | 2
S5  | 0  | 0.07 | 2
6.3.3.2 Motion trajectory filtering
As mentioned before, not all trajectories result from object motion with a smooth path. In other words, some trajectories might be due to noise in the keypoint locations while the camera moves. So, before matching trajectories across sequences and collecting them into a track, we filter out the trajectories that cannot be well approximated by a smooth path, by fitting an order-m polynomial to the trajectory,
$$\mathbf{c}_k^{s*} = \arg\min_{\mathbf{c}_k^s} \left\| \tilde{\mathbf{t}}_k^s - \tilde{\mathbf{x}}_k^{s(m)} \mathbf{c}_k^s \right\|_2, \quad (6.19)$$
and thresholding the total fitting residual to remove non-smooth trajectories, i.e., $\frac{1}{l_k^s}\left\| \tilde{\mathbf{t}}_k^s - \tilde{\mathbf{x}}_k^{s(m)} \mathbf{c}_k^{s*} \right\|_1 < \tau_2$.
6.4 Experimental results
In this section, we present the experimental setup and both quantitative and qualitative results. Note that since NOS is a novel scenario for spatio-temporal alignment of sequences, there is no prior work for comparison. We set β = 100, α = 10³, m = 3 for the temporal curve fitting step, and τ1 = 0.03 and τ2 = 0.15 for trajectory filtering.
6.4.1 Dataset
Given that there is no public dataset for this new scenario, we collect a NOS dataset including ten real-world sequence sets and five synthetic sequence sets. The real sets are captured by two or three people using handheld smartphones, with the distance between the cameras, i.e., the baseline, as shown in Tab. 6.1. The synthetic sets provide sequences for which the ground truth is exactly known, and are created by taking a sequence and cropping out two spatio-temporal tubes from the 3D sequence volume. This emulates the case of independently panning cameras with almost identical optical centers. To simulate a freely panning camera and hand shake, the spatial region used for each tube at each frame has a fixed size of 640 × 360 pixels, but the region location has additive zero-mean Gaussian noise. Also, if the original video is stationary, the regions shift in the x-direction to create a pan-like effect.
Figure 6.6: Spatial alignment of non-overlapping sequences. Top to bottom: reconstructed backgrounds of two sequences with negligible overlap, extrapolated backgrounds, and aligned backgrounds with the trajectories of moving objects overlaid on the background.
6.4.2 Qualitative results
Figure 6.6 presents the reconstructed backgrounds along with the image extrapolation results. Further, it shows how the backgrounds are transformed so that the moving object trajectories have a smooth path. Figure 6.7 shows the alignment results for three sets of real sequences with some overall spatial overlap. Similarly, Figure 6.9 shows the alignment results for three sets of real sequences with no or minimal overall spatial overlap. For each set, two or three sample frames with moving objects are shown, at the time shift estimated by the proposed algorithm.
Figure 6.7: Each row shows spatio-temporal alignment results on a set of real NOS with some overall spatial overlap. For each sequence, input frames at the estimated time shift and trajectories of moving objects in the world coordinate are shown. The input frames are transformed to the world coordinate to make a composite image via alpha blending.
Figure 6.8: Results for two synthetic NOS from an accident footage (S2). (a) Trajectories of moving objects, (b) aligned input frames, (c) original frame from which the synthetic frames are cropped.
Figure 6.9: Each row shows spatio-temporal alignment results on a set of real NOS with no overall spatial overlap. For each sequence, input frames at the estimated time shift and trajectories of moving objects in the world coordinate are shown. The input frames are transformed to the world coordinate to make a composite image via alpha blending.
Also, keypoint trajectories from both sequences in the world coordinate after spatio-temporal alignment are shown. Trajectories of moving objects have considerable extent in the x-direction, whereas trajectories of stationary objects are roughly parallel to the t-axis. Finally, the two input frames are warped to the world coordinate to make a composite image. Although the input frames may not have direct overlap, the perceived continuity of the scene and the relative locations of the moving objects demonstrate the capabilities of the proposed algorithm and the application scenarios. Note that in all test sequences the cameras move freely and independently, as shown by the range of trajectories in the world coordinate. For the case of Fig. 6.9(a), the sequences are non-overlapping, but only one person is tracked moving across the FOVs. Thus, as shown in this figure, the spatial alignment has some error, which consequently affects the accuracy of the temporal alignment. Figure 6.8 presents a synthetic set where two sequences are created from a video of a car accident. The two cropped frames after spatio-temporal alignment are shown in a composite image and, for comparison, the corresponding frame from the original video is also shown, demonstrating the accuracy of the spatio-temporal alignment.
6.4.3 Quantitative results
To quantitatively evaluate the proposed algorithm, we compare the alignment errors with the ground truth. For the synthetic sets, the original video from which the synthetic sequences are cropped provides the ground truth locations of the center points of the cropped frames. We measure the spatial location error of each aligned frame w.r.t. the ground truth location and report the sum of absolute errors in the x and y directions, averaged over the sum of the lengths of the sequences, as the spatial alignment error. Also, since we create the synthetic sequences, the ground truth time shift is known. For the real sets, when the input frames do not have overlap, quantifying the spatial error is not feasible. For quantification of the temporal alignment, we manually align the sequences by relying on visual cues such as body pose, moving object location relative to background landmarks, and consistency of appearance of moving objects in the composite image. Table 6.1 provides the quantified temporal and spatial errors. As may be observed, temporal alignment works well even when the camera baseline distance increases, although the final consolidated result may suffer from parallax.
6.4.4 Computational cost
The main computational cost of the proposed algorithm comes from TRGMC. On average, for a 15-second video, we spend 450 seconds on TRGMC and background reconstruction, using a PC with an Intel i5-3470@3.2GHz CPU and 8GB RAM. Spatial alignment is independent of the sequence length and takes ∼162 seconds on average for NOS. Finally, temporal alignment takes about 13 seconds on average over the database.
6.4.5 Limitations
Violation of the assumptions, especially of the assumed existence of moving objects whose trajectories span the FOVs of multiple cameras, results in alignment failures. Furthermore, when relying on non-rigid or articulated moving objects for alignment, many keypoints are not tracked long enough due to changes of appearance, making alignment difficult.
Also, in this case, matching trajectories among different sequences is less reliable and error-prone. Since the algorithm is independent of the type of tracking involved, other tracking algorithms can be investigated in the future. Furthermore, the alignment of non-overlapping background images suffers from ambiguity and is error-prone, although the proposed algorithm makes use of the available cues to conduct this task.
6.5 Conclusions
We proposed an algorithm for spatio-temporal alignment of non-overlapping sequences (NOS) from freely panning cameras whose FOVs might not even observe some common regions over the progression of time. This new scenario of video alignment is useful for reconstructing events, incidents, or crime scenes from multiple amateur-captured sequences, or for creating panoramic videos from cooperative users with handheld cameras without the need for tripods. The spatial alignment of our algorithm relies on reconstructing the background for each sequence and aligning the backgrounds. When the backgrounds are non-overlapping, the spatial alignment uses clues from the smoothness of moving objects' paths and the coherent appearance of the backgrounds after image extrapolation. Smoothness of the trajectories of moving objects is also utilized as a clue for temporal alignment. Our experiments demonstrate the capabilities of the proposed method, despite the challenging scenario of NOS.
Chapter 7
Conclusion and Future Work
7.1 Conclusions and discussions
In this research, algorithms for global motion compensation are proposed to remove the effect of camera motion and help magnify the motions of interest in videos from moving cameras. In this regard, we proposed two robust global motion compensation algorithms, namely RGMC and TRGMC. RGMC delivers reliable results in the presence of predominant foreground and textureless or blurry background, enabling its application to real-world unconstrained videos. By foreground suppression, RGMC is able to tolerate the existence of large foreground and occlusion. Also, the proposed method successfully handles keypoint matching with a very low matching threshold, as required for GMC on low-texture backgrounds or poorly illuminated scenes. This is achieved by clustering motion vectors and analyzing each cluster to identify matches pertaining to the background. Further, a novel homography verification model is proposed to support RGMC. This model unifies the keypoint matching error and the consistency of the edges of images after transformation, and benefits from motion history gleaned from previous frames to ensure that, in the case of large foreground, the foreground motion is not compensated instead of the camera motion affecting the background. Extensive experiments and comparisons with ground truth obtained by manually matching the frames, as well as with baseline methods, demonstrate the superiority of RGMC.
Furthermore, we proposed a temporally robust global motion compensation (TRGMC) algorithm based on joint alignment (congealing) of frames, in contrast to the common sequential scheme. This is done by densely connecting keypoints throughout all the frames and iteratively applying a transformation to each frame such that the keypoints are spatially aligned. Despite complicated camera motions, TRGMC can remove intentional camera motion, such as panning, as well as unwanted motion due to vibration of handheld cameras.
The redundancy of information in the joint alignment of the stack of input images makes TRGMC capable of dealing with foreground motion and false matches without imposing further processing cost. Also, due to the joint alignment scheme, the existence of blurry, low-texture, or poorly illuminated frames in TRGMC will not lead to total failure of GMC for all the upcoming frames. Beyond the robustness gained, experiments demonstrate that TRGMC outperforms existing GMC methods in terms of accuracy.
It is worth noting that the enabling assumption of any global motion compensation algorithm relying on homography estimation is that the camera motion in the direction of the optical axis is negligible. For instance, TRGMC will not work properly on a video from the wearable camera of a pedestrian, since in the global coordinate the upcoming frames grow in size and cause computational and rendering problems. Similar to panorama images, the best results are achieved if the optical center of the camera has negligible movement during capturing, making a homography-based approximation of camera motion appropriate. However, if the optical center moves in the direction perpendicular to the optical axis (e.g., a camera following a swimmer), TRGMC still works well, but rendering the results in the form of a motion panorama will be degraded by the parallax effect.
Subsequently, we investigated the challenging problem of spatio-temporal alignment of multiple video sequences captured by freely panning handheld cameras. We identified and tackled a novel scenario of this problem, referred to as Non-Overlapping Sequences (NOS), and proposed a solution which is only feasible given the reliable global motion compensation and background reconstruction via TRGMC. NOS are captured by multiple freely panning handheld cameras whose fields of view might not even have any direct spatial overlap. However, over the progression of time, there are nearby regions in the scene that are observed by the cameras independently and probably at distinct time instants. In contrast to many existing works, this assumption is less restrictive than requiring a common region to be observed by the fields of view of different cameras over the progression of time, and obviously much less restrictive than the common requirement of direct spatial overlap between frames from different cameras. With the popularity of mobile capturing devices such as smartphones and wearable cameras, NOS arise when multiple cooperative users capture a dynamic scene, such as a public event, to create a panoramic video, or when it is desired to consolidate multiple footages of an incident or crime scene into a single video. The proposed method makes it possible to better reconstruct the events or crime scenes captured by amateur users and obtain a better understanding of the incident.
For this novel scenario, we first spatially align the sequences by reconstructing the background of each sequence using the TRGMC algorithm and then registering these backgrounds across the sequences, even if the backgrounds are not overlapping. To this end, the reconstructed background images are first extrapolated. Then, a cost function is defined and minimized such that, while the extrapolated backgrounds are aligned well, the trajectories of moving objects leaving the field of view of one camera and entering the field of view of another camera are spatially smooth.
Given the spatial alignment, and assuming smoothness of the trajectories of moving objects, such as cars or pedestrians, the sequences are temporally synchronized such that the trajectories of moving objects across sequences are consistent with the prediction of when a moving object leaving the field of view of one camera would appear in the field of view of another camera.
Finally, to develop algorithms for analyzing user-generated videos, unconstrained and representative datasets are of great significance. For this purpose, we collected a dataset of Sports Videos in the Wild (SVW), consisting of videos captured by users of a leading sports training smartphone app (Coach's Eye®) while practicing a sport or watching a game. The dataset contains 4000+ videos selected by reviewing ∼85,000 videos and consists of 30 sports categories and 44 actions. Videos of sports practice, which frequently happens outside the typical sports field, have huge intra-class variations due to background clutter, unrepresentative environments, the existence of different training equipment, and, most importantly, imperfect actions. On the other hand, video capturing with smartphones by ordinary people, in comparison to videos captured by a professional crew for broadcasting, leads to challenges due to camera vibration and motion, occlusion, viewpoint variation, and poor illumination. Given various manual labels, this dataset can be used for a wide range of computer vision applications, such as action recognition, action detection, genre categorization, and spatio-temporal alignment. For the sports genre categorization problem, we also design the evaluation protocol and evaluate three different methods to provide baselines for future works.
7.2 Future work
TRGMC is a powerful and useful algorithm which can be an essential module in applying many existing algorithms designed for static cameras to handheld and moving cameras. However, many applications require a high processing rate. In the future, we will further investigate fast algorithms for global motion compensation and how a fast GMC algorithm benefits other computer vision applications.
7.2.1 Speed-up TRGMC
One potential direction is to keep the alignment framework joint, but not as dense as TRGMC. More specifically, TRGMC has O(n²) complexity due to the dense linking of all n frames in the stack. Decreasing the number of links such that each frame is only linked to a constant number of other frames, for instance the two temporally neighboring frames and a few frames with a longer temporal distance to enforce temporal consistency, will improve the complexity to O(n). Selection of the best links can be cast as finding a minimum spanning tree in a weighted graph G = (V, E). Each frame in the alignment stack is a node of the graph G, and each edge of the graph connecting nodes i and j has a weight wij indicating the suitability of the matching pair (i, j) to be included in the joint alignment (lower weight meaning more suitable). To enforce long-term robustness of the alignment, including frames with a larger temporal distance is preferred. Thus, if the normalized temporal difference of frames i and j is denoted as tij, we define wij ∝ (1 − tij). Furthermore, frames with the largest spatial overlap are better matched with each other and are less affected by view angle changes. So, after the initialization step of TRGMC, which results in finding the approximate translation between the frames, it is possible to accumulate the sequential translations to obtain the translation between arbitrary frames i and j and find the normalized overlap area between these frames, denoted as aij. Thus, the weights might also account for the overlap area, as wij ∝ (1 − tij)(1 − aij). Figure 7.1 compares the links made by TRGMC with the links which might be made via the proposed speed-up scheme. Beyond the links of the minimum spanning tree, it is also beneficial to link the sequential frames to enforce both long-term and local temporal consistency.
Figure 7.1: Comparison of the links made by (a) TRGMC with (b) the links which might be made via the proposed speed-up scheme. Links chosen by the minimum spanning tree are shown in red and links between adjacent frames are shown in blue.
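A minimal sketch of this link-selection idea is given below, using SciPy's minimum spanning tree on the proposed weights; the matrix names `t_norm` and `a_norm` and the small epsilon guard are illustrative assumptions, not an implementation from this thesis.

# Minimal sketch of the proposed link selection for speeding up TRGMC: build a
# graph over the N frames with edge weights w_ij = (1 - t_ij)(1 - a_ij) (lower
# weight = more desirable link), keep the minimum-spanning-tree edges, and add
# the sequential links. `t_norm` and `a_norm` are assumed N x N arrays of
# normalized temporal distances and overlap areas.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def select_links(t_norm, a_norm):
    n = len(t_norm)
    w = (1.0 - t_norm) * (1.0 - a_norm)
    w = np.maximum(w, 1e-9)            # SciPy treats exact zeros as missing edges
    mst = minimum_spanning_tree(np.triu(w, k=1)).tocoo()
    links = {(int(i), int(j)) for i, j in zip(mst.row, mst.col)}
    links |= {(i, i + 1) for i in range(n - 1)}   # sequential links for local consistency
    return links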
7.2.2 Joint alignment and outlier rejection
In each keypoint matching stage between two given frames, TRGMC performs outlier rejection via the vector field consensus method of Ma et al. [70]. This operation is repeated frequently and takes roughly 10 times as long as the keypoint matching step itself. So, one potential direction to improve the efficiency is to embed outlier detection and rejection in the joint alignment formulation via latent variables or appropriate weighting of each link. For instance, at each iteration of the algorithm in which the homographies are updated to decrease the alignment error, nonconforming links which do not connect background keypoints, and thus have increased error after the homography update, may be identified.
APPENDICES
Appendix A
Sports Videos in the Wild (SVW): A Video Dataset for Sports Analysis
Considering the enormous creation rate of user-generated videos on websites like YouTube, there is an immediate need for automatic categorization, recognition, and analysis of videos. To develop algorithms for analyzing user-generated videos, unconstrained and representative datasets are of great significance. For this purpose, we collected a dataset of Sports Videos in the Wild (SVW), consisting of videos captured by users of a leading sports training smartphone app (Coach's Eye®) while practicing a sport or watching a game. The dataset contains 4000+ videos selected by reviewing ∼85,000 videos and consists of 30 sports categories and 44 actions. Videos of sports practice, which frequently happens outside the typical sports field, have huge intra-class variations due to background clutter, unrepresentative environments, the existence of different training equipment, and, most importantly, imperfect actions. On the other hand, video capturing with smartphones by ordinary people, in comparison to videos captured by a professional crew for broadcasting, leads to challenges due to camera vibration and motion, occlusion, viewpoint variation, and poor illumination. Given various manual labels, this dataset can be used for a wide range of computer vision applications, such as action recognition, action detection, genre categorization, and spatio-temporal alignment. For the sports genre categorization problem, we design the evaluation protocol and evaluate three different methods to provide baselines for future works.
A.1 Introduction
The amount of digital video being created is increasing exponentially, e.g., YouTube has reached an upload rate of 100 hours of video per minute. A great deal of this growth is due to the tremendous popularity of smartphones and ubiquitous Internet access. This means that amateur-user-generated videos form the new trend in content generation.
Thus, there is an immediate need for robust algorithms to automatically analyze and retrieve videos. Many computer vision problems are data-driven, and the existence of representative and realistic datasets is necessary for developing robust algorithms. Therefore, there has been a trend from research on controlled datasets toward unconstrained datasets. For instance, recent face recognition research focuses on datasets like Labeled Faces in the Wild (LFW) [42] rather than controlled datasets like FERET [78]. Similarly, for human action recognition, datasets with less controlled videos, e.g., Hollywood2 [71], HMDB [49], and UCF101 [100], are gaining popularity compared with staged datasets like KTH [93] or Weizmann [13]. While these datasets ([49, 71, 100]) are from YouTube videos and movies and thus have unconstrained environments and actions relative to staged datasets, many of the videos are captured professionally. Therefore, in aspects like camera vibration, view angle variation, and illumination, they are bound to common practices of filmmaking. On the other hand, specifically for sports videos, most videos in public datasets are representative of successful completions of the actions, which may not truly reflect the highly complex and diverse real-world sports activities. Finally, for sports videos, due to the strong correlation of background and actions in existing datasets, the state-of-the-art performance on genre categorization is very high.
Given the explosion of user-generated videos and the lack of real-world datasets for the research community, we present a highly unconstrained dataset of sports videos, called Sports Videos in the Wild (SVW). SVW is comprised of videos captured solely with smartphones by users of the Coach's Eye® smartphone app, a leading app for sports training developed by TechSmith corporation. The app allows users to conveniently capture videos whenever they practice a sport or watch a game. Fig. A.1 shows sample frames from different categories of SVW. Being captured by smartphone and by ordinary people, along with the fact that many videos are of practices of amateurs, not professional athletes, makes SVW the most unconstrained dataset of sports and action videos.
Figure A.1: Sample frames from all 30 sports categories of SVW: 1-Archery, 2-Baseball, 3-Basketball, 4-BMX, 5-Bowling, 6-Boxing, 7-Cheerleading, 8-Discus throw, 9-Diving, 10-Football, 11-Golf, 12-Gymnastics, 13-Hammer throw, 14-High jump, 15-Hockey, 16-Hurdling, 17-Javelin, 18-Long jump, 19-Pole vault, 20-Rowing, 21-Running, 22-Shot put, 23-Skating, 24-Skiing, 25-Soccer, 26-Swimming, 27-Tennis, 28-Volleyball, 29-Weight lifting, 30-Wrestling.
SVW is annotated to serve multiple purposes. For action recognition, videos are labeled with 44 different actions and the timespan of each action. To make the dataset appropriate for action detection, videos are not trimmed around each action; instead, time stamps are provided. In addition, more than 50% of the videos are annotated with spatio-temporal bounding boxes around each and every action in the video. For sports genre categorization, each video is labeled with the generic name of the sport being practiced, resulting in 30 sports categories. For the sports genre categorization problem, we design the evaluation protocol and compare the performance of three algorithms on the proposed dataset as baselines for future research. First, the performance of the state-of-the-art motion-based dense trajectories algorithm [114] is reported.
Second, a purely context-based algorithm describing videos with SIFT features [66] is presented. Finally, experiments using a motion-assisted context-based algorithm are conducted. All data, including the dataset, labels, evaluation protocol, and experimental results, will be publicly available to the research community for future research (http://cvlab.cse.msu.edu/svw-download.html).
A.2 Related Work
Table A.1 summarizes different aspects of the most popular action recognition (AR) datasets. To the best of our knowledge, there is no publicly available dataset for sports genre categorization. Among existing datasets, HMDB [49] and UCF101 [100] are the most challenging ones in terms of having unconstrained videos. The KTH [93] and Weizmann [13] datasets contain simple actions, and their AR accuracies are reported to be above 90% [49]. IXMAS [121] contains staged actions captured by 5 calibrated cameras, where an AR accuracy of 93.5% is reported in [116]. UCF Sports [102] and Olympic [75] are the only datasets that cover just sports activities. While the environment is not controlled, the videos are captured by a professional crew, the actions are performed by professional athletes, and the background is restricted to official sports fields. As noted in [49], the actions in these datasets are highly distinguishable from shape cues alone. For Olympic, an accuracy of 91.1% is reported in [117]. Having a limited number of categories and distinct activities in each category, a recognition rate of 98% is reported in [49] for UCF Sports using the information from static joint locations alone. The Hollywood2 [71] dataset is gathered from 69 movies and is labeled for both action recognition and scene understanding. Being selected from movies, it contains unconstrained environments and actions while benefiting from professional capturing. Its main restrictions include the limited number of actions and the fact that clips extracted from the same movie share similar scenes. In [117], an AR accuracy of 64.3% is reported for Hollywood2. For UCF50 [83], Kuehne et al. [49] suggest that low-level features are as predictive as mid-level features, and Wang et al. [117] report a 91.2% AR accuracy. As an extension of UCF50, UCF101 has 101 categories and is the largest AR dataset available [100]. Being collected from YouTube, the actions are fairly unconstrained, but no comment can be made about the capturing process. Karpathy et al. [46] report a 66% AR accuracy for UCF101 (80% for the sports group). Probably due to the low resolution of the source videos, all clips are normalized to the relatively low resolution of 320 × 240. At a mean clip length of 7.2 seconds, UCF101 is fairly short compared to SVW, making it less suitable for action detection problems. HMDB [49] is collected by looking for non-ambiguous human actions in Internet videos and movies. As a quality standard, the selection of videos has been constrained to having a single action per clip, and 40% of the clips are not affected by camera motion. The dataset is prepared in two versions, original videos and stabilized videos, and good performance is reported for stabilization. In [117], an accuracy of 57.2% is reported for HMDB. HMDB is very challenging due to not only the unconstrainedness of the dataset, but also having multiple shots in a single clip, where both factors contribute to the low AR accuracy.
Although existing datasets have some level of unconstrained actions and environments, there is still more complexity in real-world videos that needs to be represented in research datasets. Specifically for sports videos, current datasets do not provide highly unconstrained conditions. For UCF101, one of the most challenging datasets, sports videos achieve the highest recognition rate ([46, 100]) among the different types of videos. This is claimed to be due to the distinctiveness of sports motions and the less cluttered background in official sports fields compared to other types of actions, which does not hold for sports in the wild. Considering the high performances reported for UCF Sports, Olympic, and the sports groups of UCF101, SVW specifically fills the research gap for analyzing challenging sports videos. On the other hand, uploading a video to YouTube implies that the action of interest has been successfully performed and completed in the video. But for a completely unconstrained video, there might be failure cases (e.g., batting practice). In addition, unlike UCF101 or Hollywood2, in SVW no two videos are trimmed from a single footage captured by users, which keeps the variance of the actions, environments, and shooting conditions in the dataset as high as possible. Furthermore, due to the highly unconstrained environment and illumination conditions, as well as a high rate of scene occlusion by people, video stabilization of SVW is very challenging, and our experiments show a high failure rate of stabilization using the common RANSAC algorithm [35]. Finally, the video resolution and clip length of SVW are larger than in all the current datasets, and SVW includes both landscape and portrait orientations of videos.
Table A.1: Comparison of multiple datasets for action recognition (AR), scene understanding (SU), and genre categorization (GC).
Dataset | Purpose | Categ. # | Clip # | Avg. length | Unconst. actions | Unconst. capturing | Camera vibration | Orient. | Sources
KTH [93] | AR | 6 | 100 | NA | No | No | No | Lands. | Staged
Weizmann [13] | AR | 9 | 9 | NA | No | No | No | Lands. | Staged
IXMAS [121] | AR | 11 | 30 | NA | No | No | No | Lands. | Staged
UCF Sports [102] | AR | 9 | 14+ | NA | Yes | No | No | Lands. | Broadcast TV
Olympic [75] | AR | 16 | 50 | NA | Yes | No | No | Lands. | YouTube
Hollywood2 [71] | AR, SU | 12, 10 | 61+, 62+ | NA | Yes | No | No | Lands. | Movies
UCF50 [83] | AR | 50 | 100+ | NA | Yes | No | Slight | Lands. | YouTube
HMDB [49] | AR | 51 | 101+ | NA | Yes | No | Slight | Lands. | Movies & Internet
UCF101 [100] | AR | 101 | 100+ | 7.2 | Yes | No | Slight | Lands. | YouTube
SVW | AR, GC | 44, 30 | 50+, 110+ | 11.6 | Yes | Yes | Yes | Lands. & Port. | Smartphone
A.3 Sports Videos in the Wild (SVW) Dataset
A.3.1 Dataset details and statistics
Dataset collection SVW is selected from the videos captured by ordinary users of the Coach's Eye® smartphone sports app developed by TechSmith corporation, when users practice a sport or watch a game. The users can review the videos and compare them side by side with those of coaches or professional athletes. A user may also upload the videos to the app server for other users to review and comment on his or her sports training progress. At the time of writing this paper, an average of 4 videos per minute are being uploaded to the app server by users, and among 700,000 uploaded videos, users have marked ∼418,000 as publicly usable. Due to the highly nonuniform distribution of sports categories, 85,000 videos from the public set have been reviewed and labeled to collect enough videos for 30 sports categories and 44 action categories, with at least 110 and 50 videos per category, respectively.
Challenges of SVW  Compared to broadcast videos, sports videos in the wild pose many unique challenges for visual analysis, due to both the imperfect practice of amateur players and unprofessional capturing by amateur users. Firstly, the static image context is less discriminative for categorization. For example, in a video of a tennis forehand drill (Fig. A.2 (a)), no assumption can be made about the existence of the racquet (and, in some cases, the tennis court). The only reliable clue may be the unique motion characteristics of the hands. Secondly, in these videos, the presence of training equipment is more likely than in broadcast videos (Fig. A.2 (b)). Cluttered backgrounds and ordinary environments also cause difficulties in unconstrained sports videos; in many SVW videos the sport is practiced inside the house, in the garage, or in the backyard (Fig. A.2 (b)). Thirdly, unprofessional capturing by amateur users introduces additional challenges such as extreme camera vibration, improper camera movement, occlusion from the audience, judges, and fences due to improper camera location, and uncommon view angles (Fig. A.2 (c)). Finally, for amateur videos, it is more probable to have multiple activities in a single video (Fig. A.2 (d)).

Figure A.2: SVW challenges: (a) related equipment does not exist (Tennis category), (b) background is cluttered and uncorrelated with the sport (Baseball category), (c) uncommon camera angles increase the intra-class variations (Archery category), (d) multiple sports co-exist (Hurdling category; 1: Hurdling, 2: Long jump, 3: Cycling).

It is important to note that, unlike other widely used action recognition datasets, multiple actions defined in SVW may come from a single sport (see Fig. A.3). In other words, while the environment is quite similar for these subsets of actions, the movements are completely different. This introduces further challenges in the visual analysis of sports videos in the wild for the purpose of action recognition. It also raises difficulties for genre categorization: each sport category has huge intra-class variation because it contains multiple actions that can appear at any timespan of the video.

Figure A.3: Annotated action categories ([343, 359, Forearm], [380, 400, Set], [438, 454, Spike]) within a video from the Volleyball genre category.

Since distinct actions from the same sport genre may share a common field, visual appearance alone is not enough for action recognition in SVW.

Dataset labelling  Videos are manually labeled in a two-round scheme. First, for each clip, 6 frames uniformly sampled across the video length constitute a montage, which is saved as an image. A GUI equipped with a button for each category shows the saved montage and records the button pressed by the labeler. In the next round, all labeled clips are reviewed one by one. Clips over 1 minute long are trimmed to loosely cover representative motions, but not precisely around the action of interest, so that the dataset is also suitable for action detection. To prepare SVW for action recognition, at least 50% of the videos are reviewed closely to annotate all pre-defined actions within a clip and their corresponding timespans. The same videos are also annotated with bounding boxes around each action for the action detection application. Fig. A.3 shows how different actions within a clip are annotated with labels and timestamps.
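For reference, the first-round montage described above could be assembled roughly as follows, assuming OpenCV and NumPy; the 2 × 3 grid layout, thumbnail width, and file names are illustrative choices, not details of the actual labeling GUI.

import cv2
import numpy as np

def make_montage(video_path, n_frames=6, cols=3, thumb_width=320):
    # Sample n_frames uniformly across the clip and tile them into one image.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, n_frames).astype(int)
    thumbs = []
    for idx in indices:                      # assumes all sampled frames decode successfully
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        h, w = frame.shape[:2]
        scale = thumb_width / float(w)
        thumbs.append(cv2.resize(frame, (thumb_width, int(h * scale))))
    cap.release()
    rows = [np.hstack(thumbs[i:i + cols]) for i in range(0, len(thumbs), cols)]
    return np.vstack(rows)

montage = make_montage("clip_0001.mp4")      # hypothetical file name
cv2.imwrite("clip_0001_montage.jpg", montage)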
For each video clip, we also label various meta tags. Fig. A.4 shows the distribution of the number of participants in the videos, the commonality/uniqueness of the action environment, and the camera view angles for the 30 sports categories. The meta tags reveal that 19% of SVW videos are affected by considerable camera vibration, and that three categories have the highest rates of training equipment usage: Running (9%), Weight lifting (9%), and Boxing (4%). Multiple activities in a single video are more common for categories such as Hurdling, High jump, Running, Weight lifting, and Diving.

Figure A.4: Distribution of (a) the number of participants in videos, (b) aspects of the action field, and (c) camera view angles, in 30 categories. An irrelevant field is one from whose appearance the sports category cannot be deduced (e.g., practicing in the backyard). A shared field refers to the condition in which, from the field appearance alone, more than one sports category might be inferred (e.g., track and field sports). A unique field is one from whose context alone the corresponding sports category can be conjectured (e.g., bowling lanes).

Spatial resolution normalization  The resolution of the original videos varies from 480 × 272 to 1280 × 720 (irrespective of video orientation), with 640 × 360 being the most common size. Since, for some analysis algorithms, variation in video size might result in confusion of scene scales, a normalized version of the dataset is provided along with the original one. Because the dataset contains both landscape and portrait orientations, normalized clips have a maximum dimension (width or height) of 480 pixels.

Evaluation protocol  In line with UCF101 and HMDB, three splits of 70% training and 30% testing are generated for the genre categorization application of SVW. We designate the splits aiming to evenly distribute the different actions, camera view angles, and field characteristics over the splits. Genre categorization accuracy is used as the performance metric and is defined as the fraction of testing videos whose genres are correctly classified.
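As a concrete reference for the protocol, a minimal sketch of computing the metric over the three test splits is given below; the split-file layout (one "<video_name> <genre>" pair per line) and the file names are hypothetical, not part of the SVW release.

def genre_accuracy(predictions, labels):
    # Fraction of test videos whose predicted genre matches the ground-truth genre.
    correct = sum(1 for video in labels if predictions.get(video) == labels[video])
    return correct / float(len(labels))

def mean_accuracy_over_splits(split_files, predictions):
    # Average the per-split accuracy over the three 70%/30% splits.
    accuracies = []
    for path in split_files:                  # e.g. ["test_split1.txt", ...] (hypothetical)
        labels = {}
        with open(path) as f:
            for line in f:                    # assumed format: "<video_name> <genre>"
                name, genre = line.split()
                labels[name] = genre
        accuracies.append(genre_accuracy(predictions, labels))
    return sum(accuracies) / len(accuracies)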
A.3.2 Potential applications of SVW

Action recognition  Due to the huge amount of video content available online and the desire for content understanding, a great deal of effort has been focused on the problem of action recognition from videos [45, 51, 52, 71, 81, 122]. Inherently, for sports videos, action recognition is a subset of the genre categorization problem, i.e., for the former, labels for a single action are available, but for the latter, a group of different actions within each sport are all labeled with the genre of that sport, resulting in higher intra-class variations.

Action detection  Although there has been great emphasis on action recognition, the action detection problem has not been extensively studied. Action detection, by itself and as part of recognition-by-detection systems [85], is an important problem to be tackled. Especially in real-world videos, the actions of interest may cover a relatively short period of a video, and it is important to be able to detect these actions. Existing approaches use rather simple datasets with short videos [47, 125] or proprietary datasets [25]. SVW enables researchers to push the limits of action detection toward more realistic videos.

Genre categorization  Sports genre categorization has been widely studied for broadcast TV videos [29, 73, 118, 126, 129, 130]. In these works, it is assumed that the sports occur in a sports arena (implicitly assuming the existence of specific equipment and field lining) and are captured by a professional TV broadcasting crew. Low-level features such as color, motion, and histograms of edge directions are used for categorization. Our experiments show that this type of approach does not perform well on sports in the wild. On the other hand, the authors of [87] report superior performance of the dense trajectories method [114] for genre categorization of unconstrained proprietary videos; this work aims to provide a dataset of such videos. The well-known sports-only datasets, UCF Sports [102] and Olympic [75], include specific actions rather than generic sports categories, and accuracies of ∼90% have been reported on them [116, 117] (the method in [116] achieves ∼62% accuracy on SVW). Thus, a challenging video dataset for this application is highly desirable.

Spatio-temporal alignment  Given two video sequences of the same action, spatio-temporal alignment is defined as finding the spatial and temporal coordinate transformations that map the actions of interest in one video to those of the other [4, 111]. For sports videos, spatio-temporal alignment of actions enables effective comparison of actions performed by different people. This is especially useful for sports training and grading. Furthermore, the joint alignment of multiple videos, from either one user over time or a diverse set of users, allows us to study the temporal evolution and inter-subject variations of a particular action, which are novel research problems in themselves. With unconstrained videos in which the action of interest may happen in any temporal segment of the video, SVW serves as a realistic and challenging dataset for the alignment problem.

Table A.2: Performances (genre categorization accuracy) of different baseline algorithms on SVW.

Method                  | Performance
Motion-based            | 61.53%
Context-based           | 37.08%
Motion-assisted context | 39.13%

A.4 Baseline Experiments

In this section, we present the performance of three different algorithms for the genre categorization problem on SVW. The first algorithm summarizes features extracted from dense trajectories [114] using the widely used Bag of Words (BoW) approach [20]. The second algorithm analyzes the context of video frames using BoW on SIFT features [66]. The third one, a motion-assisted context-based algorithm, segments the moving and stationary pixels using trajectory information and then analyzes the appearance of these two groups of pixels separately. In the interest of computational cost and memory, for all methods a two-level bottom-up codebook generation scheme is used [130]. At the first layer, for each class, a set of codewords is generated using K-means clustering. At the second layer, the codewords of all classes are aggregated and, by another round of clustering, the final codewords are obtained. We use a Support Vector Machine (SVM) as the classifier for all the algorithms.
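The following is a minimal sketch of this two-level codebook construction and the BoW encoding, assuming scikit-learn and NumPy; the per-class and final codebook sizes are illustrative choices, not the values used in these experiments.

import numpy as np
from sklearn.cluster import KMeans

def two_level_codebook(descriptors_per_class, k_class=200, k_final=1000):
    # First layer: cluster the local descriptors of each class separately.
    per_class_words = [
        KMeans(n_clusters=k_class, n_init=10).fit(X).cluster_centers_
        for X in descriptors_per_class            # list of (N_i, D) arrays, one per class
    ]
    # Second layer: pool all per-class codewords and cluster them again.
    pooled = np.vstack(per_class_words)
    return KMeans(n_clusters=k_final, n_init=10).fit(pooled).cluster_centers_

def bow_histogram(descriptors, codebook):
    # Hard-assign each descriptor to its nearest codeword and L1-normalize the counts.
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

A video is then represented by the BoW histogram of its descriptors and classified with an SVM (e.g., sklearn.svm.LinearSVC) trained on the histograms of the training split.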
Table A.2 summarizes the genre categorization accuracies of the baseline algorithms. Unlike UCF Sports, which is conjectured in [49] to be equally predictable from contextual or motion information because many sports in UCF Sports are location-specific, and similarly the Olympic dataset, sports videos in the wild are better recognized using motion features, due to the existence of many practice videos in environments uncorrelated with the activities. However, the accuracy achieved by the motion-based algorithm is still relatively low, due to the aforementioned challenges of SVW. Fig. A.5 shows the confusion matrices of the context-based and motion-based algorithms. The contrast of the off-diagonal elements indicates the potential benefit of fusing these two algorithms. More detail on all three algorithms follows.

Figure A.5: Confusion matrices of the (a) context-based and (b) motion-based categorization algorithms.

A.4.1 Motion-based algorithm

For the motion-based algorithm, the state-of-the-art dense trajectories approach is used [114]. The BoW approach on top of dense-trajectory-based features has been reported to outperform space-time interest points on various datasets [116]. This approach consists of three main steps: video stabilization, trajectory extraction and description, and BoW representation of the trajectory information. We use the implementation of [117] for the second step.

A.4.1.1 Video stabilization

Frame-by-frame motion stabilization is achieved by matching interest points on consecutive frames and applying RANSAC [35] to obtain the affine transformation between the frames. Due to issues such as poor illumination, moving subjects and audience, and uniform or non-rigid backgrounds (such as water), the failure rate of video stabilization is quite high, which deteriorates the overall performance of the motion-based algorithm.

A.4.1.2 Dense trajectories

As proposed in [114], dense trajectories are extracted at multiple spatial scales. Each point $p_t = (x_t, y_t)$ at frame $t$ is tracked to the next frame $t+1$ by applying median filtering to a dense optical flow field $\omega = (u_t, v_t)$,
$$p_{t+1} = (x_{t+1}, y_{t+1}) = (x_t, y_t) + (K * \omega)|_{(\bar{x}_t, \bar{y}_t)},$$
where $K$ is the median filtering kernel and $(\bar{x}_t, \bar{y}_t)$ is the rounded position of $(x_t, y_t)$. Trajectories are started from sample points on a grid spaced by $W$ pixels (set to 5). The length of each trajectory is limited to $L$ frames (set to 15); after reaching this length, the trajectory is removed from the tracking process and new sample points are tracked.
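The tracking step above could be sketched as follows, using OpenCV's Farneback flow as a stand-in for the dense flow field of [114]; the Farneback parameters and the 3 × 3 median kernel are illustrative assumptions, not the settings of the original implementation.

import cv2
import numpy as np

def track_points(prev_gray, curr_gray, points):
    # Dense flow between consecutive grayscale frames (Farneback as a stand-in).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Median-filter each flow component (the kernel K in the tracking equation).
    u = cv2.medianBlur(np.ascontiguousarray(flow[..., 0]), 3)
    v = cv2.medianBlur(np.ascontiguousarray(flow[..., 1]), 3)
    tracked = []
    for (x, y) in points:
        xi, yi = int(round(x)), int(round(y))            # rounded position of p_t
        if 0 <= yi < u.shape[0] and 0 <= xi < u.shape[1]:
            tracked.append((x + u[yi, xi], y + v[yi, xi]))   # p_{t+1} = p_t + median-filtered flow
    return tracked

def grid_points(frame_shape, spacing=5):
    # Dense sampling grid (spacing W = 5) from which new trajectories are started.
    ys, xs = np.mgrid[0:frame_shape[0]:spacing, 0:frame_shape[1]:spacing]
    return list(zip(xs.ravel().astype(float), ys.ravel().astype(float)))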
A.4.1.3 Trajectory descriptors

The shape of the trajectories can be used as a representative feature, especially for sports analysis. In [114], the trajectory displacements, $\Delta p_t = (x_{t+1} - x_t, y_{t+1} - y_t)$, over $L$ consecutive frames are concatenated into a vector, $\hat{s} = [\Delta p_t, \ldots, \Delta p_{t+L-1}]$, which is further normalized into the trajectory descriptor $s = \hat{s} / \sum_{j=t}^{t+L-1} \|\Delta p_j\|$. Similar to [114], the video volume in a neighborhood of each trajectory is aligned, and the resulting volume is described using the Motion Boundary Histogram (MBH) [23] and the Histogram of Oriented Gradients (HOG) [22]. Table A.3 shows the performance of different combinations of descriptors. The highest accuracy of 61.53% is achieved by combining the MBH, HOG, and s descriptors.

Table A.3: Performance of different combinations of trajectory descriptors on SVW.

Descriptors   | Performance
s             | 44.65%
HOG           | 56.45%
HOG + s       | 60.43%
MBH           | 58.12%
MBH + HOG     | 60.69%
MBH + HOG + s | 61.53%

A.4.2 Context-based algorithm

We follow the algorithm in [130] for analyzing the videos using only static contextual information. In this algorithm, we sample one frame per one-second length of video and use the BoW representation of SIFT descriptors for categorization. Since no inter-frame information is utilized, video stabilization is not required. For dictionary learning, we use 10 videos per category. In our experiments, the codebook size is set to 4000. As shown in Table A.2, a categorization accuracy of 37.08% is achieved using this method.

A.4.3 Motion-assisted context algorithm

Following the idea in [83], we augment the context-based method with information about moving and stationary pixels. This can be loosely considered foreground-background segmentation using motion information. For this purpose, the mean position of a trajectory in the stabilized video, for which the standard deviation of the trajectory points is beyond a threshold, is considered a moving point. The decision about a moving point is propagated to the 15 frames before and after the frame on which the trajectory ends. Having the groups of moving and stationary pixels, SIFT descriptors and their BoW representations are calculated for the two groups separately, and the resulting histograms are concatenated to represent the video. This algorithm achieves an accuracy of 39.13% (Table A.2), which is slightly better than the algorithm using context information only.
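A minimal sketch of this moving/stationary labeling is given below; the standard-deviation threshold and the trajectory data layout are illustrative assumptions, and the subsequent step of computing and concatenating separate SIFT BoW histograms for the two pixel groups is only indicated in the final comment.

import numpy as np

def moving_points(trajectories, std_thresh=1.5, spread=15):
    """trajectories: list of (end_frame, points), where points is an (L, 2) array of
    trajectory positions in the stabilized video (layout assumed for this sketch)."""
    moving = {}                                   # frame index -> list of (x, y) moving points
    for end_frame, pts in trajectories:
        pts = np.asarray(pts, dtype=float)
        if pts.std(axis=0).max() > std_thresh:    # enough spread along x or y => moving
            mean_xy = tuple(pts.mean(axis=0))
            # Propagate the decision to the 15 frames before and after the end frame.
            for f in range(end_frame - spread, end_frame + spread + 1):
                moving.setdefault(f, []).append(mean_xy)
    return moving

# SIFT BoW histograms are then computed separately over the moving and stationary
# pixels of each sampled frame and concatenated, e.g. np.concatenate([h_moving, h_static]).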
A.4.4 Discussion

Comparing the ∼62% accuracy achieved by the motion-based dense trajectories algorithm on SVW with the ∼91% accuracy obtained by applying the same method to the UCF50 and Olympic Sports datasets [117] demonstrates that SVW is a very challenging sports video dataset. In addition, comparing the accuracies obtained by the motion-based and context-based algorithms (∼62% vs. ∼37%) reveals that in SVW, motion is the main cue for categorization and action recognition. While the motion-assisted context-based algorithm yields ∼39% accuracy on SVW, a similar method achieves ∼67% on UCF50, as reported in [83]. This essentially suggests that background and equipment appearance in SVW is not as informative as in the UCF50 dataset. In [46], 80% accuracy is reported for the sports group of the UCF101 dataset. Considering all these results, we may conclude that although sports videos feature unique movements, the analysis of truly unconstrained videos is still challenging and needs further research.

A.5 Conclusions

To advance computer vision research, and to push the limits of various video analysis problems toward the more realistic and unconstrained scenarios of the real world, representative and unconstrained datasets are essential. In this regard, we introduced Sports Videos in the Wild (SVW) as a very challenging real-world dataset of sports videos, available for genre categorization, action detection, action recognition, and spatio-temporal alignment. We evaluated three different baseline algorithms for sports genre categorization. Experimental results suggest that, due to the uncorrelatedness of environment and actions in SVW, as well as the amateur capturing of the videos, the presented SVW dataset is indeed the most challenging sports and action dataset available.

Appendix B Publications

• S. Safdarnejad and X. Liu, “Spatio-temporal alignment of minimally overlapping sequences from independently moving cameras”, in Proc. of Computer Vision and Pattern Recognition Conf. (CVPR) 2017.
• S. Safdarnejad, Y. Atoum, and X. Liu, “Temporally Robust Global Motion Compensation by Keypoint-based Congealing”, in Proc. of European Conf. on Computer Vision (ECCV) 2016.
• S. Safdarnejad, X. Liu, and L. Udpa, “Robust Global Motion Compensation in Presence of Predominant Foreground”, in Proc. of British Machine Vision Conf. (BMVC) 2015, Swansea, UK, 2015. (Best poster award)
• S. Safdarnejad, X. Liu, and L. Udpa, “Sports Videos in the Wild (SVW): A Video Dataset for Sports Analysis”, in Proc. of IEEE Conf. on Automatic Face and Gesture Recognition (FG) 2015, Ljubljana, Slovenia, 2015.
• P. Banerjee, S. Safdarnejad, L. Udpa, S. S. Udpa, NDT Techniques: Signal and Image Processing, in Reference Module in Materials Science and Materials Engineering, Elsevier, 2015.
• S. Safdarnejad, Z. Su, C. Ye, L. Udpa, S. Udpa, “Analysis of EC-GMR Data for Detection of Cracks under Fasteners (CUF)”, Int. Journal of Applied Electromag. and Mechanics, 2015.
• S. Safdarnejad, X. Liu, and L. Udpa, “Genre Categorization of Amateur Sports Videos in the Wild”, in Proc. of Int. Conf. of Image Processing (ICIP) 2014, Paris, France, 2014.
• S. Safdarnejad, O. Karpenko, L. Udpa, S. Udpa, “A Robust Multi-frequency Mixing Algorithm for Suppression of Rivet Signal in GMR Inspection of Riveted Structures”, in Proc. of Quantitative Nondestructive Evaluation (QNDE) 2015, Minneapolis, MN, 2015, AIP Publishing.
• P. Banerjee, S. Safdarnejad, L. Udpa, S. Udpa, “Ensemble of Classifiers for Confidence-rated Classification of NDE Signal”, in Proc. of Quantitative Nondestructive Evaluation (QNDE) 2015, Minneapolis, MN, 2015, AIP Publishing.
• Z. Su, A. Efremov, S. Safdarnejad, A. Tamburrino, L. Udpa, S. Udpa, “Optimization of Coil Design for Near Uniform Interrogating Field Generation”, in Proc. of Quantitative Nondestructive Evaluation (QNDE) 2015, Minneapolis, MN, 2015, AIP Publishing.
• P. Banerjee, S. Safdarnejad, L. Udpa, S. Udpa, “Investigation of a Comprehensive Confidence Measure in NDE”, in Proc. of Review of Progress in Quantitative Nondestructive Evaluation (QNDE) 2014, Boise, ID, 2014, AIP Publishing.
• O. Karpenko, S. Safdarnejad, G. Dib, L. Udpa, S. Udpa, A. Tamburrino, “Image Processing Algorithms for Automated Analysis of GMR Data from Inspection of Multilayer Structures”, in Proc. of Quantitative Nondestructive Evaluation (QNDE) 2014, Boise, ID, 2014, AIP Publishing.
• S. Safdarnejad, L. Udpa, S. Udpa, C. Buynak, G. Steffes, E. Lindgren, and J.
Knopp, “Statistical Algorithms for Eddy Current Signal and Noise Analysis”, in Proc. of Review of Progress in Quantitative Nondestructive Evaluation (QNDE) 2013, Baltimore, MD, 2013, AIP Publishing. 111 BIBLIOGRAPHY 112 BIBLIOGRAPHY [1] Anubhav Agarwal, CV Jawahar, and PJ Narayanan. A survey of planar homography estimation techniques. Centre for Visual Information Technology, Tech. Rep. IIIT/TR/2005/12, 2005. [2] Amit Aides, Tamar Avraham, and Yoav Y Schechner. Multiscale ultrawide foveated video extrapolation. In Computational Photography (ICCP), 2011 IEEE International Conference on, pages 1–8. IEEE, 2011. [3] Anton Andriyenko, Konrad Schindler, and Stefan Roth. Discrete-continuous optimization for multi-target tracking. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 1926–1933. IEEE, 2012. [4] Serge Ayer and Martin Vetterli. Method and system for combining video sequences with spatio-temporal alignment, 2001. US Patent 6,320,624. [5] Simon Baker and Iain Matthews. Lucas-kanade 20 years on: A unifying framework. International journal of computer vision, 56(3):221–255, 2004. [6] Adam Barclay and Hannes Kaufmann. FT-RANSAC: Towards robust multi-modal homography estimation. In Proc. IAPR Workshop PRRS, pages 1–4. IEEE, 2014. [7] Connelly Barnes, Eli Shechtman, Dan B Goldman, and Adam Finkelstein. The generalized patchmatch correspondence algorithm. In European Conference on Computer Vision, pages 29–43. Springer, 2010. [8] Adrien Bartoli, Navneet Dalal, Biswajit Bose, and Radu Horaud. From video sequences to motion panoramas. In Proc. Conf. Motion and Video Computing Workshops, pages 201–207. IEEE, 2002. [9] Adrien Bartoli, Navneet Dalal, and Radu Horaud. Motion panoramas. Computer Animation and Virtual Worlds, 15(5):501–517, 2004. [10] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In Proc. European Conf. Computer Vision (ECCV), pages 404–417. Springer, 2006. [11] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In Proc. European Conf. Computer Vision (ECCV), pages 404–417. Springer, 2006. 113 [12] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: the CLEAR MOT metrics. J. Image and Video Process., 2008:1, 2008. [13] Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, and Ronen Basri. Actions as space-time shapes. In Proc. Int. Conf. Computer Vision (ICCV), volume 2, pages 1395– 1402. IEEE, 2005. [14] Angelo Bosco, Arcangelo Bruna, Sebastiano Battiato, Giuseppe Bella, and Giovanni Puglisi. Digital video stabilization through curve warping techniques. IEEE Trans. Consumer Electronics, 54(2):220–224, 2008. [15] Darlan N Brito, Flávio LC Pádua, Rodrigo L Carceroni, and Guilherme AS Pereira. Synchronizing video cameras with non-overlapping fields of view. In XXI Brazilian Symp. Computer Graphics and Image Processing, pages 37–44. IEEE, 2008. [16] Yaron Caspi and Michal Irani. Aligning non-overlapping sequences. Int. J. Computer Vision, 48(1):39–51, 2002. [17] Ondˇrej Chum, Jiˇrí Matas, and Josef Kittler. Locally optimized RANSAC. Pattern Recognition, pages 236–243, 2003. [18] Mark Cox, Sridha Sridharan, Simon Lucey, and Jeffrey Cohn. Least squares congealing for unsupervised alignment of images. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008. [19] Mark Cox, Sridha Sridharan, Simon Lucey, and Jeffrey Cohn. Least-squares congealing for large numbers of images. In Proc. Int. Conf. 
Computer Vision (ICCV), pages 1949–1956. IEEE, 2009. [20] Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray. Visual categorization with bags of keypoints. In Workshop on statistical learning in computer vision, ECCV, volume 1, page 22, 2004. [21] Xinyi Cui, Qingshan Liu, and Dimitris Metaxas. Temporal spectral residual: fast motion saliency detection. In Proceedings of the 17th ACM international conference on Multimedia, pages 617–620. ACM, 2009. [22] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886– 893. IEEE, 2005. 114 [23] Navneet Dalal, Bill Triggs, and Cordelia Schmid. Human detection using oriented histograms of flow and appearance. In Proc. European Conf. Computer Vision (ECCV), pages 428–441. Springer, 2006. [24] Oscar Déniz, Gloria Bueno, E Bermejo, and Rahul Sukthankar. Fast and accurate global motion compensation. Pattern Recognition, 44(12):2887–2901, 2011. [25] Konstantinos G Derpanis, Mikhail Sizintsev, Kevin Cannons, and Richard P Wildes. Efficient action spotting based on a spacetime oriented structure representation. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 1990–1997. IEEE, 2010. [26] Caglayan Dicle, Octavia Camps, and Mario Sznaier. The way they move: tracking multiple targets with similar appearance. In Proc. Int. Conf. Computer Vision (ICCV), pages 2304– 2311. IEEE, 2013. [27] Ferran Diego, Daniel Ponsa, Joan Serrat, and Antonio M López. Video alignment for change detection. IEEE Trans. Image Proc., 20(7):1858–1869, 2011. [28] Ferran Diego, Joan Serrat, and Antonio M López. Joint spatio-temporal alignment of sequences. IEEE Trans. Multimedia, 15(6):1377–1387, 2013. [29] Ling-Yu Duan, Min Xu, Qi Tian, Chang-Sheng Xu, and Jesse S Jin. A unified framework for semantic shot classification in sports video. Multimedia, IEEE Transactions on, 7(6):1066– 1083, 2005. [30] Rahul Dutta, Bruce Draper, and J Ross Beveridge. Video alignment to a common reference. In IEEE Winter Conf. WACV, pages 808–815. IEEE, 2014. [31] Motaz El-Saban, Mostafa Izz, Ayman Kaheel, and Mahmoud Refaat. Improved optimal seam selection blending for fast video stitching of videos captured from freely moving devices. In Proc. Int. Conf. Image Processing (ICIP), pages 1481–1484. IEEE, 2011. [32] Sandro Esquivel, Felix Woelk, and Reinhard Koch. Calibration of a multi-camera rig from non-overlapping views. In Pattern Recognition, pages 82–91. Springer, 2007. [33] Georgios D Evangelidis and Christian Bauckhage. Efficient and robust alignment of unsynchronized video sequences. In Pattern Recognition, pages 286–295. Springer, 2011. [34] Georgios D Evangelidis and Christian Bauckhage. Efficient subframe video alignment using short descriptors. IEEE Trans. Pattern Anal. Mach. Intell., 35(10):2371–2386, 2013. 115 [35] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981. [36] Junhong Gao, Seon Joo Kim, and Michael S Brown. Constructing image panoramas using dual-homography warping. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 49–56. IEEE, 2011. [37] Tiago Gaspar, Paulo Oliveira, and Paolo Favaro. Synchronization of two independently moving cameras without feature correspondences. In Proc. European Conf. Computer Vision (ECCV), pages 189–204. 
Springer, 2014. [38] Michael L Gleicher and Feng Liu. Re-cinematography: Improving the camerawork of casual video. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 5(1):2, 2008. [39] Nils Hasler, Bodo Rosenhahn, Thorsten Thormahlen, Michael Wand, Jürgen Gall, and HansPeter Seidel. Markerless motion capture with unsynchronized moving cameras. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 224– 231. IEEE, 2009. [40] Yuwen He, Bo Feng, Shiqiang Yang, and Yuzhuo Zhong. Fast global motion estimation for global motion compensation coding. In Proc. IEEE Int. Symp. Circuits and Systems (ISCAS), volume 2, pages 233–236. IEEE, 2001. [41] Gary Huang, Marwan Mattar, Honglak Lee, and Erik G Learned-Miller. Learning to align from scratch. In Advances in Neural Information Processing Systems (NIPS), pages 764– 772, 2012. [42] Gary B Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, Technical Report 07-49, University of Massachusetts, Amherst, 2007. [43] Muhammad Twaha Ibrahim, Rehan Hafiz, Muhammad Murtaza Khan, Yongju Cho, and Jihun Cha. Automatic reference selection for parametric color correction schemes for panoramic video stitching. In Advances in Visual Computing, pages 492–501. Springer, 2012. [44] Wei Jiang and Jinwei Gu. Video stitching with spatial-temporal content-preserving warping. In Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), pages 42–48. IEEE, 2015. 116 [45] Yu-Gang Jiang, Qi Dai, Xiangyang Xue, Wei Liu, and Chong-Wah Ngo. Trajectory-based modeling of human actions with motion reference points. In Proc. European Conf. Computer Vision (ECCV), pages 425–438. Springer, 2012. [46] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2014. [47] Tae-Kyun Kim and Roberto Cipolla. Canonical correlation analysis of video volume tensors for action categorization and detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(8):1415–1428, 2009. [48] Hui Kong, Jean-Yves Audibert, and Jean Ponce. Detecting abandoned objects with a moving camera. IEEE Trans. Image Process., 19(8):2201–2210, 2010. [49] Hilde Kuehne, Hueihan Jhuang, Rainer Stiefelhagen, and Thomas Serre. HMDB51: A large video database for human motion recognition. In High Performance Comput. Sci. Eng., pages 571–582. Springer, 2013. [50] Jukka Lankinen and Joni-Kristian Kämäräinen. Local feature based unsupervised alignment of object class images. In Proc. British Mach. Vision Conf. (BMVC), volume 1, 2011. [51] Ivan Laptev. On space-time interest points. Int. J. Comput. Vision, 64(2-3):107–123, 2005. [52] Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic human actions from movies. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008. [53] Erik G Learned-Miller. Data driven image models through continuous joint alignment. IEEE Trans. Pattern Anal. Mach. Intell., 28(2):236–250, 2006. [54] Erik G Learned-Miller et al. ICA using spacings estimates of entropy. J. Machine Learning Research, 4:1271–1295, 2003. [55] José Lezama, Karteek Alahari, Josef Sivic, and Ivan Laptev. 
Track to the future: Spatiotemporal video segmentation with long-range motion cues. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR). IEEE, 2011. [56] Qian Li, Shifeng Chen, and Beiwei Zhang. Predictive video saliency detection. In Pattern Recognition, pages 178–185. Springer, 2012. 117 [57] Xiangru Li and Zhanyi Hu. Rejecting mismatches by correspondence function. Int. J. Comput. Vision, 89(1):1–17, 2010. [58] Yunpeng Li, Sing Bing Kang, Neel Joshi, Steve M Seitz, and Daniel P Huttenlocher. Generating sharp panoramas from motion-blurred videos. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 2424–2431. IEEE, 2010. [59] Chung-Ching Lin, Sharathchandra U Pankanti, Karthikeyan Natesan Ramamurthy, and Aleksandr Y Aravkin. Adaptive as-natural-as-possible image stitching. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 1155–1163. IEEE, 2015. [60] Wen-Yan Lin, Siying Liu, Yasuyuki Matsushita, Tian-Tsong Ng, and Loong-Fah Cheong. Smoothly varying affine stitching. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 345–352. IEEE, 2011. [61] Ce Liu, William T Freeman, and Edward H Adelson. Analysis of contour motions. In Advances in Neural Information Processing Systems (NIPS), pages 913–920, 2006. [62] Jingen Liu, Jiebo Luo, and Mubarak Shah. Recognizing realistic actions from videos in the wild. In Proc. IEEE Conf. CVPR, pages 1996–2003. IEEE, 2009. [63] Shuaicheng Liu, Lu Yuan, Ping Tan, and Jian Sun. Steadyflow: Spatially smooth optical flow for video stabilization. In Proc. IEEE Conf. CVPR, pages 4209–4216. IEEE, 2014. [64] Xiaoming Liu, Yan Tong, and Frederick W. Wheeler. Simultaneous alignment and clustering for an image ensemble. In Proc. Int. Conf. Computer Vision (ICCV), pages 1327–1334. IEEE, 2009. [65] David G Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Computer Vision, 60(2):91–110, 2004. [66] David G Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, 2004. [67] Cheng Lu and Mrinal Mandal. A robust technique for motion-based video sequences temporal alignment. IEEE Trans. Multimedia, 15(1):70–82, 2013. [68] Simon Lucey, Rajitha Navarathna, Ahmed Bilal Ashraf, and Sridha Sridharan. Fourier lucas-kanade algorithm. IEEE Trans. Pattern Anal. Mach. Intell., 35(6):1383–1396, 2013. 118 [69] Jiayi Ma, Jun Chen, Delie Ming, and Jinwen Tian. A mixture model for robust point matching under multi-layer motion. PloS one, 9(3):e92282, 2014. [70] Jiayi Ma, Ji Zhao, Jinwen Tian, Alan L Yuille, and Zhuowen Tu. Robust point matching via vector field consensus. IEEE Transactions on Image Processing, 23(4):1706–1721, 2014. [71] Marcin Marszalek, Ivan Laptev, and Cordelia Schmid. Actions in context. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 2929–2936. IEEE, 2009. [72] Marcin Marszałek, Ivan Laptev, and Cordelia Schmid. Actions in context. In Proc. IEEE Conf. CVPR, 2009. [73] C Krishna Mohan and B Yegnanarayana. Classification of sport videos using edge-based features and autoassociative neural network models. Signal, Image and Video Processing, 4(1):61–73, 2010. [74] Eduardo Monari and Thomas Pollok. A real-time image-to-panorama registration approach for background subtraction using pan-tilt-cameras. In Proc. IEEE Conf. Advanced Video and Signal Based Surveillance (AVSS), pages 237–242. IEEE, 2011. [75] Juan Carlos Niebles, Chih-Wei Chen, and Li Fei-Fei. 
Modeling temporal structure of decomposable motion segments for activity classification. In Proc. European Conf. ECCV, pages 392–405. Springer, 2010. [76] Flávio LC Pádua, Rodrigo L Carceroni, Geraldo AMR Santos, and Kiriakos N Kutulakos. Linear sequence-to-sequence alignment. IEEE Trans. Pattern Anal. Mach. Intell., 32(2):304–320, 2010. [77] F Perazzi, A Sorkine-Hornung, H Zimmer, P Kaufmann, O Wang, S Watson, and M Gross. Panoramic video from unstructured camera arrays. In Computer Graphics Forum, volume 34, pages 57–68. Wiley Online Library, 2015. [78] P. J. Phillips, H. Moon, P. J. Rauss, and S. Rizvi. The FERET evaluation methodology for face recognition algorithms. IEEE Trans. Pattern Anal. Mach. Intell., 22(10):1090–1104, October 2000. [79] Yair Poleg and Shmuel Peleg. Alignment and mosaicing of non-overlapping images. In Computational Photography (ICCP), 2012 IEEE International Conference on, pages 1–8. IEEE, 2012. 119 [80] Dmitry Pundik and Yael Moses. Video synchronization using temporal signals from epipolar lines. In Proc. European Conf. Computer Vision (ECCV), pages 15–28. Springer, 2010. [81] Cen Rao and Mubarak Shah. View-invariance in action recognition. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), volume 2, pages II–316. IEEE, 2001. [82] Cen Rao, Alper Yilmaz, and Mubarak Shah. View-invariant representation and recognition of actions. Int. J. Computer Vision, 50(2):203–226, 2002. [83] Kishore K Reddy and Mubarak Shah. Recognizing 50 human action categories of web videos. Machine Vision and Applications, 24(5):971–981, 2013. [84] Xiaobo Ren, Tony X Han, and Zhihai He. Ensemble video object cut in highly dynamic scenes. In Proc. IEEE Conf. CVPR, pages 1947–1954. IEEE, 2013. [85] Sreemanananth Sadanand and Jason J Corso. Action bank: A high-level representation of activity in video. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 1234–1241. IEEE, 2012. [86] S. Morteza Safdarnejad, Yousef Atoum, and Xiaoming Liu. Temporally robust global motion compensation by keypoint-based congealing. In Proc. European Conf. Computer Vision (ECCV), pages 101–119. Springer, 2016. [87] S. Morteza Safdarnejad, Xiaoming Liu, and Lalita Udpa. Genre categorization of amateur sports videos in the wild. In Proc. Int. Conf. ICIP. IEEE, 2014. [88] S. Morteza Safdarnejad, Xiaoming Liu, and Lalita Udpa. Robust global motion compensation in presence of predominant foreground. In Proc. British Machine Vision Conf. (BMVC), 2015. [89] S. Morteza Safdarnejad, Xiaoming Liu, Lalita Udpa, Brooks Andrus, John Wood, and Dean Craven. Sports videos in the wild (SVW): A video dataset for sports analysis. In Proc. Int. Conf. Automatic Face and Gesture Recognition (FG), pages 1–7. IEEE, 2015. [90] Masatoshi Sakamoto, Yasuyuki Sugaya, and Kenichi Kanatani. Homography optimization for consistent circular panorama generation. In Advances in Image and Video Technology (PSIVT), pages 1195–1205. Springer, 2006. [91] Pekka Sangi, Jari Hannuksela, Janne Heikkilä, and Olli Silvén. Sparse motion segmentation using propagation of feature labels. In VISAPP (2), pages 396–401. Citeseer, 2013. 120 [92] Harpreet S Sawhney, Steve Hsu, and Rakesh Kumar. Robust video mosaicing through topology inference and local to global alignment. In Proc. European Conf. Computer Vision (ECCV), pages 103–119. Springer, 1998. [93] Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local svm approach. In Proc. Int. Conf. Pattern Recognition (ICPR), volume 3, pages 32–36. 
IEEE, 2004. [94] Hae Jong Seo and Peyman Milanfar. Static and space-time visual saliency detection by self-resemblance. Journal of vision, 9(12):15, 2009. [95] Joan Serrat, Ferran Diego, Felipe Lumbreras, and José Manuel Álvarez. Synchronization of video sequences from free-moving cameras. In Pattern Recognition and Image Analysis, pages 620–627. Springer, 2007. [96] Fatemeh Shokrollahi Yancheshmeh, Ke Chen, and Joni-Kristian Kamarainen. Unsupervised visual alignment with similarity graphs. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 2901–2908. IEEE, 2015. [97] Heung-Yeung Shum and Richard Szeliski. Construction and refinement of panoramic mosaics with global and local alignment. In Proc. Int. Conf. Computer Vision (ICCV). [98] Aljoscha Smoli´c, Yuriy Vatis, Heiko Schwarz, and Thomas Wiegand. Long-term global motion compensation for advanced video coding. In ITG-Fachtagung Dortmunder Fernsehseminar, pages 213–216, 2003. [99] Francesco Solera, Simone Calderara, and Rita Cucchiara. Learning to divide and conquer for online multi-target tracking. arXiv preprint arXiv:1509.03956, 2015. [100] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. [101] Drew Steedly, Chris Pal, and Richard Szeliski. Efficiently registering video into panoramic mosaics. In Proc. Int. Conf. Computer Vision (ICCV), volume 2, pages 1300–1307. IEEE, 2005. [102] M Sullivan and M Shah. Action mach: Maximum average correlation height filter for action recognition. In Proc. IEEE Conf. CVPR, pages 1–8. IEEE, 2008. 121 [103] Yunda Sun, Bo Li, Baozong Yuan, Zhenjiang Miao, and Chengkai Wan. Better foreground segmentation for static cameras via new energy form and dynamic graph-cut. In Proc. Int. Conf. Pattern Recognition (ICPR), volume 4, pages 49–52. IEEE, 2006. [104] Zygmunt L Szpak, Wojciech Chojnacki, and Anton van den Hengel. Robust multiple homography estimation: An ill-solved problem. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 2132–2141. IEEE, 2015. [105] Roberto Toldo and Andrea Fusiello. Robust multiple structures estimation with j-linkage. In Proc. European Conf. ECCV, pages 537–547. Springer, 2008. [106] Ben J Tordoff and David W Murray. Guided-mlesac: Faster image transform estimation by using matching priors. IEEE Trans. Pattern Analysis and Machine Intelligence, 27(10):1523–1535, 2005. [107] Philip HS Torr and Andrew Zisserman. Mlesac: A new robust estimator with application to estimating image geometry. Comput. Vision and Image Understanding, 78(1):138–156, 2000. [108] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment—a modern synthesis. In Vision algorithms: theory and practice, pages 298–372. Springer, 1999. [109] Hirofumi Uemura, Seiji Ishikawa, and Krystian Mikolajczyk. Feature tracking and motion compensation for action recognition. In Proc. British Mach. Vision Conf. (BMVC), pages 1–10, 2008. [110] Yaron Ukrainitz and Michal Irani. Aligning sequences and actions by maximizing spacetime correlations. Springer, 2006. [111] Yaron Ukrainitz and Michal Irani. Aligning sequences and actions by maximizing spacetime correlations. Springer, 2006. [112] M Vargas and E Malis. Deeper understanding of the homography decomposition for visionbase control. Unité de recherche INRIA Sophia Antipolis (France), 2007. [113] CK Wan, BZ Yuan, and ZJ Miao. 
A new algorithm for static camera foreground segmentation via active coutours and GMM. In Proc. Int. Conf. Pattern Recognition (ICPR), pages 1–4. IEEE, 2008. 122 [114] Heng Wang, Alexander Klaser, Cordelia Schmid, and Cheng-Lin Liu. Action recognition by dense trajectories. In Proc. IEEE Conf. CVPR, pages 3169–3176. IEEE, 2011. [115] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Action recognition by dense trajectories. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 3169–3176. IEEE, 2011. [116] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Dense trajectories and motion boundary descriptors for action recognition. Int. J. Comput. Vision, 103(1):60–79, 2013. [117] Heng Wang, Cordelia Schmid, et al. Action recognition with improved trajectories. In Proc. Int. Conf. Computer Vision (ICCV), 2013. [118] Jinjun Wang, Changsheng Xu, and Engsiong Chng. Automatic sports video genre classification using pseudo-2d-hmm. In Proc. Int. Conf. Pattern Recognition (ICPR), volume 4, pages 778–781. IEEE, 2006. [119] Lu Wang, Ulrich Neumann, and Suya You. Wide-baseline image matching using line signatures. In Proc. Int. Conf. ICCV, pages 1311–1318. IEEE, 2009. [120] Oliver Wang, Christopher Schroers, Henning Zimmer, Markus Gross, and Alexander Sorkine-Hornung. Videosnapping: Interactive synchronization of multiple videos. ACM Trans. on Graphics (TOG), 33(4):77, 2014. [121] Daniel Weinland, Edmond Boyer, and Remi Ronfard. Action recognition from arbitrary views using 3d exemplars. In Proc. Int. Conf. Computer Vision (ICCV), pages 1–7. IEEE, 2007. [122] Shandong Wu, Omar Oreifej, and Mubarak Shah. Action recognition in videos acquired by a moving camera using motion decomposition of lagrangian particle trajectories. In Proc. Int. Conf. Computer Vision (ICCV), pages 1419–1426. IEEE, 2011. [123] Yawen Xue, Xiaojie Guo, and Xiaochun Cao. Motion saliency detection using low-rank and sparse decomposition. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 1485–1488. IEEE, 2012. [124] Qing Yan, Yi Xu, Xiaokang Yang, and Truong Nguyen. HEASK: Robust homography estimation based on appearance similarity and keypoint correspondences. Pattern Recognition, 47(1):368–387, 2014. 123 [125] Junsong Yuan, Zicheng Liu, and Ying Wu. Discriminative video pattern search for efficient action detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(9):1728–1743, 2011. [126] Xun Yuan, Wei Lai, Tao Mei, Xian-Sheng Hua, Xiu-Qing Wu, and Shipeng Li. Automatic video genre categorization using hierarchical svm. In Proc. Int. Conf. Image Processing (ICIP), pages 2905–2908. IEEE, 2006. [127] Jordi Zaragoza, Tat-Jun Chin, Quoc-Huy Tran, Michael S Brown, and David Suter. Asprojective-as-possible image stitching with moving dlt. IEEE Trans. Pattern Anal. Mach. Intell., 36(7):1285–1298, 2014. [128] Wei Zeng and Hongming Zhang. Depth adaptive video stitching. In Proc. IEEE Conf. Computer and Information Science (ICIS), pages 1100–1105. IEEE, 2009. [129] Ning Zhang, Ling-Yu Duan, Qingming Huang, Lingfang Li, Wen Gao, and Ling Guan. Automatic video genre categorization and event detection techniques on large-scale sports data. In Proceedings of the 2010 Conference of the Center for Advanced Studies on Collaborative Research, pages 283–297. IBM Corp., 2010. [130] Ning Zhang, Ling-Yu Duan, Lingfang Li, Qingming Huang, Jun Du, Wen Gao, and Ling Guan. A generic approach for systematic analysis of sports videos. 
ACM Transactions on Intelligent Systems and Technology (TIST), 3(3):46, 2012. [131] Sheng-hua Zhong, Yan Liu, Feifei Ren, Jinghuan Zhang, and Tongwei Ren. Video saliency detection via dynamic consistent spatio-temporal attention modelling. In AAAI, 2013. [132] Yaping Zhu, Natan Jacobson, Hong Pan, and Truong Nguyen. Motion-decision based spatiotemporal saliency for video sequences. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 1333–1336. IEEE, 2011. [133] Marco Zuliani, Charles S Kenney, and BS Manjunath. The multiransac algorithm and its application to detect planar homographies. In Proc. Int. Conf. ICIP, volume 3, pages III-153. IEEE, 2005.