TOWARD OPEN WORLD VISUAL UNDERSTANDING By Wentao Bao A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy 2024 ABSTRACT Visual data such as images and videos are the most prominent media to record, transmit, and exchange information in this era. Though we have witnessed waves of success in visual intelli- gence, teaching machines to understand visual content at the level of human intelligence remains a fundamental challenge. In past decades, visual understanding has been extensively explored through computer vision tasks such as object (or activity) recognition, segmentation, and detection. However, existing methods can hardly be deployed in real open-world applications where unseen environments, objects, and activities inevitably appear in testing. Such a limitation is attributed to the closed-world assumption that ignores the unknown in model design, learning, and evaluation. In this dissertation, I will introduce my works that go beyond the traditional closed-world visual understanding and tackle several challenging open-world problems. The ultimate goal is to endow machines with visual perception capabilities in an open world, where unseen environments, image objects, and video activities can be handled. First, I will begin the dissertation by investigating open-world visual forecasting problems in an unseen perception environment. Specifically, I primarily explore how the early observed videos can be leveraged to promptly forecast the traffic accident risk for safe self-driving (in Chapter 2 and Chapter 3), and forecast the 3D hand motion trajectory in an unseen first-person view (in Chapter 4). Second, I will cover the open-world visual recognition problems that aim to identify the unseen visual concepts. In this part, we are especially interested in identifying and localizing unseen human actions in general videos (in Chapter 5 and Chapter 6). Lastly, I will delve into open-world visual language understanding problems that further recognize unseen visual concepts from language queries, including the recognition of unseen compositional objects in images (in Chapter 7) and spatiotemporally detecting unseen human actions (in Chapter 8). In Chapter 9, I summarize the main contributions of this dissertation and discuss unsolved challenges in real-world practices. Based on the line of the dissertation research, some future directions for open-world visual understanding are briefly discussed. Copyright by WENTAO BAO 2024 ACKNOWLEDGEMENTS I would like to extend many thanks to my advisor Yu Kong, for all his efforts in making my five years of Ph.D. fruitful and enjoyable. I am fortunate to have received his hands-on guidance to become an independent researcher. The guidance such as critical thinking, paper writing and presentation, and professional communication, are really helpful to my growth. Particularly, his durable passion for research impresses me a lot, with countless brainstorming discussion hours to exchange cutting-edge ideas. These discussions have motivated most of my top-tier publications with valuable research questions and greatly improved my research taste in the past years. Besides, I want to extend my sincere gratitude to my co-advisor Qi Yu when I was in RIT for the first three years. Though Prof. Qi Yu is a senior professor with a large group of Ph.D. students, he is always supportive of my research, with many insightful comments on my projects. 
I am impressed by his expertise and emphasis on machine learning, which equipped me with a solid ML background for doing computer vision research in the early years of my Ph.D. journey. Besides, I want to sincerely thank my dissertation committee members, Xiaoming Liu, Vishnu Boddeti, and Daniel Morris. Their thoughtful advice and cooperative spirit make my dissertation and defense presentation of high quality at a smooth pace. I particularly want to thank Xiaoming and Vishnu for their recommendations for my career search and grant proposal collaboration, and thank Daniel for his prompt and insightful feedback on my defense. I also want to thank many other excellent faculties and staff from MSU such as Sandeep Kulkarni and Vincent Mattison, and those from RIT such as Pengcheng Shi, Linwei Wang, Rui Li, and Min-Hong Fu, who collectively contributed to excellent doctoral programs in MSU and RIT. Over the past five years, I would like to thank many researchers, engineers, and collaborators who shaped my research interest and contributed to my fruitful doctoral outcomes and career development. Particularly, I want to thank Terrance Boult and Walter Scheirer for their decades of research in open-set recognition as well as their kind and insightful in-person discussions, that inspired my dedication to this dissertation. I want to thank Jason Corso for his interesting and inspiring advice when he was my doctoral consortium advisor and thank Junsong Yuan and iv Heng Huang for paper collaboration and career search recommendations. I also want to thank many scholars with helpful discussions of my dissertation research: Guangyao Chen, Zhen Fang, Bolei Zhou, Xiang Pan, Tianyu Luan, Yuanhao Zhai, Shuai Li. I want to thank Weishi Shi for this countless help in my career search preparation. Also, I want to extend many thanks to my excellent internship mentors and managers, Kai Li, Martin Renqiang Min, Deep Patel from NEC Lab America, Lele Chen, Zhong Li, Yi Xu from OPPO US Research Center, and Weiyu Zhang, Yongxi Lu from Apple. I believe that high-quality research outcomes cannot be achieved without excellent collaborators. I would like to thank my collaborators and those smart Master students that I have honorably mentored, Junwen Chen, Yifan Li, Yuxiao Chen, Lichang Chen, Libing Zeng, Yuansheng Zhu, Xinmiao Lin, Xiwen Dengxiong, Hanbin Hong, Suhan Park. Apart from academics, I value the memorable friendships that have been supporting me to survive and refresh along the Ph.D. journey. I would like to extend my particular thanks to my senior lab mate Junwen Chen, who not only helped my research a lot but also comradely walked me through those difficult early times since my first visit to the U.S. Besides, I want to thank Yifan Li, Yujiang Pu, Zhanbo Huang, Suhan Park, Kwonyeong Cho, Xinmiao Lin, Xiwen Dengxiong, Haiting Hao, Anna Starynska, Chuanqi Zang, Hanbin Hong, and many other ActionLab students, who made the lives of ActionLab enjoyable and wonderful. I also want to thank many many of my friends including but not limited to Lan Wang, Zhiyuan Ren, Andrew Hou, Haobo Zhang, Yijiang Pang, Xiajun Jiang, Yuansheng Zhu, Zixin Yang, and Grace Cao, for their great friendships. Last, but far from least, I must extend my sincere thanks to my family, who have been supporting my overseas Ph.D. study with unconditional love. I truly appreciate their understanding of my life decision to study abroad, especially in the hardships of the past pandemic times. 
In a word, my family is always the harbor of my life journey. v TABLE OF CONTENTS CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 CHAPTER 2 UNCERTAINTY-BASED ACCIDENT ANTICIPATION . . . . . . . . . 10 CHAPTER 3 DEEP REINFORCED EXPLAINABLE ACCIDENT ANTICIPATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 CHAPTER 4 EGOCENTRIC 3D TRAJECTORY FORECASTING . . . . . . . . . . . 50 CHAPTER 5 OPEN-SET ACTION RECOGNITION . . . . . . . . . . . . . . . . . . 75 CHAPTER 6 OPEN-SET TEMPORAL ACTION LOCALIZATION . . . . . . . . . . 100 CHAPTER 7 COMPOSITIONAL ZERO-SHOT LEARNING . . . . . . . . . . . . . . 124 CHAPTER 8 OPEN-VOCABULARY ACTION DETECTION . . . . . . . . . . . . . 147 CHAPTER 9 CONCLUSIONS AND DISCUSSIONS . . . . . . . . . . . . . . . . . . 168 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 vi CHAPTER 1 INTRODUCTION 1.1 Research Motivation Visual understanding has been one of the most fundamental directions in computer vision. Its goal is to teach machines to understand the visual world analogous to humans from data captured by camera sensors. For a long time, visual understanding has been investigated in lab- controlled environments using small-scale data [54, 157]. In the past decades, such a situation has been revolutionized since the arising of deep learning methods [274, 106, 328, 65], large- scale datasets [53, 111, 22, 1], and advanced computation hardware. However, when it comes to real-world applications, most existing visual understanding methods can hardly work well because of dynamically changing visual environments and test-time requirements [363]. In reality, it is impractical to assume infinite observations or annotations in testing. In this dissertation, we denote such a testing scenario as an open world, in which we identify two key aspects in terms of the unknowns with respect to the model learning: (1) unknown data observation and (2) unknown task requirements. The former can be attributed to the infeasible access to the incoming data such as streaming videos, or unseen testing environments. The latter often deals with task requirements that contain new semantics, e.g., identifying unseen wild animals. In summary, at every testing moment in an open world, the visual understanding systems could be fed with only the historical observational data, in an unseen environment, and asked to identify unknown concepts. Given these open-world situations, we ask the question: how to achieve the visual understanding of various unknowns in an open world? To answer the question, we first begin with open-world visual forecasting, because future events are naturally unseen in streaming videos. Due to limited visual observations, it requires a comprehensive abstraction of physical patterns and temporal dynamics from early observed visual data. In this dissertation, we take the accident risk anticipation for self-driving safety and 3D hand trajectory prediction for virtual reality applications as case studies. Three types of unseen are handled as shown in Fig. 1.1 (left). Specifically, the first one is to answer the question: how to 1 Figure 1.1 Open-world Visual Understanding. The many “unknowns” in an open world could be handled in various computer vision applications. This dissertation primarily covers the topics of visual forecasting, recognition, and vision-language understanding in an open world. predict the unseen future accident risk from dashed camera videos? 
We answer this question from a Bayesian probabilistic view such that the predictive uncertainties of unseen future risk can be quantified in real-world testing. Furthermore, the second one is the unknown distractors in accident forecasting. This motivates us to study which regions are key to the model to “visually” attend in accident forecasting. Our work, for the first time, introduces selective visual attention to suppress the unknown driving distractors in forecasting. Lastly, when a forecasting system is deployed to a 3D physical world, it is interesting to answer the question: can the model perform forecasting in an unknown 3D scene? This motivates us to explore 3D hand trajectory forecasting in an open world where the testing scenes are unseen. Next, regarding open-world visual recognition problems, we consider the most fundamental research question: can we build a video model knowing what is unknown? The unknown means that in testing, either the entire video contains unseen activities or only a few short clips of a long video contain unseen activities, as shown in Fig. 1.1 (middle). This is challenging because the model does not have any information about test-time unknowns in training, as evidenced by many open-set recognition works in the image understanding domain [278, 16, 86]. Furthermore, specific to the video modality, some key challenges such as how to efficiently quantify the unknown scores for large-scale video data and how to deal with the open-world temporal dynamics, have never been explored in literature. With these motivations, we develop the first works for the open- set video understanding tasks, including open-set video recognition and open-set temporal action 2 ?open-world visual forecastingopen-world visual recognitionopen-world vision-language understandingstripped zebrasliced applestrippedslicedzebraappletomatoesshoes?????in unknown sceneunknown distractorsunknown futureclosed setopen set??unknown clipsunknown actionsunknown compositions??unknown action regions Figure 1.2 Visual Forecasting. localization, that aim to recognize and localize the unseen human activities in videos. Lastly, with the recent trend of language modeling in computer vision, we are further interested in the power of the current vision-language foundation models such as CLIP [261] and large-language models (LLM) [247] in handling more complex open-world visual understanding problems, as shown in Fig. 1.1 (right). Under the CLIP-like vision-language modeling paradigm, two complex open-world visual understanding research questions are explored in this dissertation. The first one is to understand unseen compositional concepts from images. Different from the prior works [239, 214], we aim to ask the question: how to effectively leverage LLMs to understand images of unseen compositional concepts? We introduce the idea of prompting the language-informed distributions when adapting the CLIP model. The other one is to understand unseen action regions from videos. This is motivated by the question: can we detect any unseen video actions from an open-ended action vocabulary? For the first time, we formulate it as an open-vocabulary action detection problem and explore key factors when adapting video-based CLIP models. 1.2 Organization 1.2.1 Part I: Open-World Visual Forecasting Forecasting from early observed visual data is essentially to mimic the capability of human imagination. We human beings are able to imagine how things will evolve in time. 
For example, human drivers could avoid tragic collisions with other cars or pedestrians by a subconscious mind of the trend of their motions [329]. However, such a capability is extremely challenging for machines to learn from video data. The challenges lie in the following aspects. First, intuitively the less we have observed, the more uncertain our forecasting about the future will be. Furthermore, when 3 ?pastfuture the videos are captured from egocentric views, the relative motion between an ego-agent and the dynamic environments incurs more complexity in understanding the trend of physical motion from videos. Lastly, instead of video frame prediction, exploring visual forecasting problems in specific in-the-wild applications, such as traffic accident risks or 3D hand trajectories, requires injecting domain-specific knowledge or constraints into the model learning. These aspects are even more challenging in an open world where unseen distractors or environments are tested. In this dissertation, we explore the challenges above through the lens of traffic accident antici- pation and 3D hand trajectory forecasting problems, enabling safe autonomous driving and robust virtual reality applications. Specifically, in Sec. 2, we develop a novel model to early predict the occurrence of future traffic accidents from dashcam videos. In this work, we collected accident videos recorded by dash cameras mounted on driving cars, and annotated the start timestamps of the traffic accidents. The task is to early predict whether there will be an accident or not before it happens. Technically, the model first detects the traffic objects on each frame which are further structured as a graph. Then, the model is learned to capture the spatial and temporal dynamics from the sequence of graphs. Finally, the learned features are used to predict the probability of accident occurrence by Bayesian neural networks (BNNs) [241], which provide uncertainty estimation in prediction. The Bayesian uncertainty modeling is valuable for providing a learning regularization in training, and more importantly, the variability of unknown future accident risk can be quantified to achieve trustworthy autonomous driving in an open world. Motivated by this work, in Sec. 3, we further explore the traffic accident anticipation by considering the visual explanation, that interprets how the unseen distractors are suppressed in forecasting. We argue that in autonomous driving scenarios, a better way to achieve trustworthy decisions from input videos is to provide visual attentional cues. Inspired by visual attention modeling which mimics the human visual attention mechanism [85, 245], we propose to formulate the driver’s visual attention and the system’s accident anticipation in a single Markov decision process (MDP). At each time step, our model “fixates" at the most risky regions on dashcam videos and predicts the occurrence of future accidents as well as the next fixation point. The model is 4 Figure 1.3 Open-Set Action Recognition. learned by deep reinforcement learning (DRL) and achieves much better performance than prior arts while providing visual attention as the explanation. In addition, we explore the visual forecasting problem through the tasks of 3D trajectory prediction in the open-world virtual reality scenario (in Sec. 4). 
It aims to early predict the future hand trajectory in either seen or unseen 3D physical world from egocentric (first-person view) videos, while only giving the historically observed videos and trajectories as input. Different from the previous accident anticipation, this task is challenging due to the requirement of 3D sensing of the indoor environment and helmet wearers from dynamic videos. In this work, for the first time, we build the benchmarks of the egocentric 3D hand trajectory forecasting by collecting and annotating the egocentric videos. Then, we develop a novel and effective model for the task by leveraging recent Transformers [328, 305] and the classical uncertainty-aware state-space modeling. The model empirically achieves the best trajectory forecasting performance on both seen and unseen 3D egocentric scenes. 1.2.2 Open-world Visual Recognition Detecting the unknown from visual data has been one of the fundamental visual understanding topics. The unknown refers to the objects or actions not defined in the model learning process due to the open-world testing scenario [278, 317]. For example, in a self-driving scenario, if a self-driving system has only learned from common objects such as vehicles, pedestrians, trees, buildings, etc., we expect the system can also know that a wild animal standing on the road is unknown. Without 5 such a capability, there could be a catastrophic traffic accident caused by the self-driving system. For a long period, visual recognition has been dominated by empirical risk minimization (ERM) of the statistical machine learning [327], which aims to learn a recognition model by minimizing task error over the collected training data. When the test data exhibit a different semantic distribution from the training data, e.g., data from unknown categories exist, the recognition model would fail to identify the unknown. According to whether the unknown data exists in training (without labels), the testing unknown can be treated as the unknown-unknown and the known-unknown [86]. In this dissertation, for the first time, we tackle these two types of unknowns for open-world video understanding. Specifically, detecting the unknown-unknown is studied by open-set recognition while the known-unknown problem is studied by open-set action localization. In Sec. 5, we formulate the video-based open-set action recognition (OSAR) problem, which aims to learn a video classification model with the capability of identifying the unknown action in testing. The fundamental challenge is to learn a scoring function to identify the unknown when training a video-based classifier. Inspired by evidential deep learning (EDL) [283, 4], we treat the video classification as an evidence collection process such that low total evidence indicates a high classification uncertainty of the testing video, i.e., more likely contains an unknown action. Besides, compared to image-based open-set recognition [278, 147, 34], OSAR models need to overcome the static bias issue that the model tends to learn the shortcut mapping from the confounding frame- level visual features to the class labels. We propose to debias the target video feature by introducing biased auxiliary branches, that encourage the biased feature to be discriminative in classification while statistically independent of the target feature. 
Eventually, our proposed method, termed the deep evidential action recognition (DEAR) model, achieves superior performance over different video benchmarks using multiple action recognition backbones. In Sec. 6, following the DEAR work, we further extend the EDL method to tackle the known- unknown detection problem through the open-set temporal action localization (OpenTAL) task, which we formulate as the first work in existing literature. The task aims to temporally localize and recognize human actions while identifying the unknown action in recognition given a testing video. 6 Figure 1.4 Visual understanding paradigm. Since the model is learned on video data with annotations of only the known action categories, the unknown actions are mixed with the background segments. This presents a unique challenge that the model needs to simultaneously distinguish between the foreground and background and classify between known and unknown actions. To tackle this challenge, we develop an OpenTAL model based on DEAR, that uses evidential uncertainty to identify the unknown action proposals and proposes a semi-supervised binary classification module to predict the actionness score for each proposal. Besides, to comprehensively evaluate the OpenTAL task, we develop a new evalu- ation protocol, open-set detection rate (OSDR), to benchmark our method. Our method achieves much better performance than baselines that combine existing TAL models [192] with open-set recognition methods. 1.2.3 Open-world Vision-Language Understanding Revisiting the history of computer vision research, most visual understanding literature has been exploring a key research question: what feature representation do we need for visual data? Past research experienced from hand-craft feature engineering [212, 48, 12] to the recent data-driven deep learning paradigm [158, 106, 65]. Despite the success achieved, they are still limited to representing visual features for complex requirements from diversified vision tasks. Thanks to the recent vision-language models [261], the traditional visual understanding evolves into a vision- language alignment problem, as shown in Fig. 1.4. For example, the CLIP model [261] that has been pre-trained on web-scale image-text pairs shows superior zero-shot recognition performance, i.e., recognizing unseen concepts without fine-tuning in an open world. Moreover, in the spirit of 7 closedopenhand-craftvision-centricvision-languagevisvis⨂vistxt “next-token-prediction” by large-language models (LLM) [247], more general visual understanding tasks can be unified by utilizing language models as interfaces to represent task input or output. These advances inspire us to investigate more challenging open-world understanding problems by foundation models such as CLIP and LLM. In this dissertation, we first explore how to leverage LLMs to strengthen the CLIP model for recognizing unseen compositional concepts from image data. Next, for video understanding, we make contributions by adapting video-based CLIP and LLM for localizing unseen human actions in the space and time of videos. In Sec. 7, the goal is to achieve compositional zero-shot learning (CZSL) capability for image data by pre-trained CLIP. With such a capability, a powerful image recognition model can be learned from limited data without large-scale compositional annotations. For example, the model could recognize unseen compositions such as sliced tomatoes when the model only learns the sliced from sliced potatoes and tomatoes from red tomatoes. 
In this work, we leverage LLM to generate multiple descriptions for compositional concepts, which enables class-specific Gaussian distribution modeling for margin-aware prompt optimization and enhances the textual class representation for vision-language alignment. We further decompose the text and image into simple primitives, i.e., states and objects, for hierarchical representation learning. Our method shows superiority in both the closed- and open-world testing environments. In Sec. 8, for video understanding, we formulate the first open-vocabulary action detection (OVAD) work that could detect any human actions from videos. This work is inspired by existing open-vocabulary learning literature, that uses a pre-trained CLIP model to align visual features with corresponding language semantics from an open-ended vocabulary. To fully exploit the CLIP semantics for recognition and CLIP localizability for spacetime localization, we build an OpenMixer model that bridges the pre-trained video CLIP and detection transformer (DETR) [24] where an LLM is used to obtain generalizable language context for the open-ended action vocabulary. In Sec. 9, we conclude the dissertation with discussions on the limitations of the mentioned publications and present some ideas for future work. 8 1.3 Relevant Publications Following is the list of relevant first-authored publications for each chapter. • Chapter 2 - Uncertainty-based Traffic Accident Anticipation with Spatio-Temporal Relational Learning [315] (ACM MM 2020) • Chapter 3 - Deep Reinforced Accident Anticipation with Visual Explanation [316] (ICCV 2021) • Chapter 4 - Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting [310] (ICCV 2023) • Chapter 5 - Evidential Deep Learning for Open Set Action Recognition [317] (ICCV 2021) • Chapter 6 - Towards Open Set Temporal Action Localization [318] (CVPR 2022) • Chapter 7 - Prompting Language-Informed Distribution for Compositional Zero-Shot Learn- ing [311] (ECCV 2024) • Chapter 8 - Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection [312] (arXiv 2024) 9 CHAPTER 2 UNCERTAINTY-BASED ACCIDENT ANTICIPATION 2.1 Introduction Accident anticipation aims to predict an accident from dashcam video before it happens. It is one of the most important tasks for safety-guaranteed autonomous driving applications and has been receiving increasing attentions in recent years [29, 303, 68, 46]. Thanks to the accident anticipation, the safety level of intelligent systems on vehicles could be significantly enhanced. For example, even a successful anticipation made only a few seconds earlier before the occurrence of an accident can help a self-driving system to make urgent safety control, avoiding a possible car crash accident. However, accident anticipation is still an extremely challenging task due to the noisy and limited visual cues in an observed dashcam video. Take Fig. 2.1 as an example, a traffic scene captured in an egocentric view is typically crowded with multiple cars, pedestrians, motorcyclists, and so on. In this scenario, accident-relevant visual cues could be overwhelmed by objects that are not relevant to the accident, making an intelligent system insensible to a car crash accident that will happen at the road intersection. Nevertheless, traffic accidents are foreseeable by training a powerful uncertainty-based model to distinguish the accident-relevant cues from noisy video data. 
For example, the inconsistent motions of multiple vehicles may indicate a high risk of possible future accidents. In this paper, we propose a novel uncertainty-based accident anticipation model with spatio- temporal relational learning. The model aims to learn accident-relevant cues for accident antici- pation by considering both spatial and temporal relations among candidate agents. The candidate agents are a group of moving objects like vehicles and their relational features are indicative of future unobserved accidents. The spatial relations of candidate agents are learned from their spatial This chapter is from the following publication: "Wentao Bao, Qi Yu, and Yu Kong. Uncertainty-based traffic accident anticipation with spatio-temporal relational learning. In Proceedings of the 28th ACM International Conference on Multimedia (ACM MM), 2020." 10 Figure 2.1 Illustration of Uncertainty-based Accident Anticipation. This paper presents a novel model to predict the probabilities (black curve) of a future accident (ranges from 90-th to 100- th frame). Our goal is to achieve early anticipation (large Time-to-Accident) giving a threshold probability (horizontal dashed line), while estimating two kinds of predictive uncertainties, i,e., aleatoric uncertainty (wheat color region) and epistemic uncertainty (blue region). distance, visual appearance features, as well as historical visual memory. The temporal relations of agents provide learnable patterns to indicate how the agents evolve and end with an accident in the temporal context. It can be recurrently learned by updating historical memory with agent-specific features and spatial relational representation. To address the variability of the spatio-temporal re- lational representations, a probabilistic module is incorporated to simultaneously predict accident scores and estimate how much uncertainty when making the prediction. As shown in Fig. 2.2, on the one hand, we propose to learn spatial relations with graph convolutional networks (GCN) [52, 145] by considering the hidden states from a recurrent neural network (RNN)) [199, 285] cell. On the other hand, we propose to build temporal relations with RNNs by considering both spatial relational and agent-specific features. The cyclic process of the coupled GCNs and RNNs could generate representative latent spatio-temporal relational features. Besides, we propose to incorporate Bayesian deep neural networks (BNNs) [58, 241] into our model to address the predictive uncertainty. With the Bayesian formulation, our derived epistemic uncertainty-based ranking loss is effective in improving the quality of the learned relational features and significantly leads to performance gain. At last, to further consider the global guidance of all hidden states in the training stage, we propose a self-attention aggregation layer as shown in Fig. 2.4, from which an auxiliary video-level loss is obtained and demonstrated beneficial to our model. Compared with existing RNN-based methods [29, 303], our model captures not only agent- 11 Time-to-Accidentthreshold specific features but also relational features for accident anticipation. Compared with the recent approach [242] which is developed with 3D CNNs, our model is developed with GCNs and RNNs so that both spatial and temporal relations can be learned. Moreover, our method is capable of estimating the predictive uncertainty while all existing methods are deterministic. 
The proposed model is evaluated on two public dashcam video datasets, i.e., DAD [29] and A3D [303], and our collected Car Crash Dataset (CCD). Experimental results show that our model can outperform existing methods on all datasets. For DAD datasets, our method can anticipate a traffic accident 3.53 seconds on average earlier before an accident happens. With the best precision setting, our model can achieve 72.22% average precision. Compared with DAD and A3D datasets, our CCD dataset includes diversified environmental annotations and accident reason descriptions, which could promote research on traffic accident reasoning. The main contributions of this paper are summarized below: • We propose a traffic accident anticipation model by considering both agent-specific features and their spatio-temporal relations, as well as the predictive uncertainty. • With Bayesian formulation, the spatio-temporal relational representations can be learned with high quality by a novel uncertainty-based ranking loss. • We propose a self-attention aggregation layer to generate video-level prediction in the training stage, which serves as global guidance and is demonstrated beneficial to our model. • We release a new dataset containing real traffic accidents, in which diversified environmental annotations and accident reasons are provided. 2.2 Related Work Traffic Accident Anticipation To anticipate traffic accidents that will happen in future frames, an intuitive solution is to iteratively predict the accident confidence score for each time step. Chan et al. [29] recently proposed DSA framework to leverage candidate objects appeared in each frame to represent the traffic status. They applied spatial-attention to these objects to get a weighted 12 feature representation for each LSTM cell. Based on this work, Suzuki et al. [303] proposed an adaptive loss for early anticipation with quasi-recurrent neural networks [19]. Similar to the DSA that implements dynamic-spatial attention to focus on accident-relevant objects, Corcoran and James [46] proposed a two-stream approach to traffic risk assessment. They utilized features of candidate objects as input of a spatial stream and optical flow as input of a temporal stream, and the two-stream features are fused for the risk level classification. Instead of using dashcam videos, Shah et al. [286] proposed to use surveillance videos to anticipate traffic accidents by using the framework DSA. Different from previous works, recently Neumann and Zisserman [242] used 3D convolutional networks to predict the sufficient statistics of a mixture of 1D Gaussian distributions. In addition to using only dashcam video data, Takimoto et al. [304] proposed to incorporate physical location data to predict the occurrence of traffic accidents. Closely related to traffic accident anticipation, traffic accident detection has been recently studied by Yao et al. [368]. They proposed to detect traffic anomalies by predicting the future locations on video frames using ego-motion information. To anticipate both spatial risky regions and temporal accident occurrence, Zeng et al. [387] proposed a soft-attention RNN by considering the event agent such as the human that triggers the event. However, existing work typically ignores the relations between accident-relevant agents which capture important cues to anticipate accidents in future frames. Besides, none of them considers the uncertainty estimation in developing their models, which is critical to safety-guaranteed systems. 
Uncertainty in Sequential Modeling. Uncertainty estimation is crucial to sequential relational modeling. One way is to directly formulate the latent representations of relational observations at each time step as random variables, which follow posterior distributions that can be approximated by deep neural networks. This is similar to the variational auto-encoder (VAE) [143, 270]. Inspired by VAE, Chung et al. [43] proposed a variational recurrent neural network (VRNN), which formulates the hidden states of an RNN as random variables and uses neural networks to approximate the posterior distributions of those variables. To further consider the relational representation of sequential data, Hajiramezanali et al. [103] proposed variational graph recurrent neural networks (VGRNN) for the dynamic link prediction problem by combining the graph RNN and the variational graph auto-encoder (VGAE) [144]. Another way to address uncertainty estimation is to formulate the weights of a neural network as random variables, as in Bayesian neural networks (BNNs) [58, 241]. Recently, Zhao et al. [400] proposed a Bayesian graph convolution LSTM model for skeleton-based action recognition. In this paper, we also use graph convolution and BNNs, but the difference is that their method uses stochastic gradient Hamiltonian Monte Carlo (SGHMC) sampling for posterior approximation, while we use Bayes-by-Backprop [18] as our approximation method. Compared with SGHMC, the Bayes-by-Backprop method can be seamlessly integrated into a deep learning optimization process, so it is more flexible for learning tasks with large-scale datasets, i.e., the dashcam videos used in traffic accident anticipation.

2.3 Approach

Problem Setup. In this paper, the goal of accident anticipation is to predict an accident from dashcam videos before it happens. Formally, given a video with current time step $t$, the model is expected to predict the probability $a_t$ that an accident event will happen in the future. Furthermore, suppose an accident will happen at time step $y$ where $t < y$; the Time-to-Accident (TTA) is defined as $\tau = y - t$, where $t$ is the first time that $a_t$ exceeds a given threshold (see Fig. 2.1). For any $t \geq y$ with a positive video that contains an accident, we define $\tau = 0$, which means the model fails to anticipate the accident. Our goal is to predict $a_t$ and expect $\tau$ to be as large as possible for dashcam videos that contain accidents. Similar to [29], the ground truth of $a_t$ is expressed with a 2-dimensional one-hot encoding, so that the prediction target is $\mathbf{a}_t = (a_t^{(p)}, a_t^{(n)})^\top$, where $a_t^{(p)}$ and $a_t^{(n)}$ represent the positive and negative predictions, respectively, meaning an accident will or will not happen in the given video.

Figure 2.2 Framework of the proposed model. With graph embedded representations $G(\mathbf{X}_t, \mathbf{A}_t)$ at time step $t$, our model learns the latent relational representations $\mathbf{Z}_t$ by the cyclic process of graph convolutional networks (GCNs) and a recurrent neural network (RNN) cell, and predicts the accident score $\mathbf{a}_t$ by Bayesian neural networks (BNNs).

Framework Overview. The framework of our model is depicted in Fig. 2.2. With a dashcam video as input, a graph is constructed from the detected objects and their features at each time step.
To learn the spatio-temporal relations of these objects, we use graph convolutional networks (GCNs) to learn the spatial relations and leverage the hidden state $\mathbf{h}_t$ of a recurrent neural network (RNN) cell to enhance the input of the last GCN layer. Besides, the latent relational features are fused with the corresponding object features as input to an RNN cell to update the hidden state at the next time step. This cyclic process encourages our model to learn the latent relational features $\mathbf{Z}_t$ from both spatial and temporal aspects. Furthermore, we propose to use a Bayesian neural network (BNN) to predict the accident score $a_t$ so that predictive uncertainties are naturally formulated. During the training stage, we propose a self-attention aggregation (SAA, see Fig. 2.4) layer to predict a video-level score, which globally guides the learning of the proposed model. In the following sections, each part of our model is introduced in detail.

2.3.1 Spatio-Temporal Relational Learning

The spatio-temporal relations of traffic accident-relevant agents are informative for predicting future accidents. In our model, we propose to use graph-structured data to represent the observation at each time step. The feature learning of spatial and temporal relations is then coupled into a cyclic process.

Graph Representation. Graph representation of a traffic scene has advantages over full-frame feature embedding in that the impact of the cluttered traffic background can be reduced and informative relations of traffic agents can be discovered for accident anticipation. Similar to [29, 286], we exploit object detectors [267, 23] to obtain a fixed number of candidate objects. These objects are treated as graph nodes so that a complete graph can be formed. However, the computational cost of graph convolution could be tremendous if the node features are high-dimensional. To construct low-dimensional but representative features for the graph nodes $\mathbf{X}_t$, we introduce fully-connected (FC) layers to embed both the full-frame and candidate object features into the same low-dimensional space. Then, the frame-level and object-level features are concatenated to enhance the representation capability:
$$\mathbf{X}_t^{(i)} = \left[\Phi\big(\mathbf{O}_t^{(i)}\big),\, \Phi\big(\mathbf{F}_t\big)\right], \quad (2.1)$$
where $\Phi$ denotes an FC layer, and $\mathbf{O}_t^{(i)}$ and $\mathbf{F}_t$ are the high-dimensional features of the $i$-th object and the corresponding frame at time $t$, respectively. The operator $[\cdot,\cdot]$ represents concatenation along the feature dimension and is used throughout this paper for simplicity.

The graph edges at time $t$ are expressed as an adjacency matrix $\mathbf{A}_t$ of a complete graph, since we do not have information on which candidate object will be involved in an accident. Typically, an object closer to others has a higher possibility of being involved in a future accident. Therefore, the spatial distance between objects should be considered in the edge weights, and we define $\mathbf{A}_t$ as
$$\mathbf{A}_t^{(ij)} = \frac{\exp\{-d(r_i, r_j)\}}{\sum_{ij}\exp\{-d(r_i, r_j)\}}, \quad (2.2)$$
where $d(r_i, r_j)$ measures the Euclidean distance between two candidate object regions $r_i$ and $r_j$. With this formulation, a closer distance leads to a larger $\mathbf{A}_t^{(ij)}$, meaning that objects $i$ and $j$ receive a larger weight when graph convolution learns their relational features for accident anticipation. Note that, due to object occlusions, a small distance defined in pixel space does not necessarily indicate a close distance in the physical world. It is possible to use 3D real-world distance if the camera intrinsics are known. Nevertheless, the adjacency matrix defined in Eq. 2.2 has the advantage of suppressing the impact of irrelevant objects that lie at a significantly large pixel distance from the relevant ones.
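The following is a minimal PyTorch sketch of the per-frame graph construction in Eqs. 2.1 and 2.2. The class name, feature dimensions, and the use of box centers for the pairwise distance $d(r_i, r_j)$ are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class GraphBuilder(nn.Module):
    """Builds per-frame graph inputs: embedded node features (Eq. 2.1)
    and a distance-based adjacency matrix over a complete graph (Eq. 2.2)."""

    def __init__(self, obj_dim=4096, frame_dim=4096, embed_dim=128):
        super().__init__()
        self.phi_obj = nn.Sequential(nn.Linear(obj_dim, embed_dim), nn.ReLU())
        self.phi_frame = nn.Sequential(nn.Linear(frame_dim, embed_dim), nn.ReLU())

    def forward(self, obj_feats, frame_feat, boxes):
        # obj_feats: (N, obj_dim) features of N candidate objects
        # frame_feat: (frame_dim,) full-frame feature
        # boxes: (N, 4) object boxes in pixels, used only for pairwise distances
        n = obj_feats.size(0)
        # Eq. 2.1: concatenate the embedded object and frame features per node
        frame_emb = self.phi_frame(frame_feat).expand(n, -1)
        x = torch.cat([self.phi_obj(obj_feats), frame_emb], dim=-1)   # (N, 2*embed_dim)

        # Eq. 2.2: softmax-style normalization of exp(-distance) over all pairs
        centers = 0.5 * (boxes[:, :2] + boxes[:, 2:])                 # box centers (assumption)
        dist = torch.cdist(centers, centers)                          # (N, N) Euclidean distances
        weights = torch.exp(-dist)
        adj = weights / weights.sum()                                 # normalize over all (i, j)
        return x, adj

# usage sketch with placeholder tensors
builder = GraphBuilder()
x, adj = builder(torch.randn(20, 4096), torch.randn(4096), torch.rand(20, 4) * 224)
print(x.shape, adj.shape)  # torch.Size([20, 256]) torch.Size([20, 20])
```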
Temporal Relational Learning. To build temporal relations across time steps, RNN methods such as LSTM [113] and GRU [41] are widely adopted in existing works. However, since traffic objects may not remain present in every frame, the node features of the statically structured graph change dynamically over time. Thanks to the recent graph convolutional recurrent network (GCRN) [285], node dynamics defined over a static graph structure can be handled [103]. Therefore, we propose to adapt GCRN for temporal relational feature learning. Specifically, the hidden states $\mathbf{h}_t$ of the RNN cell at each time step are recurrently updated by
$$\mathbf{h}_{t+1} = \mathrm{GCRN}\left(\left[\mathbf{Z}_t, \mathbf{X}_t\right], \mathbf{h}_t\right), \quad (2.3)$$
where $\mathbf{Z}_t$ is the relational feature generated by the last GCN layer. The feature fusion between $\mathbf{Z}_t$ and $\mathbf{X}_t$ ensures that our model makes full use of both agent-specific and relational features.

Spatial Relational Learning. To capture the spatial relations of the detected objects, we follow the graph convolution defined in [52, 145] for each GCN layer. We use two stacked GCN layers and feed the hidden state $\mathbf{h}_t$ learned by the RNN into the last layer to learn the spatial relational features:
$$\mathbf{Z}_t = \mathrm{GCN}\left(\left[\mathrm{GCN}\left(\mathbf{X}_t, \mathbf{A}_t\right), \mathbf{h}_t\right], \mathbf{A}_t\right). \quad (2.4)$$
The fusion with $\mathbf{h}_t$ makes the latent relational representation aware of temporal contextual information. This fusion is demonstrated to be effective in boosting accident anticipation performance in our experiments.
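The sketch below illustrates one step of the cyclic spatio-temporal update in Eqs. 2.3-2.4. As a simplification, a standard GRU cell stands in for the full GCRN cell and a plain dense graph convolution is used; class names, dimensions, and the placeholder adjacency in the usage example are assumptions.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Dense graph convolution on a small complete graph: H' = ReLU(A X W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        return torch.relu(adj @ self.lin(x))

class STRelationalCell(nn.Module):
    """One step of the cyclic spatio-temporal relational learning (Eqs. 2.3-2.4),
    with a GRUCell standing in for the GCRN cell."""
    def __init__(self, feat_dim=256, hid_dim=256):
        super().__init__()
        self.gcn1 = GCNLayer(feat_dim, hid_dim)
        self.gcn2 = GCNLayer(hid_dim + hid_dim, hid_dim)       # input is [GCN(X, A), h]
        self.rnn = nn.GRUCell(hid_dim + feat_dim, hid_dim)     # input is [Z, X]

    def forward(self, x_t, adj_t, h_t):
        # Eq. 2.4: spatial relational feature conditioned on the temporal hidden state
        z_t = self.gcn2(torch.cat([self.gcn1(x_t, adj_t), h_t], dim=-1), adj_t)
        # Eq. 2.3: temporal update from the fused relational and agent-specific features
        h_next = self.rnn(torch.cat([z_t, x_t], dim=-1), h_t)
        return z_t, h_next

# roll the cell over T frames of one video (shapes and values are placeholders)
cell, T, N = STRelationalCell(), 100, 20
h = torch.zeros(N, 256)
for t in range(T):
    x_t = torch.randn(N, 256)
    adj_t = torch.softmax(torch.randn(N, N), dim=-1)
    z_t, h = cell(x_t, adj_t, h)
```

At test time, $\mathbf{Z}_t$ would be passed to the BNN head described next, while $\mathbf{h}_t$ carries the temporal context forward.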
2.3.2 BNNs for Accident Anticipation

To predict the traffic accident score $\mathbf{a}_t$, a straightforward approach is to use standard neural networks (NNs), as shown in Fig. 2.3a. However, the output of an NN is a point estimate that cannot capture the intrinsic variability of the input relational features at each time step. Moreover, NNs can be overconfident in false predictions when the model overfits. We therefore incorporate Bayesian neural networks (BNNs) [58, 241] into our framework for accident score prediction, as shown in Fig. 2.3b. The BNN module consists of two BNN layers that take the latent representation $\mathbf{Z}_t$ given by Eq. 2.4 as input to predict the accident score $\mathbf{a}_t$. To the best of our knowledge, we are the first to incorporate BNNs into video-based traffic accident anticipation so that predictive uncertainty can be obtained. The predictive uncertainty is used not only to guide the relational feature learning (see Section 2.3.3), but also to provide tools for interpreting the model's behavior.

Figure 2.3 (a) Neural Networks vs. (b) Bayesian Neural Networks. Compared with NNs (Fig. 2.3a), the network parameters of BNNs (Fig. 2.3b) are sampled from Gaussian distributions so that both $\mathbf{a}_t$ and its uncertainty can be obtained.

Since we formulate the accident anticipation head as a BNN, its network parameters such as weights and biases are all random variables, denoted as $\boldsymbol{\theta}$. Each entry of $\boldsymbol{\theta}$ is drawn from a Gaussian distribution with a mean and variance, i.e., $\theta^{(j)} \sim \mathcal{N}(\mu, \sigma)$, in which $\boldsymbol{\alpha}^{(j)} = (\mu, \sigma)$ needs to be learned from the dataset $\mathcal{D} = (\mathbf{Z}_t, \mathbf{a}_t)$. The likelihood of a prediction can thus be expressed as $p(\mathbf{a}_t \mid \mathbf{Z}_t, \boldsymbol{\theta}) = \mathcal{N}(f(\mathbf{Z}_t; \boldsymbol{\theta}), \beta)$, where $\beta$ is the predictive variance.

However, according to the Bayesian rule, obtaining the true posterior over the model parameters, $p(\boldsymbol{\theta} \mid \mathcal{D})$, requires not only the likelihood and prior of $\boldsymbol{\theta}$ but also the marginal distribution $\int p(\mathbf{a}_t \mid \mathbf{Z}_t, \boldsymbol{\theta})\, d\boldsymbol{\theta}$, which is intractable since $\mathbf{a}_t = f(\mathbf{Z}_t, \boldsymbol{\theta})$ is modeled by a complex neural network. To estimate $p(\boldsymbol{\theta} \mid \mathcal{D})$, existing variational inference (VI) methods [96, 18, 79] can be used. We adopt the VI method Bayes-by-Backprop [18] to approximate $p(\boldsymbol{\theta} \mid \mathcal{D})$, since it can be seamlessly incorporated into standard gradient-based optimization and thus scale to large video datasets. Following [18], the variational approximation minimizes the objective
$$\arg\min_{\boldsymbol{\alpha}} \sum_{i=1}^{J} \log q\left(\boldsymbol{\theta}_i \mid \boldsymbol{\alpha}\right) - \log p\left(\boldsymbol{\theta}_i\right) - \log p\left(\mathcal{D} \mid \boldsymbol{\theta}_i\right), \quad (2.5)$$
where $J$ is the number of Monte Carlo samples of $\boldsymbol{\theta}$. The first term $q(\boldsymbol{\theta}_i \mid \boldsymbol{\alpha})$ is the variational posterior parameterized by $\boldsymbol{\alpha}$; the distribution parameters $\boldsymbol{\alpha}$ can be efficiently learned with the reparameterization trick and standard gradient descent [18]. We denote this loss term as $\mathcal{L}_{VPOS}$. The second term $p(\boldsymbol{\theta}_i)$ is the prior distribution of $\boldsymbol{\theta}$, typically modeled with a spike-and-slab distribution, i.e., a mixture of two Gaussian densities with zero means but different variances. We denote this loss term as $\mathcal{L}_{PRI}$.

The third term in Eq. 2.5 is the negative log-likelihood of the model predictions. Since minimizing this term is equivalent to minimizing the mean squared error (MSE), we instead propose an exponential binary cross-entropy to achieve this objective:
$$\mathcal{L}_{EXP} = \sum_{t=1}^{T} -e^{-\max\left(0,\, \frac{y-t}{f}\right)} \log a_t^{(p)} + \sum_{t=1}^{T} -\log\left(1 - a_t^{(n)}\right), \quad (2.6)$$
where $f$ is the constant frame rate of the given video and $y$ is the accident beginning time provided by the training set. The exponential weighting factor applies a larger penalty to time steps closer to the beginning of an accident.

2.3.3 Uncertainty-guided Ranking Loss

With the Bayesian formulation, we can perform multiple forward passes at each time step and obtain an ensembled prediction by averaging the outputs. Furthermore, as suggested by [136], the predictive uncertainty (variance) can be decomposed into aleatoric and epistemic uncertainty [163, 289]:
$$\mathbf{U}_t = \underbrace{\frac{1}{M}\sum_{i=1}^{M} \left[\mathrm{diag}\left(\hat{\mathbf{a}}_i\right) - \hat{\mathbf{a}}_i \hat{\mathbf{a}}_i^\top\right]}_{\text{Aleatoric Uncertainty } (\mathbf{U}_t^{alt})} + \underbrace{\frac{1}{M}\sum_{i=1}^{M} \left(\hat{\mathbf{a}}_i - \bar{\mathbf{a}}\right)\left(\hat{\mathbf{a}}_i - \bar{\mathbf{a}}\right)^\top}_{\text{Epistemic Uncertainty } (\mathbf{U}_t^{ept})}, \quad (2.7)$$
where $\bar{\mathbf{a}} = \frac{1}{M}\sum_{i=1}^{M} \hat{\mathbf{a}}_i$ and $\hat{\mathbf{a}}_i = (\hat{a}_t^{(n)}, \hat{a}_t^{(p)})_i^\top$ is the prediction of the $i$-th forward pass at time step $t$, with $M$ forward passes in total.
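Before unpacking the two terms of Eq. 2.7, the following sketch shows how the decomposition could be computed from $M$ stochastic forward passes of the BNN head at one time step. The function name and the way traces are returned are illustrative assumptions.

```python
import torch

def decompose_uncertainty(probs):
    """Decompose predictive uncertainty from M stochastic forward passes (Eq. 2.7).

    probs: (M, 2) tensor of softmax outputs (a_hat_n, a_hat_p) for one time step.
    Returns the 2x2 aleatoric and epistemic matrices and their scalar traces.
    """
    # aleatoric term: mean over passes of diag(a) - a a^T
    aleatoric = (torch.diag_embed(probs)
                 - probs.unsqueeze(2) @ probs.unsqueeze(1)).mean(dim=0)
    # epistemic term: spread of the per-pass predictions around their mean
    diff = probs - probs.mean(dim=0, keepdim=True)
    epistemic = (diff.unsqueeze(2) @ diff.unsqueeze(1)).mean(dim=0)
    return aleatoric, epistemic, torch.trace(aleatoric), torch.trace(epistemic)

# usage: simulate M = 10 forward passes of the stochastic BNN head
probs = torch.softmax(torch.randn(10, 2), dim=-1)
u_alt, u_ept, tr_alt, tr_ept = decompose_uncertainty(probs)
```

The scalar traces are the quantities compared across consecutive time steps by the ranking loss introduced next.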
The first term in Eq. 2.7 is the aleatoric uncertainty, which measures the input variability (noise) seen by the BNNs. In our model, the aleatoric uncertainty serves as an indicator of the quality of the relational features learned by the GCNs and RNNs. The second term in Eq. 2.7 is the epistemic uncertainty, which is determined by the BNN model itself. Inspired by Ma et al. [216], the epistemic uncertainties of sequential predictions should ideally be monotonically decreasing: the more frames the model observes, the more confident the learned model should be (smaller epistemic uncertainty). Therefore, we propose a novel ranking loss:
$$\mathcal{L}_{RANK} = \max\left(0,\ \mathrm{trace}\left(\mathbf{U}_t^{ept} - \mathbf{U}_{t-1}^{ept}\right)\right), \quad (2.8)$$
where $\mathbf{U}_{t-1}^{ept}$ and $\mathbf{U}_t^{ept}$ are the epistemic uncertainties of successive frames $t-1$ and $t$ as defined in Eq. 2.7. Note that $\mathbf{U}_t$ and the two terms in Eq. 2.7 are $2\times 2$ matrices; in practice we therefore use the matrix trace to quantify the uncertainties, similar to the method adopted in [289]. The ranking loss penalizes predictions that violate this epistemic uncertainty ranking rule. The aleatoric uncertainty $\mathbf{U}_t^{alt}$ is not required to satisfy the monotonic ranking constraint, since the noise ratio of the accumulated data in a video sequence is intrinsically not monotonic.

2.3.4 Temporal Self-Attention Aggregation

A recurrent network naturally builds temporal relations of observations. However, a drawback of RNNs is that inaccurate hidden states from early temporal stages can accumulate through the iterative procedure and mislead the model into giving false predictions at later stages. Besides, the hidden states at different time steps should be adaptively weighted when anticipating the occurrence of a future accident.

To this end, motivated by the recent self-attention design [328], we propose a self-attention aggregation (SAA) layer used in the training stage, which adaptively aggregates the hidden states of all time steps. The aggregated representation is then used to predict a video-level accident score. The architecture of the SAA layer is shown in Fig. 2.4.

Figure 2.4 SAA Layer. First, all $N \times T$ hidden states are gathered and pooled by max-avg concatenation. Then, the simplified self-attention and adaptive aggregation are proposed to predict the video-level accident score $\mathbf{a}$.

Specifically, we first aggregate the hidden states of the $N$ individual objects at each time step by concatenating the mean- and max-pooling results. Then, self-attention [328] is adapted to weigh the representations of all $T$ time steps; in this module, the embedding layers are not used. Lastly, instead of simple average pooling, we introduce an FC layer with $T$ learnable parameters to adaptively aggregate the $T$ temporal hidden states. The aggregated video-level representation is used to predict the video-level accident score $\mathbf{a}$ by two FC layers.
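Below is a rough sketch of the SAA head as described above. The embedding-free attention (plain scaled dot-product over the pooled per-step features), the hidden sizes, and the two-layer classifier are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SelfAttentionAggregation(nn.Module):
    """Training-time SAA head: pools per-object hidden states, applies
    embedding-free self-attention over time, and adaptively aggregates
    the T steps into one video-level accident score."""

    def __init__(self, hid_dim=256, num_steps=100):
        super().__init__()
        self.temporal_agg = nn.Linear(num_steps, 1, bias=False)  # T learnable weights
        self.classifier = nn.Sequential(
            nn.Linear(2 * hid_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, hidden_states):
        # hidden_states: (T, N, hid_dim) hidden states of N objects over T steps
        pooled = torch.cat([hidden_states.mean(dim=1),
                            hidden_states.max(dim=1).values], dim=-1)    # (T, 2*hid_dim)
        # simplified self-attention over time (no query/key/value embeddings)
        attn = torch.softmax(pooled @ pooled.t() / pooled.size(-1) ** 0.5, dim=-1)  # (T, T)
        attended = attn @ pooled                                          # (T, 2*hid_dim)
        # adaptive aggregation with T learnable weights, then two FC layers
        video_feat = self.temporal_agg(attended.t()).squeeze(-1)          # (2*hid_dim,)
        return torch.softmax(self.classifier(video_feat), dim=-1)         # video-level (a_n, a_p)

saa = SelfAttentionAggregation()
score = saa(torch.randn(100, 20, 256))   # e.g., T = 100 frames, N = 20 objects
```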
The SAA prediction branch is trained with the binary cross-entropy (BCE) loss:
$$\mathcal{L}_{BCE} = -\log a^{(p)} - \log\left(1 - a^{(n)}\right), \quad (2.9)$$
where $\mathbf{a} = (a^{(n)}, a^{(p)})^\top$ is normalized by the softmax function. This auxiliary objective encourages the model to learn better hidden states even though the SAA layer is not used in the testing stage. Finally, the complete learning objective of our model is to minimize the weighted loss
$$\mathcal{L} = \mathcal{L}_{EXP} + w_1 \cdot \left(\mathcal{L}_{VPOS} - \mathcal{L}_{PRI}\right) + w_2 \cdot \mathcal{L}_{RANK} + w_3 \cdot \mathcal{L}_{BCE}, \quad (2.10)$$
where $\mathcal{L}_{VPOS}$ and $\mathcal{L}_{PRI}$ are the variational posterior and prior loss terms. The constants $w_1$, $w_2$, and $w_3$ are set to 0.001, 10, and 10, respectively, to balance the magnitudes of the loss terms. The second penalty term $(\mathcal{L}_{VPOS} - \mathcal{L}_{PRI})$, also termed the complexity loss, acts as a regularizer that helps overcome over-fitting. The third penalty term $\mathcal{L}_{BCE}$ introduces video-level classification guidance, while the fourth term $\mathcal{L}_{RANK}$ brings uncertainty ranking guidance to the training of our model.

2.4 Experiments

In this section, we evaluate our model on three real-world datasets, including our collected Car Crash Dataset (CCD) and two public datasets, i.e., the Dashcam Accident Dataset (DAD) [29] and the AnAn Accident Detection (A3D) dataset [368]. State-of-the-art methods are compared and ablation studies are performed to validate our model.

Table 2.1 Comparison between the CCD dataset and existing datasets. Information about DAD and A3D is obtained from their released sources. Temp means temporal accident time annotations. RandABT means accident beginning times are randomly placed. EgoIn means ego-vehicles are involved in accidents. Light indicates whether the data is collected in day or night. Weather includes rainy, snowy, and sunny conditions. Bbox means bounding box tracklets for accident participants. Reasons contains multiple possible reasons for each accident participant.

Datasets     # Videos   # Pos    Hours    Temp   RandABT   EgoIn   Light   Weather   Bbox   Reasons
DAD [29]     1,750      620      2.43 h   ✓                ✓
A3D [368]    1,500      1,500    3.56 h   ✓                ✓
Ours (CCD)   4,500      1,500    6.25 h   ✓      ✓         ✓       ✓       ✓         ✓      ✓

Figure 2.5 Annotation samples of our Car Crash Dataset (CCD). The gray box on the top-left contains video-level annotations, while the other three white boxes provide instance-level annotations.

2.4.1 Datasets

CCD dataset.¹ In this paper, we collect a challenging Car Crash Dataset (CCD) for accident anticipation. We ask annotators to label YouTube accident videos with temporal annotations, diversified environmental attributes (day/night, snowy/rainy/good weather conditions), whether ego-vehicles are involved, accident participants, and accident reason descriptions. For the temporal annotations, the accident beginning time is labeled at the moment a car crash actually happens. To obtain trimmed videos of 5 seconds each, the accident beginning times are further randomly placed in the last 2 seconds, generating 1,500 traffic accident video clips. We also collected 3,000 normal dashcam videos from BDD100K [375] as negative samples. The dataset is divided into 3,600 training videos and 900 testing videos. Examples are shown in Fig. 2.5 and a detailed comparison with existing datasets is reported in Table 2.1. Compared with DAD [29] and A3D [368], our CCD is larger and has more diversified annotations.

¹The CCD dataset is available at: https://github.com/Cogito2012/CarCrashDataset

DAD dataset. DAD [29] contains dashcam videos collected in six cities in Taiwan. It provides 620 accident videos and 1,130 normal videos. Each video is trimmed and sampled into 100 frames, 5 seconds long in total. For accident videos, accidents are placed in the last 10 frames. The dataset has been divided into 1,284 training videos (455 positives and 829 negatives) and 466 testing videos (165 positives and 301 negatives).
A3D dataset. A3D [368] is also a dashcam accident video dataset. It contains 1,500 positive traffic accident videos. In this paper, we only keep the 587 videos in which ego-vehicles are not involved in accidents. We sampled each A3D video at 20 fps to obtain 100 frames in total and placed the beginning time of each accident in the last 20 frames, similar to DAD. The dataset is divided into an 80% training set and a 20% testing set.

2.4.2 Evaluation Metrics

Average Precision. This metric evaluates the correctness of identifying an accident from a video. Following the same definition as [29], at time step $t$, if $a_t^{(p)}$ is larger than a threshold, the prediction at frame $t$ is positive (an accident is anticipated); otherwise it is negative. For accident videos, all frames are labeled as ones (positive); otherwise the labels are zeros (negative). In this way, the precision, recall, and the derived Average Precision (AP) can be adopted to evaluate models.

Time-to-Accident. This metric evaluates the earliness of accident anticipation based on positive predictions. For a range of threshold values, multiple TTA results and the corresponding recall rates can be obtained. We then use mTTA and TTA@0.8 to evaluate earliness, where mTTA is the average of all TTA values and TTA@0.8 is the TTA value when the recall rate is 80%. Note that if a large portion of predictions are false positives, very high TTA results can still be achieved while the corresponding AP is low; this means the model is overfitting to accident videos and may give positive predictions for arbitrary input. Therefore, except for fair comparison with existing methods, we mainly report TTA metrics when the highest AP is achieved, because a high TTA is meaningless if a high AP cannot be guaranteed.

Predictive Uncertainty. Based on Eq. 2.7, we introduce the mean aleatoric uncertainty (mAU) and the mean epistemic uncertainty (mEU) to evaluate the predictive uncertainties.
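To make the TTA metric concrete, the following is a small sketch of how TTA could be computed for a single positive video from its frame-level scores, a decision threshold, and the annotated accident start frame. The function name, fps argument, and aggregation comment are assumptions for illustration; mTTA would average this quantity over positive videos for a range of thresholds.

```python
import numpy as np

def time_to_accident(scores, accident_frame, threshold=0.5, fps=20.0):
    """Return TTA (in seconds) for one positive video.

    scores: per-frame accident probabilities a_t^(p), shape (T,)
    accident_frame: annotated accident beginning frame y
    A video counts as anticipated only if the score crosses the threshold
    before frame y; otherwise TTA = 0 (anticipation failed).
    """
    scores = np.asarray(scores)
    above = np.where(scores[:accident_frame] >= threshold)[0]
    if len(above) == 0:
        return 0.0
    first_hit = above[0]                        # earliest frame exceeding the threshold
    return (accident_frame - first_hit) / fps   # tau = (y - t), converted to seconds

# usage with a toy monotonically increasing score curve
tta = time_to_accident(np.linspace(0, 1, 100), accident_frame=90, threshold=0.5)
print(f"TTA = {tta:.2f} s")
```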
The trained detector is used to detect candidate objects and then extract VGG-16 features of full-frame and all objects. As suggested by Bayes-by-backprop [18], we set the number of forward passes 𝑀 to 2 in training stage and 10 for testing stage. For the hyper- parameters of prior distribution, we set the mixture ratio 𝜋 to 0.5 and the variances of the two Gaussian distributions 𝜎1 to 1 and 𝜎2 to exp(−6). The dimensions of both hidden state of RNN and output of GCNs are set to 256. In the training stage, we set batch size to 10 and initial learning rate to 0.0005 with ReduceLROnPlateau as learning rate scheduler. The model is trained by Adam optimizer for totally 70 training epochs. 2.4.4 Performance Evaluation Compare with State-of-the-art Methods. Existing methods [29, 387, 303] are compared and results are reported in Table 2.2. For fair comparison, we use the model at the last training epoch for evaluation on DAD datasets. Nevertheless, the trained model with best AP is kept for evaluation on other two datasets since high AP is important to suppress impact of false positives on TTA evaluation. Note that these two metrics currently are only applicable to our model, since we are the first to introduce uncertainty formulation for accident anticipation. From Table 2.2, our model on DAD dataset achieves the best mTTA which means the model anticipates on average 3.53 seconds earlier before an accident happens, while keeping competitive AP performance at 53.7% compared with L-RAI and adaLEA. Note that the video lengths of the 24 Table 2.3 TTA with different recall rates on DAD dataset. Recall DSA [29] Ours 0.1 0.28 0.59 0.2 0.50 0.75 0.3 0.73 0.84 0.4 0.87 0.96 0.5 0.92 1.07 0.6 1.02 1.16 0.7 1.24 1.33 0.8 1.35 1.56 0.9 2.28 1.99 three datasets are all 5 seconds, our high performance on A3D and CCD demonstrate that our model is easier to be trained on different datasets. This can be explained by the mAU results due to their consistence with TTA evaluation results in Table 2.2. The low mAU values on A3D and DAD datasets reveal that our model has learned relational representations with high quality on these datasets. We further report TTA results with different recall rates from 10% to 90% in Table 2.3. It shows that our model outperform DSA in most of recall rate requirements. For recall rates larger than 80%, our method performs poorly compared with DSA. However, high recall rate may also lead to too much false alarm so that AP cannot be guaranteed to be high. This finding also supports our motivation to use the trained model with best AP for evaluation. Visualization We visualized accident anticipation results with samples in DAD dataset (see Fig. 2.6). The uncertainty regions indicate that in both early and late stages, the model is quite confident on prediction (low uncertainties), while in the middle stage when accident scores start are increasing, the model is uncertain to give predictions. Note that the predicted epistemic uncertainty (blue region) is not necessary to be monotonically decreasing since we only use Eq. 2.8 as training regularizer rather than strict guarantee on predictions. The results are with good interpretability, in that driving system is typically quite sure about the accident risk level when the self-driving car is far from or almost being involved in an accident, while it is uncertain about it when accumulated accident cues are insufficient to make decision. 
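For reference, the AP and TTA protocol above operates directly on the frame-level scores a_t^{(p)}. The following is a minimal Python/NumPy sketch of one reading of the TTA definition in Sec. 2.4.2; the function and variable names are hypothetical, and it assumes per-frame scores for a positive video, the labeled accident start frame, and a fixed frame rate (e.g., 20 fps for the 100-frame, 5-second DAD clips).

import numpy as np

def time_to_accident(scores, accident_frame, threshold, fps=20.0):
    """TTA for one positive video: time between the first frame whose score
    exceeds the threshold and the labeled accident start frame."""
    above = np.where(scores[:accident_frame] >= threshold)[0]
    if len(above) == 0:          # accident never anticipated -> TTA = 0
        return 0.0
    return (accident_frame - above[0]) / fps

def mean_tta(all_scores, accident_frames, thresholds, fps=20.0):
    """mTTA: TTA averaged over positive videos and over a sweep of thresholds."""
    ttas = []
    for th in thresholds:
        per_video = [time_to_accident(s, f, th, fps)
                     for s, f in zip(all_scores, accident_frames)]
        ttas.append(np.mean(per_video))
    return float(np.mean(ttas))

TTA@0.8 would additionally require locating the threshold at which the recall rate reaches 80% and reading off the corresponding TTA at that operating point.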
2.4.5 Ablation Study
In this section, to validate the effectiveness of the main components, the following components are replaced or removed and compared with our model under the best-AP setting. (1) BNNs: the BNNs are replaced with vanilla FC layers; note that in this case, \mathcal{L}_{VPOS} - \mathcal{L}_{PRI}, our proposed ranking loss \mathcal{L}_{RANK} in Eq. 2.10, and mAU are not applicable. (2) SAA: the SAA layer is removed so that \mathcal{L}_{BCE} in Eq. 2.10 is not used. (3) GCN: we replace the GCNs with vanilla FC layers in Eq. 2.3 and Eq. 2.4. (4) Fusion: the fusion in Eq. 2.3 and Eq. 2.4 is removed such that only Z_t and GCN(X_t, A_t) are used, respectively. (5) RankLoss: the epistemic uncertainty-based ranking loss is removed so that \mathcal{L}_{RANK} in Eq. 2.10 is not applicable.

Figure 2.6 Examples of our predictions on DAD datasets. The red curves indicate smoothed accident scores as observed frames increase. The ground truths (beginning time of an accident) are labeled at the 90-th frame. We plot one time (1×) of the squared epistemic (blue region) and aleatoric (wheat-colored region) uncertainties. The horizontal line indicates the probability threshold 0.5, at which the accidents in these examples are anticipated with TTA = 2.4 s.

Table 2.4 Ablation study results on the DAD dataset. Variant (1) is the full model; variants (2)-(6) ablate BNNs, SAA, GCN, Fusion, and RankLoss, respectively.

Variants     (1)      (2)      (3)      (4)      (5)      (6)
BNNs         ✓                 ✓        ✓        ✓        ✓
SAA          ✓        ✓                 ✓        ✓        ✓
GCN          ✓        ✓        ✓                 ✓        ✓
Fusion       ✓        ✓        ✓        ✓                 ✓
RankLoss     ✓                 ✓        ✓        ✓
AP(%)        72.22    70.38    67.34    67.10    65.50    64.60
mAU          0.0731   –        0.1150   0.1250   0.1172   0.0950

Table 2.5 Model size comparison. Our model variants (2), (4), and (5) are included for comparison. The unit M means a million.

Methods         DSA    Ours   v(2)   v(4)   v(5)
# Params (M)    4.40   1.97   1.66   1.97   1.90

Results are shown in Table 2.4. We can clearly see that the uncertainty-based ranking loss contributes the most to our model, by comparing variants (2) and (6) with (1), with about a 7.6% performance gain. Though the BNNs module leads to a small performance gain by itself, we attribute the benefit of BNNs to its derived uncertainty ranking loss as well as the interpretable results. Furthermore, the lowest mAU and highest AP for variant (1) demonstrate that the learned relational features are of the highest quality (smallest uncertainty) compared with the other variants. The results of variant (3) validate the effectiveness of our self-attention aggregation (SAA) layer, while the results of variant (4) validate the superiority of GCN over naive FC layers. The results of variant (5) show that the feature fusion between GCN outputs and hidden states, and the fusion between relational features and agent-specific features, are important to accident anticipation, leading to approximately a 7% performance gain.

Model Size Comparison. The numbers of network parameters are counted and reported in Table 2.5. It shows that the proposed model is much more lightweight than DSA and only slightly increases the model size compared with the other variants of our model.

2.5 Conclusion
In this paper, we propose an uncertainty-based traffic accident anticipation model with spatio-temporal relational learning. Our model can handle the challenges of relational feature learning and anticipation uncertainty from video data. Moreover, the introduced Bayesian formulation not only significantly boosts anticipation performance by using the uncertainty-based ranking loss, but also provides interpretation of the predictive uncertainty.
In addition, we release a CCD dataset for accident anticipation which contains rich environmental attributes and accident reason annotations. 27 CHAPTER 3 DEEP REINFORCED EXPLAINABLE ACCIDENT ANTICIPATION 3.1 Introduction With increasing demand for autonomous driving, anticipating possible future accidents is becoming the central consideration to guarantee a safe driving strategy [29, 303, 315]. Given a dashcam video, an accident anticipation model aims to tell the driving system if and when a traffic accident will occur in the near future. Despite remarkable advances in visual perception [55, 84, 106], the decision-making of driving control has long been studied in isolation with vision perception research for the autonomous driving scenario [141, 277]. We target at bridging this gap by investigating a key research question: where do drivers look when predicting possible future accidents? This will lead to a visually explainable model that associates low-level visual attention and high-level accident anticipation. The traffic accident anticipation is far from being solved due to the following challenges. First, the visual cues of a future accident are vital to training a discriminative model but in practice, they are difficult to be captured from the limited and noisy video data before the accident occurs. Previous works take advantage of object detection and learn the accident visual cues by either soft attention in [29] or graph relational learning in [315]. In this paper, we propose to explicitly learn the visual attention behavior to address where to look such that accident-risky regions can be localized. Second, it is intrinsically a trade-off between an early decision and a correct decision since the earlier to anticipate an accident, the harder to make the decision right due to fewer accident-relevant cues. Existing works [29, 315] simply address the trade-off by training supervised deep learning models with an exponentially weighted classification loss. In this paper, we address this trade-off by formulating the task as a Markov Decision Process (MDP), where exploration and exploitation This chapter is adapted from the following publication: "Wentao Bao, Qi Yu, and Yu Kong. DRIVE: Deep Reinforced accident anticipation with visual explanation. International Conference on Computer Vision (ICCV), 2021." In 28 Figure 3.1 The Markov decision process of the DRIVE model. The neural network agent (left) learns to exploit visual attentive state (bottom) to predict the actions including the accident score and the next fixation (top), which in return explore the driving environment (right) to maximize the total reward (middle). can be dynamically balanced in a driving environment. In the context of accident anticipation, the MDP model aims to exploit the immediate visual cues for accident anticipation and also explore more possibilities of accident scoring and attention allocation. Our proposed DRIVE model is illustrated with the MDP perspective in Fig. 3.1. The DRIVE model simultaneously learns the policies of accident anticipation and fixation prediction based on a deep reinforcement learning (DRL) algorithm. At each time step, the agent takes actions to predict the occurrence probability of a future accident, as well as the fixation point indicating where drivers will look in the next time step. 
Our environment model dynamically provides the observation state by considering both bottom-up and top-down visual attention, which is recurrently modulated by the actions from the previous time step. We develop a novel dense anticipation reward to encourage early and accurate prediction, as well as a sparse fixation reward to enable visual explanation. Moreover, to effectively train the DRIVE model on real-world datasets, substantial improvements are made to the DRL algorithm SAC [101]. Our method is demonstrated to be effective on the DADA-2000 dataset [68] and can be easily extended to the DAD dataset [29], which has no fixation annotations.

The proposed approach differs from existing works [29, 303, 46, 315] that are formulated within the supervised learning (SL) framework. The proposed DRL-based solution is fundamentally superior to SL in that DRL can utilize immediate observations to achieve a long-term goal, i.e., making an early decision for anticipating future accidents. Moreover, according to [142], our method is introspectively explainable as compared to [29, 46], which simply provide rationalizations (post-hoc explanations), since we explicitly formulate drivers' visual attention during model learning. Our experimental results also validate that the learned visual attention serves as the causality of the outcome from the agent.

Figure 3.2 The DRIVE Model. At each time step t, the traffic observation environment model (left part) acquires visual attention from bottom-up and top-down pathways, generating an observation state s_t by the dynamic attention fusion (equation box) and feature pooling. The stochastic multi-task agent (right part) takes s_t as input to predict the actions a_t, which include the accident score â and the next fixation point p̂. All the states, actions, and rewards are collected in the replay buffer D to train the two policy networks of the agent.

The main contributions are threefold:
• The DRIVE model is proposed for traffic accident anticipation from dashcam videos based on deep reinforcement learning (DRL).
• The DRIVE model is visually explainable by explicitly simulating human visual attention within a unified DRL framework.
• The proposed dense anticipation reward and sparse fixation reward are effective in training the model with our improved DRL training algorithm.

3.2 Related Work
Traffic Accident Anticipation. Different from recent works on accident detection [368, 118], accident recognition [373], and early action/activity prediction [35, 149], the accident anticipation problem is more challenging because the model needs to make an early decision before the accident occurs. The accident anticipation task was first formally proposed by Chan et al. [29], who introduced the DSA-RNN method to solve it. This method is based on object detection and dynamic soft attention at each time step and uses an LSTM to sequentially predict the accident score. In [303], an adaptive loss for early accident anticipation was introduced. Based on these works, Fatima et al.
[71] proposed a feature aggregation method for LSTM-based sequential accident score prediction. Inspired by the success of the two-stream design, Corcoran et al. [46] adopted RGB video and optical flow for accident anticipation. Neumann et al. [242] formulated the temporal accident scores as a mixture of Gaussian distribution and proposed to use 3D-CNN to predict the sufficient statistics of the distribution. Recently, Bao et al. [315] proposed to use GCN and Bayesian deep learning for traffic accident anticipation. In addition to the dashcam video used in these works, Shah et al. [286] utilized surveillance videos to anticipate traffic accidents. Zeng et al. [387] recently proposed to anticipate failing accidents by localizing risky regions within an RNN framework. By investigating these works, we found that they typically adopted recurrent neural networks or 3D convolutional networks as the model architecture. However, their supervised learning (SL) design requires large amounts of annotated training data. In terms of explainability, [29, 46] only give post-hoc bounding box explanations, which are essentially rationalizations rather than introspective explanations. RL-based Visual Attention. Visual attention has been studied for several decades and it has been widely modeled as a Markov process [189, 165]. The earlier work [245] utilized the actor-critic RL algorithm on top-down attention modeling. Mnih et al. [228] developed an RL-based recurrent visual attention model for image classification. Jiang et al. [127] used the Least-Squares Policy Iteration for visual fixation prediction. Recent works such as [355] and [115] implemented deep RL algorithms for 360◦ video-based human head movement prediction. In addition to RL methods, inverse reinforcement learning (IRL) algorithms take the advantages of expert demonstrations to train policy networks for task objectives, and a recent work [367] showed that IRL can be leveraged to predict goal-directed human attention. Zhang et al. [394] proposed an imitation learning framework by using human fixations to learn a policy network for Atari games. In this paper, 31 different from these works, we integrate visual attention and traffic accident anticipation into a unified RL framework in a real-world environment. Explainable Self-driving. For self-driving applications, it is important to provide explainable decision making so that the self-driving system can be trusted by humans. Similar to our work, recently Xia et al. [349] proposed to use the foveal vision mechanism to model human visual attention for driving speed prediction. Kim et al. [141] used the visual attention model and causal filtering to visually explain the predicted steering control, i.e., steering angle and speed. Based on this work, [142] further proposed to combine both visual attention and textual description for self- driving behavior explanation. Though there are existing works investigating the visual attention of drivers in traffic scenario [55, 350, 3], few of them simultaneously formulate the up-stream visual attention and down-stream accident anticipation into a unified learnable model. Inspired by these works, in this paper we propose that the traffic accident anticipation can be visually explained by explicitly modeling the visual attention behavior of ego-vehicle drivers. 3.3 Approach Framework Overview. Fig. 3.2 illustrates the framework of the DRIVE model. 
Given a dashcam video as input, the stochastic multi-task agent (right part) recurrently outputs the accident score ˆ𝑎 and the next fixation ˆ𝑝 at each time step based on the observation state from the environment (left part). In particular, the environment is built by considering the bottom-up and top-down attention of the dashcam video frames, while the agent consists of a shared state auto-encoder and two parallel prediction branches. The two actions ˆ𝑎 and ˆ𝑝 are guided by the reward 𝑟 𝐴 and 𝑟𝐹 respectively to encourage earliness, correctness, and attentiveness. During inference, the DRIVE model simultaneously observes the driving environment by visual attention allocation to risky regions and predicts the occurrence probability of a future accident by the trained agent. Problem Setup. In this paper, we follow the task setting in the existing literature [29, 315]. A traffic accident anticipation model aims to predict a frame-level accident score 𝑎𝑡 that indicates the probability of the accident occurrence in the future. To evaluate the performance, Time-to-Accident (TTA) 𝑡𝑡𝑎 = max(0, 𝑡𝑎 − 𝑡) is used to evaluate earliness, where 𝑡𝑎 is the actual beginning time of an 32 accident and 𝑡 is the first point in time when the predicted score is higher than a threshold 𝑎0, i.e., 𝑎𝑡 > 𝑎0. A larger 𝑡𝑡𝑎 indicates the earlier time the model can anticipate the traffic accident. Besides, binary classification and saliency evaluation metrics are adopted to evaluate the correctness and attentiveness. Inspired by the natural decision-making process of human drivers, i.e., observe and anticipate, we formulate the traffic accident anticipation and fixation prediction tasks as a unified Markov Decision Process (MDP). Formally, let a tuple (S, A, 𝑃, 𝑅, 𝐿, 𝛾) represent a discounted MDP with finite horizon (video length) 𝐿, where S and A are spaces of action and state, 𝑅 defines the reward for each state-action pair, and 𝛾 ∈ (0, 1] is a discount factor. In this paper, the action a𝑡 in the action space A consists of accident score 𝑎𝑡 and the next fixation point 𝑝𝑡 = (𝑥𝑡+1, 𝑦𝑡+1)𝑇 defined in the image domain such that a𝑡 = (𝑎𝑡, 𝑥𝑡+1, 𝑦𝑡+1)𝑇 . The state s𝑡 is shared with the two kinds of actions. The state representation and action policy will be introduced in Sec. 3.3.1 and 3.3.2, respectively. Note that 𝑃 defines the state transition model, i.e., 𝑃(s𝑡+1|s𝑡, a𝑡). In our method, the state transition 𝑃 is achieved by the fixation prediction module (Eq. (3.3)) and the environment observation module (Eq. (3.1) and 3.2). In Sec. 3.3.3 and 3.3.4, the reward design and training algorithm will be discussed, respectively. 3.3.1 Traffic Observation Environment To anticipate a traffic accident, the observation state needs to be discriminative to distinguish accident-relevant cues from the cluttered traffic scene. In this paper, we are inspired by the perception mechanism of the human visual system. It is widely acknowledged that visual perception is dependent on two distinct types of attention procedure, i.e., bottom-up attention and top-down attention [44]. The bottom-up attention is determined by the salient visual stimuli from the sensory input, while the top-down attention is driven by the browsing task to achieve a long-term cognitive goal. These two mechanisms have been demonstrated to be successful in modeling the visual attention of drivers in traffic scene [56, 57]. 
For traffic accident anticipation, observing the entire scene is inefficient, while the attention mechanism can be utilized to capture discriminative accident-relevant cues for better traffic observation state modeling.

(a) Full Frame  (b) Foveal Frame  (c) Bottom-up Attention  (d) Top-down Attention
Figure 3.3 Examples of Foveation and Attention. With a saliency model, the full frame I (Fig. 3.3a) and its foveated frame F(I, p) (Fig. 3.3b) are used to generate the bottom-up attention G(I) (Fig. 3.3c) and the top-down attention G(F(I, p)) (Fig. 3.3d), respectively.

Traffic Attention Modeling. Given the observation I_t of the current time step, the bottom-up attention S_t^{bu} ∈ R^{H×W} is simulated by a saliency prediction module G, i.e., S_t^{bu} = G(I_t), where G is instantiated by recent deep convolutional neural networks (CNNs) such as [47, 225] so that feature extraction can be shared with the saliency module. As the saliency module is not updated by the actions, S_t^{bu} is only determined by the appearance of the video frames. To simulate the top-down attention S_t^{td} ∈ R^{H×W}, we propose an auxiliary task to predict the fixation point p_t ∈ R^2, which dynamically guides the visual attention allocation to the risky region at each time step. Specifically, S_t^{td} is computed by applying a foveal vision module F before feeding into the saliency module, i.e., S_t^{td} = G(F(I_t, p_t)), where F is implemented by the widely used method in [85]. As S_t^{td} depends on the action of fixation prediction, the agent thus dynamically interacts with the attention-based observation environment. In this paper, both S_t^{bu} and S_t^{td} are normalized in [0, 1] to follow their probabilistic nature. Fig. 3.3 visualizes them along with the corresponding video frames. It clearly shows that bottom-up attention highlights the most salient objects while top-down attention is more centralized in the risky region. This is because the foveal vision filters out irrelevant visual stimuli and only attends to the fixated area that indicates a high risk of a future accident.

To combine the two attention mechanisms, we propose a novel dynamic attention fusion (DAF) method, which is a weighted sum of S_t^{bu} and S_t^{td}:

S_t = (1 - \rho_t) S_t^{bu} + \rho_t S_t^{td},   (3.1)

where \rho_t is defined as \rho_t = \min(m, a_t). Here, a_t ∈ [0, 1] is the predicted accident score and m ∈ (0, 1) serves as a hyperparameter. By introducing m to clip a_t, instead of directly using a_t, DAF gains flexibility in utilizing the learned top-down attention, because m controls the maximum percentage to which S_t^{td} is utilized (\rho_t ≤ m). Note that for any a_t < m, we have \rho_t = a_t, such that a_t and 1 - a_t are used as the weighting factors. The motivation is that the more probable an accident is at the current time step (a_t → 1), the more confidently the predicted top-down attention can be utilized at the next time step.

The benefits of Eq. (3.1) are substantial. Because both \rho_t and S_t^{td} depend on the actions from the agent, the proposed DAF method dynamically fuses visual attention by considering both the immediate observation from the environment and the previous decision made by the agent. Our experimental results show that DAF yields better accident anticipation than static attention fusion (SAF), i.e., manually setting a fixed weighting factor. Furthermore, because the attention mechanism is explicitly formulated for accident anticipation, the resulting decisions of the agent can be visually explained by telling which region is risky.
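As a concrete illustration of Eq. (3.1), the dynamic attention fusion reduces to a few lines; the sketch below (Python/PyTorch-style, with hypothetical tensor names) assumes the two attention maps are already normalized to [0, 1] and uses the default m = 0.5 reported later in the experiments.

def dynamic_attention_fusion(s_bu, s_td, a_t, m=0.5):
    """Eq. (3.1): S_t = (1 - rho_t) * S_bu + rho_t * S_td,
    with rho_t = min(m, a_t) clipping the top-down contribution.
    s_bu, s_td: (H, W) attention maps in [0, 1]; a_t: scalar accident score."""
    rho = min(m, a_t)                       # rho_t = min(m, a_t)
    return (1.0 - rho) * s_bu + rho * s_td  # fused attention map S_t, shape (H, W)

For example, a_t = 0.8 with m = 0.5 gives rho_t = 0.5, i.e., equal weighting of the two maps, whereas a_t = 0.2 keeps the fusion dominated by the bottom-up attention.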
State Representation. Since CNNs show an extraordinary capability to extract appearance features, we propose to utilize the feature volume V^t ∈ R^{C×H×W} from the CNN-based saliency model G for state representation. To save GPU memory while maintaining the representation capability of CNN features, the feature maps of the volume V^t are aggregated and further L2-normalized by global max pooling (\tilde{f}_{GMP}) and global average pooling (\tilde{f}_{GAP}). The normalized features are then concatenated as the observation state representation:

s^t_i = \mathrm{cat}\big(\tilde{f}_{GMP}(S^t \odot V^t_i),\ \tilde{f}_{GAP}(S^t \odot V^t_i)\big),   (3.2)

where ⊙ is the element-wise product on the i-th channel of the feature volume V^t, and cat() is the concatenation along the channel dimension.

3.3.2 Stochastic Multi-task Agent
To simultaneously perform accident anticipation and fixation prediction, the observation state s_t is shared between the two tasks. This state sharing brings two benefits. First, it establishes a causal relationship between the two tasks such that the visual attention modulated by the fixation prediction task can introspectively explain the accident anticipation outcomes, which distinguishes our method from existing explanation-by-rationalization methods [29, 46]. Causal attention has also recently been studied for explainable self-driving [141, 142]. Second, it significantly reduces the communication workload between the environment and the agent, especially when the state s_t is of high dimensionality.

The quality of the state s_t is essential for improving the sample efficiency of DRL-based training. A typical approach is to include an auxiliary observation reconstruction task along with the prediction/control task of the agent [110, 88]. Inspired by this, we propose to use a Regularized Auto-Encoder (RAE) to encode the state s_t into a more compact low-dimensional latent representation z_t, i.e., z_t = E(s_t), where E is the encoder part of the RAE. The decoder of the RAE reconstructs the observation state.

To encourage more exploration of the environment, the agent policies are designed to be stochastic, following recent state-of-the-art DRL algorithms [227, 280, 101]. In our model, an action is associated with both the accident score and the next fixation, drawn from a Gaussian, which can be leveraged for exploration. Therefore, the shared latent embedding z_t is used to predict the mean and the variance of each action dimension for the two tasks by two parallel policy networks, respectively (see the yellow boxes in Fig. 3.2). In the training stage, an action a_t is sampled from the predicted Gaussian distribution, i.e., a_t ∼ π_φ(a_t | s_t), where φ parameterizes the policy network. We implement the two policy networks with two fully-connected layers and ReLU activations. Besides, similar to recent DRL-based attention models [228, 355], an LSTM [113] is placed after the last FC layers to capture the temporal dependency of consecutive actions. In the testing stage, the predicted means of the accident score policy φ_A and the fixation policy φ_F are concatenated as the action output:

\hat{a}_t = \mathrm{cat}\big(\phi_A(E(s_t)),\ \phi_F(E(s_t))\big).   (3.3)

Note that we do not directly predict the top-down attention map but instead predict the fixation point as one of the actions in a_t. The motivation is that predicting an attention map leads to a high-dimensional action space that is not efficient to learn.
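To make the structure of the stochastic multi-task agent concrete, the sketch below shows one possible PyTorch realization of the shared encoder and the two Gaussian policy heads. The layer sizes follow Table 3.4, but the architecture is simplified (the LSTM and the second FC layer of each branch are omitted), and the tanh change-of-variables correction used by SAC is left out, so this is an illustrative sketch rather than the exact implementation.

import torch
import torch.nn as nn
from torch.distributions import Normal

class MultiTaskPolicy(nn.Module):
    """Shared latent state -> two Gaussian heads: accident score (1-D) and next fixation (2-D)."""
    def __init__(self, state_dim=128, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, latent_dim), nn.ReLU())
        self.acc_head = nn.Linear(latent_dim, 2)   # mean and log-std of the accident score
        self.fix_head = nn.Linear(latent_dim, 4)   # mean and log-std of the (x, y) fixation

    def forward(self, state):
        z = self.encoder(state)
        a_mu, a_logstd = self.acc_head(z).chunk(2, dim=-1)
        p_mu, p_logstd = self.fix_head(z).chunk(2, dim=-1)
        acc, fix = Normal(a_mu, a_logstd.exp()), Normal(p_mu, p_logstd.exp())
        raw = torch.cat([acc.rsample(), fix.rsample()], dim=-1)
        action = torch.tanh(raw)                   # squash all three action dims into (-1, 1)
        # joint log-prob = sum over the two heads (cf. the entropy decomposition in Eq. (3.8));
        # the tanh correction term of SAC is omitted here for brevity.
        log_prob = acc.log_prob(raw[..., :1]).sum(-1) + fix.log_prob(raw[..., 1:]).sum(-1)
        return action, log_prob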
3.3.3 Reward Functions With the observed state s𝑡 and the executed action a𝑡, the agent needs a scalar reward 𝑟 from a driving environment to guide its learning. In this paper, we propose a dense anticipation reward 𝑟 𝐴 and a sparse fixation reward 𝑟𝐹 to encourage early, accurate, and explainable decisions such that the total reward at each time step is 𝑟 = 𝑟 𝐴 + 𝑟𝐹. Dense Anticipation Reward. For the accident score 𝑎𝑡, we propose to reward it densely (for all time steps) by considering both the correctness and earliness at each time step. Given a score threshold 𝑎0, we propose a temporally weighted XNOR1 (also called Equivalence Gate) measurement as the accident anticipation reward 𝑟 𝐴: 𝐴 = 𝑤𝑡 · XNOR (cid:2)I[𝑎𝑡 > 𝑎0], 𝑦(cid:3) , 𝑟𝑡 (3.4) where 𝑤𝑡 is the weighting factor and I(·) is an indicator function. The binary label 𝑦 ∈ {0, 1} and 𝑦 = 1 indicates there will be an accident in the future part of the video. The motivation to use XNOR is that it assigns one as the reward to the true predictions (either true positives or true negatives), while assigns zero reward to false predictions. Though in the autonomous driving scenario, false negative is more detrimental than false positive, it is non-trivial to manually design the weights to achieve the balance and it is out of the scope in this paper. Furthermore, to encourage early anticipation (earliness), the temporally weighting factor 𝑤𝑡 in Eq. (3.4) is designed as a normalized expression such that 𝑟 𝐴 and 𝑟𝐹 can be numerically balanced 1XNOR: https://en.wikipedia.org/wiki/XNOR_gate 37 with the same magnitude scale: 𝑤𝑡 = (cid:16) 1 𝑒𝑡𝑎 − 1 𝑒max(0,𝑡𝑎−𝑡) − 1 (cid:17) , (3.5) where 𝑡 and 𝑡𝑎 are the current time step 𝑡 and the beginning time of a future accident, respectively. This factor exponentially decays from 1 to 0 before the accident occurs. Therefore, the earlier the decision is made, the larger reward will be given for the true positive prediction. After the accident occurs at 𝑡𝑎, there is no need to reward the agent. Compared with the exponential binary-cross entropy loss in existing accident anticipation works [29, 315], our dense anticipation reward is more appropriate for DRL training. Sparse Fixation Reward. Different from rewarding the accident scores, rewarding the predicted fixations is more challenging as the ground truth fixation data are valuable and typically only sparsely provided for a few accident frames [68]. To this end, we resort to a sparse rewarding scheme that is widely used in dynamic programming and reinforcement learning. In particular, our sparse fixation reward is given by 𝑟𝑡 𝐹 = I [𝑡 > 𝑡𝑎] exp (cid:18) − || ˆ𝑝𝑡 − 𝑝𝑡 ||2 𝜂 (cid:19) , (3.6) where the indicator function I [𝑡 > 𝑡𝑎] zeroes out the rewards of predictions before a future accident occurs. The ˆ𝑝𝑡 and 𝑝𝑡 are 2-D coordinates of predicted and ground truth fixation point, respectively, defined in video frame space. The fixation points are normalized by the height and the width of the video frame for stable training. The motivation to use the radial kernel based on Euclidean distance is that the closer distance between ˆ𝑝𝑡 and 𝑝𝑡, the larger reward the agent will get. The hyperparameter 𝜂 can be empirically set to ensure the same magnitude between 𝑟𝑡 𝐹 and 𝑟𝑡 𝐴. 3.3.4 Model Training To train the DRIVE model, we follow the soft actor-critic SAC model [101] but extend it to accommodate the accident anticipation task. SAC improves the exploration capacity of the traditional actor-critic RL through policy entropy maximization. 
Specifically, SAC aims to optimize the objective: max 𝜙 𝑇 ∑︁ 𝑡=1 E(s𝑡 ,a𝑡 )∼𝜌 𝜋𝜙 (cid:2)𝑟 (s𝑡, a𝑡) + 𝛼H (𝜋𝜙 (·|s𝑡))(cid:3) (3.7) 38 where 𝛼 is the temperature that controls the contribution from the policy entropy H . To achieve this objective, the actor which is the policy network, and the critic which approximates the state-action value function 𝑄(s, a), are optimized in an interleaved way. In our model, as the stochastic multi-task agent gives separate entropy estimation for accident anticipation and fixation prediction, we propose to express the total entropy as the sum of the entropy from each task. Using the logarithm rule, −H (𝜋𝜙 (·|s𝑡)) can be expressed as −H (𝜋𝜙 ( ˆa|s)) = log (cid:2)𝜋𝜙 𝐴 ( ˆ𝑎|s) · 𝜋𝜙𝐹 ( ˆ𝑝|s)(cid:3) . (3.8) This enables SAC to be extended to the multi-task agent. Our adapted SAC algorithm for training the DRIVE model is briefly summarized in the Algorithm 3.1. Update Critic. For the critic network 𝑄𝜃, it is updated by minimizing the soft Bellman residual: 𝐽 (𝜃) = E (cid:104) (𝑄𝜃 (s, a) − 𝑦(𝑟, s′, a))2(cid:105) , (3.9) where the target 𝑦(𝑟, s′, a) is greedily correlated with the reward 𝑟, the discounted soft Q-target 𝑄 ¯𝜃, and the entropy. Here, the ¯𝜃 are parameters of the soft Q-target network which is the delayed soft copy of the critic network. More details are provided in [101] and our supplementary materials. Update Actor. The policy networks (actor) are updated to maximize Eq. (3.7) by policy gradient method, which is equivalent to minimizing (cid:20) 𝐽𝑜 (𝜙) = E 𝛼 log 𝜋𝜙 ( ˆa|s) − min 𝑗=1,2 𝑄𝜃 𝑗 (s, ˆa) (cid:21) + 𝑤0||𝜙||2, (3.10) where ˆa ∼ 𝜋𝜃 (·|s). The entropy term (logarithm part) is computed by Eq. (3.8). For the second term of expectation, the Clipped Double Q-learning [77] is used in practice. In this paper, we add an 𝐿2 regularizer term for the policy network parameters 𝜙 to mitigate the over-fitting issue. To accommodate SAC with multi-task policies in our model, we separately update each sub- policy network with corresponding losses L 𝐴 and L𝐹 as regularizers: 𝐽 (𝜙 𝐴) = 𝐽𝑜 (𝜙) + 𝑤1E (cid:2)L ( ˆ𝑎𝑡, 𝑡𝑎, 𝑦)(cid:3) 𝐽 (𝜙𝐹) = 𝐽𝑜 (𝜙) + 𝑤2E (cid:2)I[𝑡 > 𝑡𝑎]𝑑 ( ˆ𝑝𝑡, 𝑝𝑡)(cid:3) , (3.11) 39 where 𝜙 = {𝜙 𝐴, 𝜙𝐹 }, and ( ˆ𝑎𝑡, ˆ𝑝𝑡, 𝑡𝑎, 𝑦, 𝑝𝑡) are sampled from a replay buffer D. The distance 𝑑 (·) defines the Euclidean distance. The accident anticipation loss L ( ˆ𝑎𝑡, 𝑡𝑎, 𝑦) follows the definition in [315, 29]. The indicator function I[·] ensures that only fixation points in accident frames can be accessed during training. Note that if 𝐽𝑜 (𝜙) is removed, the SAC algorithm is reduced to a purely supervised learning (SL) without architectural modification. Update Temperature. Recent works [102, 337] show that entropy-based RL training is brittle with respect to the temperature 𝛼. In this paper, we follow the automatic entropy tuning [102] that updates 𝛼 by minimizing 𝐽 (𝛼) = E (cid:2)−𝛼 log 𝜋𝜙 ( ˆa|s) − 𝛼H0 (cid:3) , (3.12) where the negative target entropy −H0 is empirically set to the dimension of the action a. In this paper, we found that 𝛼 could be updated to zero such that the entropy (logarithm term) is hard to be optimized. To tackle this problem, we propose to clip 𝛼 before it is updated: 𝛼 ← max(𝛼 − 𝜆𝛼 ˆ∇𝛼𝐽 (𝛼), 𝛼0) (3.13) where 𝛼0 is a small value for 𝛼. This enables sufficient exploration of the agent during training. Update RAE. 
The regularized auto-encoder (RAE) basically imposes 𝐿2 regularizers on both the latent representation and model parameters for reconstruction learning: 𝐽𝑅 𝐴𝐸 (𝛽) = L𝑟𝑒𝑐 (s; 𝛽) + 𝑤0||𝛽||2 + 𝑤s||z||2, (3.14) where 𝛽 are decoder parameters and z is the encoded state representation by RAE encoder. Similar to the existing work [369], the encoder parameters are updated by the critic loss 𝐽 (𝜃) and the RAE loss 𝐽𝑅 𝐴𝐸 (𝛽) while the decoder parameters are only updated by 𝐽𝑅 𝐴𝐸 (𝛽). To enable the training on large-scale real-world videos, we reconstruct the observation state s rather than raw pixels as done in [369]. Summary of Our DRL Contribution. In this paper, the existing SAC algorithm is adapted to the real-world applications, which bridges the gap between simulation-based DRL applications and the challenging real-world tasks. Besides, for traffic accident anticipation, two novel reward 40 functions by considering the earliness, correctness, and attentiveness are developed to guide the SAC-based model training. Moreover, to enable the multi-task learning by SAC, the proposed action entropy decomposition as well as other training techniques such as the temperature clipping and state reconstruction are empirically found useful. 3.4 Experiments Datasets. Our method is evaluated on two traffic accident datasets, i.e., DADA-2000 [68] and DAD [29]. For the DADA-2000 dataset, we only use the beginning times of accidents and fixations of accident frames as ground truth. DAD is an accident dataset, in which the beginning times of accidents are fixed at the 90th frame for positive clips. The video clips of the two datasets are all 5 seconds long. Evaluation Protocols. In this paper, we use the video-level Area Under ROC curve (AUC) to evaluate the anticipation correctness and the average time-to-accident (TTA) with score threshold 0.5 to evaluate the earliness. For classification metrics AUC, we only evaluate the predictions of accident frames since the output represents the occurrence probability of a future accident. To evaluate the visual attention, we adopt similarity (SIM), linear correlation coefficients (CC), and Kullback-Leibler distance (KLD). Smaller KL values indicate better performance. Implementation Details. The proposed DRIVE model is implemented with the PyTorch framework. We adopt VGG-16-based MLNet [47] as the saliency module. The saliency module is pre-trained on the fixation data of the DADA-2000 training set and the parameters are kept frozen in DRIVE training. For the DAD dataset, as the fixations are not annotated, we remove the fixation prediction policy and top-down attention. For all datasets, the video frames are resized and zero-padded to 480 × 640 with an equal scaling ratio. The 𝑚 of 𝜌 in Eq. (3.1) and score threshold 𝑎0 in Eq. (3.4) are set to 0.5 by default. We use the Adam optimizer for all gradient descent processes and train the DRIVE for 50 epochs. Other parameter settings are in the supplement. 3.4.1 Main Results Baselines. We compare the proposed DRIVE with DSA-RNN [29] and UString [315] since their source codes are available. We also implement the accident anticipation loss function 41 Table 3.1 Comparison with state-of-the-art methods. The best results are marked with bold fonts. AUC and TTA evaluate the correctness and earliness of accident anticipation, respectively. 
Methods           DADA-2000 [68]           DAD [29]
                  AUC (%)    TTA (s)       AUC (%)    TTA (s)
DSA-RNN [29]      47.19      3.095         71.57      1.169
AdaLEA [303]      55.05      3.890         58.06      2.228
UString [315]     60.19      3.849         65.96      0.915
DRIVE (ours)      72.27      3.657         93.82      2.781

Figure 3.4 Visualization on the DADA-2000 dataset. The panels show the ground truth, bottom-up attention, top-down attention, and DAF attention at t = 35, 85, and 145, together with the predicted probability curve. The shaded region on the curve figure is the accident window (FPS=30). For this example, with the operation threshold 0.5 (dashed horizontal line) and a five-frame decision window, the model could anticipate the future accident at around the 42-nd frame, which is more than 3 seconds before the accident occurs.

AdaLEA [303] on top of the DSA-RNN method (AdaLEA). Note that all methods use VGG-16 [292] as the backbone. The AUC and TTA results on the DADA-2000 and DAD datasets are reported in Table 3.1.

Results for Accident Anticipation. Table 3.1 shows that our DRIVE method significantly outperforms the other baselines on both the DADA-2000 and DAD datasets in terms of the AUC metric. This demonstrates that our method accurately anticipates whether a future accident will occur. Note that AdaLEA achieves the best TTA performance on the DADA-2000 dataset, i.e., 0.23 seconds higher than our DRIVE method. The advantage of AdaLEA on the TTA metric can be attributed to the fact that, during training, AdaLEA utilizes the validation set to compute TTA and drive the model toward early anticipation. In contrast, we do not use validation-set guidance but still achieve comparable TTA results on DADA-2000 and much better TTA on DAD.

Figure 3.5 Experimental results on the DADA-2000 dataset: (a) intervention on attention; (b) reward curves.

3.4.2 Visual Explanation Results
Correlation Results. To investigate how the visual attention mechanism could explain the accident anticipation decision, we jointly compare the performance of the fused saliency (Eq. (3.1)) and the corresponding AUC score for both the proposed dynamic attention fusion (DAF) strategy and an alternative, static attention fusion (SAF), for which the fusion parameter is manually set and not updated. Results are reported in Table 3.2. We can see that DAF consistently outperforms SAF for both saliency prediction and accident prediction (AUC), which demonstrates the superiority of DAF and the strong correlation between visual attention and accident anticipation.

Causality Results. The visual attention learned by the proposed DRIVE model should exhibit the causality of the accident anticipation performance. Therefore, inspired by causal saliency analysis [141] and counterfactual visual explanation [93], we adopt two different intervention tests on the saliency S_t in Eq. (3.1), i.e., removing the attention (S_t ← 1) and inverting the attention (1 - S_t) in the testing stage. Results are reported in Fig. 3.5a. It shows that with either recall rate or frame-level AUC (f-AUC), the performance of the DRIVE model decreases in both test cases, which demonstrates the causal relation between the learned DAF visual attention and accident anticipation.
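The two interventions amount to overwriting the fused attention at test time; a minimal sketch (assuming the fused map S_t from Eq. (3.1) is available as a tensor) is:

import torch

def intervene_attention(s_t, mode="none"):
    """Test-time interventions used for the causality analysis:
    'remove'  -> S_t <- 1 (uniform attention, i.e., no re-weighting)
    'inverse' -> S_t <- 1 - S_t (attend to the complementary regions)."""
    if mode == "remove":
        return torch.ones_like(s_t)
    if mode == "inverse":
        return 1.0 - s_t
    return s_t  # 'none': keep the learned attention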
Table 3.2 Evaluation of visual attention and accident anticipation on DADA-2000. The best results are shown in bold. The parameters (Params) represent the values of ρ for SAF and m for DAF.

Params       0.5               0.8               1.0
Methods      SAF      DAF      SAF      DAF      SAF       DAF
AUC          0.645    0.659    0.691    0.726    0.632     0.679
SIM          0.188    0.192    0.144    0.158    0.080     0.112
CC           0.322    0.331    0.190    0.226    0.079     0.143
KLD (↓)      2.679    2.654    3.087    2.986    12.948    7.836

Attention Visualization. In Fig. 3.4, we visualize the saliency maps at three representative time steps using test data from the DADA-2000 dataset. For reference, the curve of the predicted accident probability is also presented. We can see that the bottom-up attention captures visually attentive regions while the top-down attention indicates the risky region both before and after the accident occurs. The DAF attention maps exhibit the fused attention.

3.4.3 Ablation Studies
In Table 3.3, we report the results of ablation studies on the DADA-2000 dataset. In the first row, we remove the fixation prediction policy and the corresponding learning objectives; the AUC is about 10% lower than that of our full model (the last row). The second row shows that the RAE module also contributes substantially to the performance gain. To further see whether it is the DRL framework itself that leads to the good performance, we keep the DRIVE architecture unchanged and only remove J_O(φ) in Eq. (3.11) when training the multi-task policy networks, such that the algorithm reduces to supervised learning (SL). Results in the third row of Table 3.3 show that the DRL-based learning method (SAC) is superior to the SL algorithm for accident anticipation.

To show the performance of the SAC-based DRIVE variants during training, we plot their reward curves in Fig. 3.5b. It shows that training the DRIVE model by SAC + RAE achieves a stably increasing reward. Besides, both top-down attention (w/o BU Att) and bottom-up attention (w/o TD Att) contribute to the learning process. In particular, we see that RAE contributes most to the performance gain.

Table 3.3 Ablation studies on the DADA-2000 dataset. In the Type column, "RL" and "SL" represent reinforcement learning and supervised learning, respectively.

Type   SAC   RAE   Fixations   AUC (%)
RL     ✓     ✓     ✗           61.91
RL     ✓     ✗     ✓           66.21
SL     ✗     ✓     ✓           63.96
RL     ✓     ✓     ✓           72.27

3.5 Conclusion
In this paper, we propose the DRIVE model to anticipate traffic accidents from dashcam videos. Based on deep reinforcement learning (DRL), we explicitly simulate both the bottom-up and top-down visual attention in the traffic observation environment and develop a stochastic multi-task agent to dynamically interact with the environment. The DRIVE model is learned with the improved DRL algorithm SAC. Experimental results on real-world traffic accident datasets show that our method achieves the best anticipation performance as well as good visual explainability.

3.6 Supplementary Material
This document provides further details about the training algorithm of SAC [101] and the implementation settings.

3.6.1 Soft Actor Critic
As introduced in Section 3.3.4 of the main paper, the original soft actor critic (SAC) [101] algorithm needs to be substantially adapted to our DRIVE model training. Algorithm 3.1 summarizes the training steps of the improved SAC algorithm. First, the transitions, including the current state s_t, action a_t, immediate reward r_t, next state s_{t+1}, and the hidden state of the LSTM layer h_t, are gathered into the replay buffer D.
For each gradient step, a mini-batch of transitions is uniformly sampled from D to update the different model components, including the policy networks (actor), the Q-networks (critic), and the RAE. As the actor update, automatic entropy tuning, and RAE update are elaborated in the main paper, here we only present more details about how the critic networks are learned during SAC training.

To update the critic, Clipped Double Q-learning [77] is used in practice, in which two identical Q-networks θ_i (i ∈ {1, 2}) are maintained. The loss function takes the sum of the losses from the two outputs, i.e., J(θ) = Σ_i J(θ_i), where each J(θ_i) is defined as the expectation of a mean-squared error:

J(\theta_i) = \mathbb{E}\big[ \big( Q_{\theta_i}(\mathbf{s}, \mathbf{a}) - y(r, \mathbf{s}', \mathbf{a}) \big)^2 \big].   (3.15)

Here, the optimization target y(r, s', a) is defined as

y(r, \mathbf{s}', \mathbf{a}) = r + \gamma (1 - d) \Big( \min_{j=1,2} Q_{\bar{\theta}_j}(\mathbf{s}', \hat{\mathbf{a}}') - \alpha \log \pi_\theta(\hat{\mathbf{a}}' \mid \mathbf{s}') \Big),   (3.16)

where r is the reward batch, γ is the discounting factor, and d labels whether the sampled transitions are at the last step T. Note that the state s' is the batch of next states from the replay buffer, while the action â' is sampled from the output of the pre-updated policy network π_θ, i.e., â' ∼ π_θ(·|s'), which makes SAC an off-policy method. The entropy term log π_θ(â'|s') is obtained by Eq. 3.8 in the main paper. In this paper, the critic network parameters θ are updated more frequently than the other parameters by the gradients of J(θ) to achieve more stable training.

Algorithm 3.1 Improved SAC for the DRIVE Model Training
Require: θ1, θ2, φ, β                                  ⊲ Initial parameters
  θ̄1 ← θ1, θ̄2 ← θ2                                    ⊲ Initialize target networks
  D ← ∅, h0 ← 0                                        ⊲ Replay buffer and hidden states
  for each iteration do
    for each environment step do
      Sample actions (a_t, h_t) ∼ π_φ(a_t | s_t, h_{t-1})
      Compute state s_t with actions                   ⊲ See Eq. 3.2
      Compute reward r_t = r_t^A + r_t^F               ⊲ See Eq. 3.4-3.6
      D ← D ∪ {(s_t, a_t, r_t, h_t, s_{t+1})}
    end for
    for each gradient step do
      for each critic update do
        θ ← θ − λ ∇̂_θ J_Q(θ)                           ⊲ Update by Eq. 3.9
      end for
      φ ← φ − λ ∇̂_φ J_π(φ)                             ⊲ Update by Eq. 3.11
      α ← max(α − λ_α ∇̂_α J(α), α0)                    ⊲ See Eq. 3.12
      θ̄ ← τθ + (1 − τ)θ̄                               ⊲ Update Q-target
      β ← β − λ ∇̂_β J_RAE(β)                           ⊲ Update by Eq. 3.14
    end for
  end for
Ensure: θ1, θ2, φ, β

Table 3.4 summarizes the hyperparameter settings used in the experiments. Note that the major hyperparameters follow the existing literature [101]. For the different datasets, we used the same set of hyperparameters and did not tune them specifically.
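A compact sketch of the critic target and loss in Eqs. (3.15)-(3.16) is given below (PyTorch; the Q-networks, the policy, and the batch tensors are placeholders, and all batch tensors are assumed to have shape (B, 1)).

import torch

@torch.no_grad()
def critic_target(reward, next_state, done, target_q1, target_q2, policy, alpha, gamma=0.99):
    """Eq. (3.16): y = r + gamma * (1 - d) * (min_j Q_target_j(s', a') - alpha * log pi(a'|s'))."""
    next_action, next_log_prob = policy(next_state)          # a' ~ pi(.|s'), plus its log-prob
    next_log_prob = next_log_prob.reshape(-1, 1)
    q_min = torch.min(target_q1(next_state, next_action),
                      target_q2(next_state, next_action))    # clipped double Q
    return reward + gamma * (1.0 - done) * (q_min - alpha * next_log_prob)

def critic_loss(q1, q2, state, action, y):
    """Eq. (3.15): sum of the MSE terms of the two Q-networks against the shared target y."""
    return ((q1(state, action) - y) ** 2).mean() + ((q2(state, action) - y) ** 2).mean()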
In order to map the values to accident scores 𝑎𝑡 and fixation coordinates 𝑝𝑡, we linearly scale the values by 𝑎𝑡 = 0.5(a(0) 𝑡 + 1.0) 𝑝𝑡 = 𝜓(a(1) 𝑡 , a(2) 𝑡 ) (3.17) (3.18) where the equation of 𝑎𝑡 applied to tanh activation is equivalent to sigmoid activation on FC layer output. The function 𝜓 maps the scaling factors (within (−1, 1)) defined in image space 𝐻 × 𝑊 to the input space ℎ × 𝑤. This scaling process is illustrated in Fig. 3.6. Implementation. Our training algorithm is implemented based on the SAC source code2. Since the image foveation method [85] incurs computational cost due to the Gaussian pyramid filtering, we implement this algorithm as well as all the DRL environmental components by PyTorch to support for GPU acceleration. For DADA-2000 videos, the positive video clips (contains accident) 2SAC: https://github.com/pranz24/pytorch-soft-actor-critic 47 Table 3.4 SAC Hyperparameter Settings Parameters general learning rate (𝜆) temperature learning rate (𝜆𝛼) discounting factor (𝛾) replay buffer size (D) target smoothing coefficient (𝜏) temperature threshold (𝛼0) weight decay (𝑤0) anticipation loss coefficient (𝑤1) fixation loss coefficient (𝑤2) latent regularizer coefficient (𝑤𝑠) sparse fixation reward parameter (𝜂) gradient updates per time step actor gradient updates per time step dim. of FC/LSTM layers output dim. of latent embedding (z) dim. of state (s) dim. of action (a) sampling batch size video batch size values 3 · 10−4 5 · 10−5 0.99 106 0.005 10−4 10−5 1 10 10−4 0.1 4 2 64 64 128 3 64 5 Figure 3.6 The scaling process (𝜓). The continuous values a(1) 𝑡 which are within (−1, 1) defined in video frame space 𝐻 × 𝑊 are mapped into the discrete input space ℎ × 𝑤 to represent the 2-D coordinates of a fixation point. and a(2) 𝑡 are obtained by trimming the video into be 5 seconds where the beginning times are placed in the last one second with random jittering, while the negative video clips are randomly sampled without overlap with positive clips. The spatial and temporal resolutions for DADA-2000 videos are reduced with ratio 0.5 and interval 5, respectively, so that 30 time steps are utilized and for 48 𝐻𝑊original frameinputspacescaled frame𝑤ℎ each step the observation frames are with the size 330 × 792. For DAD dataset, we only reduce the temporal resolution with interval 4 so that 25 time steps of each 5-seconds video clip are used. 49 CHAPTER 4 EGOCENTRIC 3D TRAJECTORY FORECASTING 4.1 Introduction Egocentric video understanding aims to understand the camera wearers’ behavior from the first-person view. It is receiving increasing attention in recent years [95, 181, 90, 202, 177, 238, 334, 203, 50, 49, 182] due to its analogousness to the way human visually perceives the world. An important egocentric vision task is to forecast the egocentric hand trajectory of the camera wearer [205], which has great value in Augmented/Virtual Reality (AR/VR) applications. For example, the predicted 3D trajectories can help plan and stabilize a patient’s 3D hand motion who has upper-limb neuromuscular disease [181]. Besides, the early predicted 3D hand trajectory is key to reducing rendering latency in VR games for achieving an immersive gaming experience [80]. In the existing literature, egocentric 3D hand trajectory forecasting is far from being explored. The method in [205] could only predict 2D trajectory on an image and cannot forecast precise 3D hand movements. 
Recent works [17, 331, 379] predict the trajectory or 3D human motions from egocentric views, but they do not predict the 3D trajectory of the camera wearer. Besides, though forecasting the 3D hand pose provides fine-grained information about 3D hands [63], it is out of our scope as we focus on the camera wearers’ planning behavior revealed by 3D hand trajectory. The challenges of egocentric video-based 3D trajectory forecasting are significant. First, accurate large-scale 3D trajectory annotations are labor-intensive and expensive. They rely on wearable markers or multi-camera systems for hand motion capture in a controlled environment. Second, learning the depth of 3D trajectory from egocentric videos is challenging. On one hand, using 2D video frames to estimate 3D trajectory depth is an ill-conditioned problem similar to other monocular 3D tasks [309, 253, 313]. Even if the historical 3D hand trajectory is utilized This chapter is adapted from the following publication: "Wentao Bao, Lele Chen, Libing Zeng, Zhong Li, Yi Xu, Junsong Yuan, and Yu Kong. Uncertainty-aware state space transformer for egocentric 3D hand trajectory forecasting. In International Conference on Computer Vision (ICCV), 2023." 50 Figure 4.1 Egocentric 3D Hand Trajectory Forecasting. Our goal is to predict the future 3D hand trajectory (in red) given the past observations of egocentric video and trajectory (in blue). Compared to the 2D image space, predicting trajectory in a global 3D space is practically more valuable to understanding human intention and behavior in AR/VR applications. as the input, how to exploit the visual and trajectory information for forecasting is still nontrivial. On the other hand, due to the inevitable camera motion in an egocentric view, the background of the scene is visually dynamic which poses a significant barrier to inferring the foreground depth [187, 397]. Third, as a Seq2Seq forecasting problem, it is critical to formulate the latent transition dynamics [91, 10] that allows the variances of data due to anytime forecasting and limited observations. In this paper, we address these challenges by developing an uncertainty-aware state space transformer (USST). It follows the state-space model [263] by taking the observed egocentric RGB videos with the historical 3D trajectory as input to predict future 3D trajectory. Our model deals with the depth noise of trajectory annotation by introducing the depth robust aleatoric uncertainty in training. To fuse the information from the dense RGB pixels and sparse historical trajectory, we leverage visual and temporal transformer encoders as backbones and utilize the recent visual prompt tuning (VPT) to enhance the visual features. Following the state space model, we develop a novel attention-based state transition module and an emission module with a predictive link to predict the 3D global trajectory coordinates. Moreover, to take the hand motion inertia into consideration, we propose a velocity constraint to regularize the model training, which helps generalize to unseen scenarios. To enable egocentric 3D hand trajectory prediction, we follow [181] to develop a scalable annotation workflow to automate the annotation on RGB-D data from head-mounted Kinect devices. 51 3D space2D space3D space2D space3D space2D spacefuture⋯⋯past In particular, camera motion is estimated to transform the 3D trajectory annotations from local to global camera coordinate system. 
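The final step of this annotation workflow is a standard geometric transform; the sketch below (Python/NumPy) is an assumption of how it might be implemented, taking the camera intrinsics K and an estimated per-frame camera-to-world pose (R, t) as given: an annotated hand pixel with its metric depth is back-projected into the camera frame and then mapped into the shared global frame.

import numpy as np

def pixel_to_camera(u, v, depth, K):
    """Back-project a pixel (u, v) with metric depth into the camera coordinate system."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

def camera_to_global(p_cam, R_wc, t_wc):
    """Map a camera-frame 3D point into the global (world) frame using the
    estimated camera-to-world rotation R_wc and translation t_wc of that frame."""
    return R_wc @ p_cam + t_wc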
Experimental results on H2O and EgoPAT3D datasets show that our method is effective and superior to existing Transformer-based approaches [205] and other general Seq2Seq models. In summary, our contributions are as follows: • We propose an uncertainty-aware state space transformer (USST) that consists of a novel state transition and emission, aleatoric uncertainty, and visual prompt tuning, which are empirically found effective. • We collected and will release our annotations on H2O [162] and EgoPAT3D [181] datasets that will benefit the egocentric 3D hand trajectory forecasting research. • We benchmarked recent methods on the proposed task and experimental results show that our method achieves the best performance and could be generalizable to unseen egocentric scenes. 4.2 Related Work Trajectory Prediction Predicting the physical trajectory of moving objects is a long-standing research topic. It has been widely studied in applications for pedestrians [2, 393], vehicle [59, 219]. Many of them are developed for the third-person view and predict trajectory in 2D pixel space. Given that the first-person view is more realistic in AR/VR applications, recent few works [205, 203] are trying to predict the hand-object interaction from egocentric videos. Though the method in [17] predicts the pedestrian trajectory in 3D space, their method primarily addresses the social interaction of multiple pedestrians. The recent work [260] also targets the egocentric 3D trajectory for pedestrian scenarios, but their method leverages depth modality and nearby person’s trajectory as context information which are practically uneasy to collect. Besides, due to the annotation noise and uncertain nature of trajectory prediction, probabilistic modeling is widely adopted in existing literature [341, 320, 219]. In this paper, following the probabilistic setting, we step toward the egocentric 3D hand trajectory prediction using practically accessible RGB videos for AR/VR scenarios. 52 Figure 4.2 Proposed USST Model. Given the RGB frames and 3D hand locations of 𝐶 observed time steps, we extract and concatenate their features as x1:𝐶 by the prompted backbone 𝑓V and MLP model 𝑓T , which are further fed into transformer encoders to produce temporal observations o1:𝐶. Together with positional encodings PE1:𝑇 of the full horizon, our state transition layer could recursively extrapolate the latent states z𝐶+1:𝑇 for 3D trajectory forecasting along with uncertainty and velocity (𝜶 and v) in 𝑇 − 𝐶 future steps. Egocentric Video Representation Egocentric videos are recorded in a first-person view. Dif- ferent from the videos in a third-person view, learning an egocentric video representation is more challenging due to the dynamic background caused by camera motion and implicit intention of activities from camera wearers [177, 70, 244]. For 3D trajectory prediction, existing commonly used egocentric video datasets such as EPIC-Kitchens [49, 50] do not provide the depth informa- tion and camera parameters, which are essential for annotating the 3D hand trajectory. Though the recent Ego4D [95] benchmark provides a hand forecasting subset, the annotations are defined as 2D locations in image space. Therefore, we resort to a cost-effective workflow to collect 3D hand trajectory annotations from existing 3D hand pose datasets such as the EgoPAT3D [181] and H2O [162] datasets. Our annotation workflow can be deployed to any egocentric dataset collected by head-mounted RGB-D sensors. 
State Space Model The state-space model (SSM) originates from the control engineering field. It is conceptually general and inspires many classical SSMs such as Kalman filtering for prediction tasks. Recent deep SSMs [13, 43, 75] are increasingly popular by combining Recurrent Neural Networks (RNNs) with Variational AutoEncoders (VAEs). However, these approaches are limited in practice when modeling complex long-term dependencies in highly structured sequential data. In [263], a deep SSM is proposed by combining Kalman filtering with deep neural networks; however, its linear Gaussian assumption rarely holds for high-dimensional data in the real world. To address these challenges, ProTran [305] introduces a probabilistic Transformer [328] under a variational SSM for time-series prediction, but maximizing the variational lower bound of ProTran suffers from the notorious KL vanishing issue [76]. AgentFormer [380] can also be regarded as a Transformer-based SSM, but its autoregressive decoding limits its efficiency to low-dimensional motion data. In this paper, we propose a Transformer-based SSM that captures long-term dependencies and latent dynamics efficiently in practice.

4.3 Approach

Problem Setup As shown in Fig. 4.1, the egocentric 3D hand trajectory forecasting model takes as input C observed RGB frames V_{1:C} = {I_1, ..., I_C} and the 3D hand trajectory T_{1:C} = {p_1, ..., p_C} to predict the future 3D hand trajectory T_{C+1:T} = {p_{C+1}, ..., p_T} in a finite horizon T. Here, I_t ∈ R^{H×W×3} and p_t = [x_t, y_t, z_t]^T are the egocentric RGB frame and the 3D hand trajectory point at time step t, respectively. In practice, the 3D point p_t is defined in a global 3D world coordinate system. The ultimate goal is to learn a model Φ by maximizing the expected likelihood over the training dataset D:
$$\max_{\boldsymbol{\Phi}} \; \mathbb{E}_{\mathcal{V},\mathcal{T} \sim \mathcal{D}} \left[ p_{\boldsymbol{\Phi}}(\mathcal{T}_{C+1:T} \mid \mathcal{T}_{1:C}, \mathcal{V}_{1:C}) \right]. \qquad (4.1)$$
In this paper, we formulate the problem as a state-space model. In the following sections, we will introduce the proposed model in detail.

4.3.1 Uncertainty-aware State Space Transformer

Existing SSMs can capture the probabilistic nature of trajectory prediction. However, they do not explicitly handle the data noise issue that is commonly encountered when using RGB-D sensors to obtain 3D trajectory annotations. To mitigate the uncertainty from data labeling, we follow this line of research [136, 108] and propose an uncertainty-aware state space transformer (USST) to handle the dynamics of 3D hand trajectories in egocentric scenes. Formally, following latent variable modeling, the probability in Eq. (4.1) over a full sequence T_{1:T} can be factorized by introducing T latent variables Z_{1:T}:
$$p(\mathcal{T}_{1:T} \mid \mathcal{T}_{1:C}, \mathcal{V}_{1:C}) = \int p(\mathcal{T}_{1:T} \mid \mathbf{Z}_{1:T}) \, p(\mathbf{Z}_{1:T} \mid \mathcal{T}_{1:C}, \mathcal{V}_{1:C}) \, d\mathbf{Z}, \qquad (4.2)$$
where the state transition p(Z_{1:T} | T_{1:C}, V_{1:C}) and the emission p(T_{1:T} | Z_{1:T}) are learned from the data D. Following the SSM formulation, we propose to factorize the two terms using independence assumptions:
$$p_\theta(\mathbf{Z}_{1:T} \mid \mathcal{T}_{1:C}, \mathcal{V}_{1:C}) = \prod_{t=1}^{T} p_\theta(\mathbf{z}_t \mid \mathbf{z}_{1:t-1}, \mathbf{p}_{1:C}, \mathbf{I}_{1:C}), \qquad p_\phi(\mathcal{T}_{1:T} \mid \mathbf{Z}_{1:T}) = \prod_{t=1}^{T} p_\phi(\mathbf{p}_t \mid \mathbf{z}_t, \mathbf{p}_{t-1}), \qquad (4.3)$$
where the latent variable z_t ∈ Z_{1:T} is generated by taking as input z_{t-1} together with the observed trajectory points p_{1:C} and RGB frames I_{1:C}.
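To make the factorization in Eqs. (4.2)–(4.3) concrete, the following minimal PyTorch-style sketch shows how a forecaster of this form rolls out: a transition module recursively produces z_t from the previous latent state and the encoded observation context, and an emission module decodes each z_t together with the previous trajectory point into the next prediction. The module names and the GRU-based transition are illustrative placeholders standing in for the attention-based transition developed below, not the released implementation.

```python
import torch
import torch.nn as nn

class StateSpaceForecaster(nn.Module):
    """Schematic rollout of Eq. (4.3): recursive transition + per-step emission."""
    def __init__(self, obs_dim=512, latent_dim=16, point_dim=3):
        super().__init__()
        # placeholder transition: z_t = f(z_{t-1}, pooled observation context)
        self.transition = nn.GRUCell(obs_dim, latent_dim)
        # placeholder emission: p_t = g(z_t, p_{t-1})
        self.emission = nn.Sequential(
            nn.Linear(latent_dim + point_dim, 128), nn.ReLU(),
            nn.Linear(128, point_dim))

    def forward(self, obs_ctx, past_points, horizon):
        # obs_ctx: (B, C, obs_dim) encoded observations o_{1:C}
        # past_points: (B, C, 3) observed trajectory p_{1:C}
        B = obs_ctx.size(0)
        ctx = obs_ctx.mean(dim=1)              # stand-in for attention over o_{1:C}
        z = obs_ctx.new_zeros(B, self.transition.hidden_size)
        p_prev = past_points[:, -1]            # last observed point p_C
        preds = []
        for _ in range(horizon):               # t = C+1, ..., T
            z = self.transition(ctx, z)        # p_theta(z_t | z_{t-1}, context)
            p_prev = self.emission(torch.cat([z, p_prev], dim=-1))  # p_phi(p_t | z_t, p_{t-1})
            preds.append(p_prev)
        return torch.stack(preds, dim=1)       # (B, horizon, 3)

# toy usage
model = StateSpaceForecaster()
future = model(torch.randn(2, 15, 512), torch.randn(2, 15, 3), horizon=10)
print(future.shape)  # torch.Size([2, 10, 3])
```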
To address the label noise issue, we formulate the emission model p_φ(p_t | z_t, p_{t-1}) as a probabilistic module to learn the aleatoric uncertainty. In the following paragraphs, we elaborate on feature embedding, state transition, and probabilistic prediction.

Visual and Trajectory Embedding As advocated by [305, 380], Transformers are effective at capturing long-term dependencies in sequential data. Therefore, we propose to leverage visual and temporal Transformers [328] as encoders to learn features from the dense RGB frames and the sparse trajectory points. Specifically, we first embed the observed sequence of egocentric RGB frames and 3D trajectory by models f_V and f_T, followed by modality-specific transformers g_V and g_T. This process can be expressed as
$$[\mathbf{x}^{(\mathcal{V})}_t, \mathbf{x}^{(\mathcal{T})}_t] = [f_\mathcal{V}(\mathbf{I}_t), f_\mathcal{T}(\mathbf{p}_t)], \qquad \mathbf{o}^{(\mathcal{V})}_1, \ldots, \mathbf{o}^{(\mathcal{V})}_C = g_\mathcal{V}(\mathbf{x}^{(\mathcal{V})}_1, \ldots, \mathbf{x}^{(\mathcal{V})}_C), \qquad \mathbf{o}^{(\mathcal{T})}_1, \ldots, \mathbf{o}^{(\mathcal{T})}_C = g_\mathcal{T}(\mathbf{x}^{(\mathcal{T})}_1, \ldots, \mathbf{x}^{(\mathcal{T})}_C), \qquad (4.4)$$
where f_V is a vision backbone, e.g., ResNet [106] or ViT [65], and f_T is implemented as an MLP following [205, 181]. g_V and g_T are transformer encoders that consist of B stacked multi-head attention blocks. For each block b, a single-head attention block can be expressed as
$$\mathrm{Attn}(\mathbf{Q}_b, \mathbf{K}_b, \mathbf{V}_b) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_b \mathbf{K}_b^\top}{\sqrt{d}} + \mathbf{M}\right)\mathbf{V}_b, \qquad (4.5)$$
where Q_b, K_b, V_b ∈ R^{T×d} are the projected query, key, and value matrices from the output of the previous block b−1, i.e., [Q_b; K_b; V_b] = [W^Q_b Q_{b-1}; W^K_b K_{b-1}; W^V_b V_{b-1}]. All the W_b are learnable parameters. The binary mask M ∈ R^{T×T} zeros out the last T−C columns and rows for the trajectory prediction problem. To capture global temporal interaction, the inputs of the first block, Q_0, K_0, and V_0, are all set to x_t + PE(t), where PE(t) is the positional encoding for t ∈ [1, T].

Transformer Transition With the encoded observations o_t = [o^(V)_t; o^(T)_t], where we use [;] to denote feature concatenation, it is critical to formulate the state transition and future trajectory prediction based on Eq. (4.3). Inspired by [305], we propose an attention-based autoregressive module to formulate the posterior p_θ(z_t | z_{1:t-1}, o_{1:C}). Specifically, we first embed o_t with positional encoding by h_t = LayerNorm(MLP(o_t) + PE(t)). Then, the latent state z_t is recursively encapsulated via the hidden variables w̄_t and ŵ_t by attention modules (illustrated in Fig. 4.3):
$$\bar{\mathbf{w}}_t = \mathrm{LayerNorm}([\mathbf{z}_{t-1}; \mathrm{Attn}(\mathbf{z}_{t-1}, \mathbf{z}_{1:t-1}, \mathbf{z}_{1:t-1})]),$$
$$\hat{\mathbf{w}}_t = \mathrm{LayerNorm}([\bar{\mathbf{w}}_t; \mathrm{Attn}(\bar{\mathbf{w}}_t, \mathbf{h}_{1:C}, \mathbf{h}_{1:C})]),$$
$$\mathbf{z}_t = \mathrm{LayerNorm}(\mathrm{MLP}([\hat{\mathbf{w}}_t; \mathrm{MLP}(\hat{\mathbf{w}}_t)]) + \mathrm{PE}(t)), \qquad (4.6)$$
where the two multi-head attention modules attend to the previously generated states z_{1:t-1} and the hidden states of all observations h_{1:C}. Contrary to ProTran, we use concatenation [;] rather than addition before layer normalization. The insight behind this is that the queried feature from the historical context can be better preserved without being dominated by z_{t-1} in the addition operation. Moreover, we remove the stochasticity of z_t and instead use a probabilistic decoder, as introduced next, to handle the dynamics of trajectory prediction. The benefit is that we avoid the KL divergence vanishing issue of optimizing the ELBO objective, which is known to exist in variational recurrent models [76].

Figure 4.3 Unrolled illustration of Eq. (4.6). Bold arrows are learnable, and green dashed lines show the attention ranges.
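The following sketch illustrates one transition step of Eq. (4.6) in PyTorch. Because the concatenation [;] doubles the feature width, this sketch inserts width-reducing linear projections after each concatenation; those projections, the head count, and the module names are assumptions of this illustration (the equation leaves the exact dimensionality handling to the implementation), not the exact released code.

```python
import torch
import torch.nn as nn

class TransitionLayer(nn.Module):
    """Sketch of one transition step z_{t-1} -> z_t following Eq. (4.6)."""
    def __init__(self, latent_dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)
        # assumed projections that map the concatenated (2*D) features back to D
        self.proj1 = nn.Sequential(nn.Linear(2 * latent_dim, latent_dim), nn.LayerNorm(latent_dim))
        self.proj2 = nn.Sequential(nn.Linear(2 * latent_dim, latent_dim), nn.LayerNorm(latent_dim))
        self.mlp = nn.Linear(latent_dim, latent_dim)
        self.out_mlp = nn.Linear(2 * latent_dim, latent_dim)
        self.out_norm = nn.LayerNorm(latent_dim)

    def forward(self, z_hist, h_obs, pe_t):
        # z_hist: (B, t-1, D) previously generated states z_{1:t-1}
        # h_obs:  (B, C, D)   embedded observations h_{1:C}
        # pe_t:   (B, D)      positional encoding PE(t) of the current step
        z_prev = z_hist[:, -1:]                                   # z_{t-1}
        a1, _ = self.self_attn(z_prev, z_hist, z_hist)            # Attn(z_{t-1}, z_{1:t-1}, z_{1:t-1})
        w_bar = self.proj1(torch.cat([z_prev, a1], dim=-1))
        a2, _ = self.cross_attn(w_bar, h_obs, h_obs)              # Attn(w_bar, h_{1:C}, h_{1:C})
        w_hat = self.proj2(torch.cat([w_bar, a2], dim=-1))
        fused = self.out_mlp(torch.cat([w_hat, self.mlp(w_hat)], dim=-1))
        return self.out_norm(fused + pe_t.unsqueeze(1))           # (B, 1, D) new state z_t

# toy usage
layer = TransitionLayer(latent_dim=256)
z_t = layer(torch.randn(2, 5, 256), torch.randn(2, 15, 256), torch.randn(2, 256))
```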
Probabilistic Forecasting Instead of placing the stochasticity in the transition model p_θ(z_t | z_{1:t-1}, o_{1:C}), we propose to formulate the emission process p_φ(p_t | z_t, p_{t-1}) as a probabilistic model by predicting both the mean p̂_t and the variance σ̂²_t of each 3D hand trajectory point:
$$[\hat{\mathbf{p}}_t, \hat{\boldsymbol{\alpha}}_t] = \big[\mathrm{MLP}([\mathbf{z}_t; \mathbf{o}^{(\mathcal{T})}_{t-1}]), \ \mathrm{softplus}(\mathrm{MLP}([\mathbf{z}_t; \mathbf{o}^{(\mathcal{T})}_{t-1}]))\big], \qquad (4.7)$$
where the uncertainty is parameterized as α̂_t := log σ̂²_t to enable numerical stability. The trajectory point p_t follows a predictive Gaussian distribution, i.e., p_t ∼ N(p̂_t, σ̂_t). As o^(T)_{t-1} encodes the observation from p_{t-1} and its global historical context, our emission model is thus more powerful in predicting p_t. This predictive mode has also proven successful in traditional methods such as SRNN [75] and VRNN [43].

Discussion Compared to ProTran [305], our formulation individually models the trajectory and visual context p_{1:C} and I_{1:C} in the state transition via modality-specific embeddings and the predictive link from p_{t-1} to p_t, while ProTran only handles the single-modality context p_{1:C} in state transition and emission. In addition, to learn the model parameters θ and φ, ProTran has to use a variational posterior distribution q_θ(z_t | z_{1:t-1}, p_{1:C}) to approximate Eq. (4.2) by ELBO maximization. In contrast, our method does not need such an approximation and formulates the data uncertainty of p_t from a Bayesian perspective (Eq. (4.7)), which is empirically more effective in handling data noise.

4.3.2 Model Training

Depth Robust Aleatoric Uncertainty With the predicted trajectory points p̂_t and the uncertainty α̂, according to [136, 10], training the model essentially amounts to learning the heteroscedastic aleatoric uncertainty (HAU) from the data. As shown in [137, 108], by minimizing the KL divergence between the predictive Gaussian distributions and the Dirac distribution of the ground-truth trajectory, the objective in Eq. (4.1) is equivalent to minimizing the HAU loss at each time t:
$$\mathcal{L}_{\mathrm{haul}}(\hat{\alpha}, \hat{\mathbf{p}}, \mathbf{p}) = e^{-\hat{\alpha}} \sum_{i=1}^{|\mathbf{p}|} \| p_i - \hat{p}_i \|_2 + \hat{\alpha}, \qquad (4.8)$$
where p_i and p̂_i are the 3D coordinate values (x, y, z) of the ground truth and the model prediction, respectively. In our task, the trajectory depth z is more challenging to predict than x and y due to 1) the weak implicit correspondence between the past visual context V_{1:C} and the future hand depth, and 2) more importantly, the inevitable annotation noise from depth sensors. To handle these challenges, we propose to decouple the aleatoric uncertainty into α̂_t, which is isotropic for (x, y), and β̂_t specifically for z. Then, the predictions of (x, y) and z are weighted by factors w_t so that the regression loss becomes
$$\mathcal{L}^{(t)}_{\mathrm{DRAU}}(\hat{\mathbf{y}}, \hat{\boldsymbol{\alpha}}) = \mathcal{L}_{\mathrm{haul}}(\hat{\alpha}_t, \hat{\mathbf{p}}^{(2d)}_t, \mathbf{p}^{(2d)}_t) + w_t \, \mathcal{L}_{\mathrm{haul}}(\hat{\beta}_t, \hat{z}_t, z_t), \qquad (4.9)$$
where the weight w_t is determined by the negative temporal difference of the ground-truth depth z_{1:T} with softmax normalization:
$$w_t = \frac{\exp(-\Delta z_t)}{\sum_{t=1}^{T} \exp(-\Delta z_t)}, \qquad \Delta z_t = |z_t - z_{t-1}|. \qquad (4.10)$$
Since Δz_t indicates the stability of the depth transition, the motivation of w_t is to place larger weights on stable depth transitions (small Δz_t) in a trajectory, which lets the training focus less on unstable depths so that the model is robust to noisy depth annotations.
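A minimal sketch of the depth-robust loss in Eqs. (4.8)–(4.10) is given below. The distance term follows the written form of Eq. (4.8); note that the implementation described in Section 4.4.2 reportedly uses a Huber distance for the location error, and setting Δz_1 = 0 for the first step is an assumption of this sketch.

```python
import torch

def haul_loss(alpha, pred, target):
    """Heteroscedastic aleatoric uncertainty loss of Eq. (4.8), per time step."""
    # alpha: (B, T) predicted log-variance; pred/target: (B, T, D) coordinates
    return torch.exp(-alpha) * (target - pred).abs().sum(dim=-1) + alpha

def drau_loss(pred_xy, pred_z, alpha, beta, gt_xyz):
    """Depth-robust aleatoric uncertainty loss, Eqs. (4.9)-(4.10), averaged over time.

    pred_xy: (B, T, 2), pred_z: (B, T), alpha/beta: (B, T), gt_xyz: (B, T, 3)
    """
    gt_xy, gt_z = gt_xyz[..., :2], gt_xyz[..., 2]
    # w_t from the negative temporal difference of the ground-truth depth (Eq. 4.10)
    dz = torch.zeros_like(gt_z)
    dz[:, 1:] = (gt_z[:, 1:] - gt_z[:, :-1]).abs()
    w = torch.softmax(-dz, dim=1)
    loss_xy = haul_loss(alpha, pred_xy, gt_xy)
    loss_z = haul_loss(beta, pred_z.unsqueeze(-1), gt_z.unsqueeze(-1))
    return (loss_xy + w * loss_z).mean()
```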
Velocity Constraints To explicitly inject the physical rule of hand motion into the model, we additionally take the motion inertia into consideration. Specifically, we leverage the transitioned states {z_1, ..., z_T} learned from Eq. (4.6) to predict the velocities {v_1, ..., v_T} by an MLP. Then, we propose the following velocity constraint in training:
$$\mathcal{L}_{\mathrm{velo}}(\hat{\mathbf{v}}, \mathbf{p}) = \sum_{t=1}^{T} \big\|\mathbf{p}_t - \mathbf{p}_{t-1} - \hat{\mathbf{v}}_t\big\|_2 + \gamma \sum_{t=C+1}^{T} \Big\| \mathbf{p}_C + \sum_{i=C+1}^{t} \hat{\mathbf{v}}_i - \hat{\mathbf{p}}_t \Big\|_2, \qquad (4.11)$$
where the first term uses the first-order difference of the locations p_t to supervise the predicted velocity v̂, and we set p_0 to zero. The second term constrains the future predicted trajectory point p̂_t with a warped point, which is computed by adding the accumulated predicted velocities onto the last observed point p_C, since the time interval is one. Empirically, we found that the velocity constraint enables better generalization to unseen data (see Table 4.3).

Visual Prompt Tuning Visual prompt tuning (VPT) [125] has recently been successful in adapting large visual foundation models to downstream vision tasks. In this paper, we leverage VPT to adapt the pre-trained visual backbone f_V to the trajectory prediction task. The basic idea is to append learnable prompt embeddings P to the input image I and only learn a few MLP head layers h_ψ while keeping the backbone parameters Ψ frozen as Ψ* in training:
$$\mathbf{x}^{(\mathcal{V})}_t = h_\psi\big(f^{\Psi^*}_{\mathcal{V}}(\mathbf{I}_t, \mathbf{P})\big), \qquad (4.12)$$
where only the head parameters ψ and the visual prompt P are learned in training. Since {ψ, P} are much smaller than Ψ, VPT is highly efficient in training. We implemented f_V with both ResNet [106] and ViT [65] without observing a significant performance difference. However, applying VPT achieves better 3D hand trajectory prediction performance than traditional fine-tuning (see Fig. 4.8). This is interesting, as no existing literature explores VPT for vision-based trajectory prediction problems.

Figure 4.4 Annotation Workflow: starting from the RGB-D recording, clips are divided (landmark detection, manual check), 2D trajectories are extracted (RAFT flow warping, trajectory cleaning), local 3D trajectories are recovered (noisy-depth detection, LSF depth repair), and global 3D trajectories are obtained (visual odometry, camera-to-world transform). For more details of the annotation procedure, please refer to our supplementary materials.

4.4 Experiments

4.4.1 Datasets

Since there is no available egocentric 3D hand trajectory dataset, we collect annotations based on two existing datasets, i.e., H2O [162] and EgoPAT3D [181], which contain egocentric RGB-D raw recordings for annotation purposes. The H2O [162] dataset was initially collected for 3D hand pose and interaction recognition using RGB-D data from both egocentric and multiple third-person views. We first use the precisely annotated 3D hand poses to compute the 3D centroids as the ground truth of the 3D hand trajectory, named H2O-PT, which is guaranteed to be of high quality in [162] by multi-view verification. The EgoPAT3D [181] dataset is much larger than H2O. It was initially collected for predicting 3D action targets from egocentric 3D videos. However, it provides neither the 3D hand trajectory nor the 3D hand poses. Thus, similar to [181], we develop an annotation workflow as shown in Fig. 4.4. More details about the annotation workflow are in the supplement. Eventually, we obtained sufficiently large collections of 3D hand trajectories for training and evaluation, named EgoPAT3D-DT. To verify the reliability of the annotation workflow, we also apply it to H2O, resulting in the H2O-DT dataset. Dataset Split The H2O(-PT/DT) dataset consists of 184 untrimmed long videos.
We temporally sample the videos into multiple 64-frame clips with a step-size of 15 frames, resulting in 8203, 1735, and 3715 samples in training, validation, and testing splits, respectively. The EgoPAT3D-DT consists of 14 scenes and we split it into 11 seen scenes containing 8807 samples and 3 unseen testing scenes containing 2334 samples. The unseen scenes are not used in training, and the seen scenes are split into 6356, 846, and 1605 samples for training, validation, and seen testing. Evaluation Setting We use the 3D Average Displacement Error (ADE) and Final Displacement Error (FDE) in meters as the evaluation metrics. The 2D trajectory results are normalized with reference to the video frame size. For all metrics, a small value indicates better performance. Each model is trained with 3D and 2D trajectory targets individually and evaluated by 3D metrics 60 Table 4.1 ADE and FDE results on H2O-PT dataset. All models are built with ResNet-18 backbone. Best and secondary results are viewed in bold black and blue colors, respectively. Model DKF [152] RVAE [168] DSAE [371] STORN [13] VRNN [43] SRNN [75] EgoPAT3D1 [181] AGF1 [380] OCT1 [205] ProTran1 [305] USST 3D(3𝐷) 0.159 0.046 0.051 0.043 0.041 0.040 0.039 0.039 0.252 0.066 0.031 ADE (↓) 2D(3𝐷) 0.186 0.055 0.060 0.053 0.050 0.048 0.046 0.046 0.311 0.088 0.037 2D(2𝐷) 0.211 0.056 0.059 0.053 0.050 0.049 0.048 0.081 0.387 0.109 0.040 3D(3𝐷) 0.137 0.067 0.057 0.094 0.051 0.036 0.034 0.069 0.278 0.099 0.052 FDE (↓) 2D(3𝐷) 0.163 0.081 0.067 0.141 0.081 0.061 0.064 0.065 0.471 0.168 0.043 2D(2𝐷) 0.185 0.037 0.076 0.076 0.068 0.044 0.060 0.146 0.381 0.123 0.043 (3D(3𝐷)) and 2D metrics (2D(2𝐷)), respectively. The 3D model is additionally evaluated by 2D metrics (2D(3𝐷)) by projecting the 3D trajectory outputs to the 2D image plane. 4.4.2 Implementation Detail The proposed method is implemented by PyTorch. In pre-processing, RGB videos are down- scaled to 64 × 64. The 3D global trajectory data are normalized and further centralized to the range [-1,1]. By default, we set the observation ratio to 60%, the feature dimensions of o(V) and o(T ) to 256, and the dimension of z to 16 for all methods. In training, we use Huber loss to compute the location error. We adopt the Adam optimizer with base learning rate 1e-4 and cosine warmup scheduler for 500 training epochs on EgoPAT3D and 350 epochs on H2O datasets, respectively. More implementation details are in the supplement. 4.4.3 Main Results Table 4.1 and 4.2 show the comparison between our method and existing sequential prediction approaches on H2O-PT and EgoPAT3D-DT datasets, respectively. The methods in the first multi- row section are general RNN-based while those in the second multi-row section show recent 1We adapted the task-specific outputs of EgoPAT3D, AGF, OCT, and ProTran to fulfill the 3D hand trajectory forecasting task in this paper. 61 Table 4.2 ADE results on EgoPAT3D-DT dataset. All models are built with ResNet-18 backbone. Best and secondary results are viewed in bold black and blue colors, respectively. 
Model DKF [152] RVAE [168] DSAE [371] STORN [13] VRNN [43] SRNN [75] EgoPAT3D1 [181] AGF1 [380] OCT1 [205] ProTran1 [305] USST Seen Scenes (↓) 2D(3𝐷) 0.237 0.110 0.129 0.092 0.092 0.088 0.081 0.136 0.163 0.179 0.089 2D(2𝐷) 0.157 0.121 0.143 0.083 0.083 0.079 0.079 0.099 0.098 0.135 0.082 3D(3𝐷) 0.294 0.216 0.214 0.194 0.194 0.192 0.186 6.149 0.853 0.314 0.183 Unseen Scenes (↓) 2D(3𝐷) 0.202 0.104 0.116 0.084 0.086 0.081 0.080 0.119 0.139 0.154 0.075 2D(2𝐷) 0.133 0.109 0.131 0.070 0.070 0.067 0.068 0.087 0.091 0.107 0.060 3D(3𝐷) 0.260 0.194 0.188 0.161 0.164 0.166 0.170 6.045 0.782 0.240 0.120 Transformer-based models. We put the FDE results on EgoPAT3D-DT in the supplement. The tables show our method achieves the best ADE performance and comparable FDE results with AGF and SRNN on H2O-PT, and significantly outperforms AGF, OCT, and ProTran on EgoPAT3D-DT. The competitive performance of SRNN is because of its both forward and backward passes over time such that all future positional encodings are utilized for forecasting. We notice that AGF, OCT, and ProTran do not work well on EgoPAT3D-DT, potentially due to the KL divergence vanish issue. The higher performance on the unseen split than the seen split can be attributed to the less distribution shift between unseen test trajectories and the training trajectories. 4.4.4 Model Analysis Ablation Study To validate the effectiveness of the proposed modules and loss functions, we report the results of the ablation study in Table 4.3 on the EgoPAT3D-DT dataset. We first compare the ProTran with the proposed state space transformer (SST), which is a vanilla version of USST without uncertainty modeling, velocity constraint, and VPT. For a fair comparison, we implement a deterministic (det) version of ProTran in addition to the original method that uses stochastic variational inference (svi). Table 4.3 shows a clear advantage of our method over ProTran, which demonstrates the superiority of our SSM for state transition. 62 Table 4.3 Ablation Study. All models are trained with 3D targets and tested with both 3D and 2D ADE. Variants ProTran (svi) [305] ProTran (det) [305] SST (ours) USST w/o. Lhaul USST w/o. T1:𝐶 USST w/o. p𝑡−1 USST w/o. Lvelo USST w/o. 𝑤𝑡 USST (full model) Seen (↓) Unseen (↓) 3D 0.314 0.201 0.190 0.292 0.244 0.196 0.189 0.183 0.183 2D 0.179 0.104 0.088 0.237 0.176 0.090 0.091 0.090 0.089 3D 0.240 0.195 0.174 0.267 0.267 0.169 0.168 0.130 0.120 2D 0.154 0.106 0.084 0.204 0.208 0.098 0.099 0.077 0.075 Figure 4.5 Effect of depth repair. mDE and mDZ are the mean errors of 3D displacement and depth, respectively. Next, in Table 4.3, we individually remove the new components and compare them with the full model of USST, including 1) the uncertainty loss function Lhaul, 2) the trajectory context T1:𝐶, 3) the predictive link in 𝑝𝜙 (p𝑡 |z𝑡, p𝑡−1), 4) the velocity constraint Lvelo, and 5) the depth robust weight 𝑤𝑡. It shows that uncertainty modeling is critical to guarantee reasonable forecasting results. Without historical trajectory T1:𝐶, as expected, the performance degradation is significant. The predictive link from p𝑡−1 to p𝑡 is also important for the forecasting problem, which is consistent with the recent finding in [91]. It is noticeable that the velocity constraint shows a larger performance gain (4.8cm of 3D trajectory) on the unseen test set than on the seen data (0.6cm of 3D trajectory), revealing 63 mDEmDZ04812162024distance error (cm)17.116.113.311.5H2O-DT w/o repairH2O-DT Table 4.4 Annotation Reliability. 
ADE results (3D(3𝐷) / 2D(2𝐷)) are from testing on the same H2O-PT test set. Metrics SRNN USST - T D O 2 H T P - O 2 H ADE FDE ADE FDE 0.087 / 0.076 0.124 / 0.045 0.040 / 0.049 0.036 / 0.044 0.033 / 0.041 0.052 / 0.041 0.031 / 0.040 0.052 / 0.043 Figure 4.6 Arbitrary Observation Ratios. We report the results of 3D ADE (left) and 2D ADE (right) on EgoPAT3D-DT dataset. Figure 4.7 Impact of loss weights. Left: set the weight of Lvelo to 1.0 and tune 𝛾. Right: set 𝛾 to 0.1 and tune the weight of Lvelo. the importance of the physical rule for generalizable trajectory prediction. Lastly, the depth robust weight 𝑤𝑡 (Eq. (4.10)) also boosts the performance of unseen data, showing the importance of modeling the depth noise from annotations. Annotation Reliability By using the accurate H2O-PT as a reference, in Fig. 4.5, we show the effect of repairing depth of the 3D trajectory annotations from raw RGB-D data. We see a clear 64 0.10.20.30.40.50.60.70.80.9Observation Ratios0.0750.1000.1250.1500.1750.2000.2250.2503D ADEseenunseen0.10.20.30.40.50.60.70.80.9Observation Ratios0.020.040.060.080.100.120.140.162D ADEseenunseen0.10.30.50.70.9weight (weight of velo=1.0)0.100.140.180.223D ADEseenunseen0.10.51.01.52.0weight of velo (=0.1)0.100.140.180.223D ADEseenunseen Figure 4.8 Impact of Prompt Length. We report the results of 3D (left) and 2D (right) ADE on EgoPAT3D-DT. The “finetune (unseen)" means finetune model on seen but test on unseen scenes. Microwave (seen) PantryShelf (seen) StoveTop (unseen) Windowsill (unseen) ] 5 7 [ N N R S ] 5 0 3 [ n a r T o r P ) s r u O ( T S S U Figure 4.9 Visualization on EgoPAT3D. For each example (in a column), we show the global 3D trajectory and its 2D projection on the first frame. The blue, green, and red trajectory points represent the past observed, future ground truth, and future predictions, respectively. improvement in mDE and mDZ measurements. In Table 4.4, we further show the performance impact of annotation quality on SRNN and our USST models. It shows that our USST achieves more consistent ADE and FDE results than SRNN over the H2O-PT and H2O-DT. For reference, in the supplement, we additionally report the full results and analysis on H2O-DT and H2O-DT w/o repair. 65 15101520Prompt Length0.150.160.170.180.190.200.210.223D ADEfinetune (seen)vpt (seen)finetune (unseen)vpt (unseen)15101520Prompt Length0.110.120.130.140.150.160.170.182D ADEfinetune (seen)vpt (seen)finetune (unseen)vpt (unseen) Table 4.5 Impact of 3D coordinate systems. “Local" and “global" mean using 3D camera and world coordinates, respectively. 3D Target Backbone Local Global Local Global R18 R18 ViT ViT Seen (↓) Unseen (↓) 3D 0.202 0.183 0.183 0.182 2D 0.083 0.089 0.081 0.087 3D 0.174 0.120 0.133 0.119 2D 0.062 0.075 0.067 0.075 Forecast at Any Time To simulate the real-world practice that forecasting trajectory at an arbitrary time, we take the advantage of the Transformer attention mask to fulfill random observation ratios ranging from 10% to 90%. The results are summarized in Fig. 4.6. It shows that with more percentage of information observed, both the 2D and 3D forecasting error are reduced as expected. It is interesting to see the slight increase of 3D ADE for the seen test data when using more observations. It could be caused by more inaccurate trajectory depth values at the end of trajectories. Loss Weights Fig. 4.7 shows the EgoPAT3D-DT results of tuning the weights in Eq. 
(4.11), where the best performance is achieved when 𝛾 is set to 0.1 and the weight of Lvelo is set to 1.0, respectively. We apply them to H2O-PT by default. Prompt Length of VPT As indicated in VPT literature [125], the length of the visual prompt in ViT models needs to be carefully tuned for downstream tasks. In experiments, based on the SST model, we select the prompt length from {1, 5, 10, 15, 20} and compare their performance with the baseline that fine-tunes the entire ViT backbone. Results are reported in Fig. 4.8. It shows that VPT could steadily achieve lower 2D and 3D ADE than fine-tuning, and the best performance is achieved when the prompt length is 10. Local vs Global 3D Trajectory We note that the ambiguity of learning the appearance-location mapping exists when using local 3D targets. To justify the choice of global 3D trajectory targets, in Table 4.5, we compare the 3D and 2D ADE results using both ResNet-18 and ViT as visual backbones. It clearly shows that for 3D trajectory prediction, a global 3D coordinate system is a 66 better choice, while for 2D trajectory evaluation, the local 3D target is better. These observations are expected as in the local 3D coordinate system, the projected 2D pixel locations of moving hands tend to be in the visual center due to the egocentric view so that the model training is dominated by the 2D hand locations. Qualitative Results As shown in Fig. 4.9, the proposed USST model is compared with the Transformer-based approach ProTran and the most competitive method SRNN. It clearly shows that our trajectory forecasting is much better than the three compared methods. Limitations & Discussions The dataset annotation is limited in scenarios when the RGB-D sensors or camera poses are not available. The model is limited in the recursive way of state transition, which is not hardware-friendly for parallel inference. Besides, in the future, other tasks like the 3D hand pose and interaction recognition can be jointly studied for a fine-grained egocentric understanding. 4.5 Conclusion In this paper, we propose to forecast human hand trajectory in 3D physical space from egocentric videos. For this goal, we first develop a pipeline to automate the 3D trajectory annotation. Then, we propose a novel uncertainty-aware state space transformer (USST) model to fulfill the task. Empirically, with the aleatoric uncertainty modeling, velocity constraint, and visual prompt tuning, our model achieves the best performance on both H2O and EgoPAT3D datasets and good generalization to the unseen scenes. 4.6 Supplementary Material In this section, we provide more details of the data collection and annotation, model implemen- tation, and evaluation results. 67 Figure 4.10 Dataset Examples. For each video (in a row), the global 3D trajectory and the projected 2D trajectory are visualized, where the past and future trajectory segments are in red and blue, respectively. Zoom in for more details. 4.6.1 Details of the Datasets 4.6.1.1 Annotation Workflow Following the similar pipeline in the EgoPAT3D [181], we propose to obtain the 3D hand trajectory annotations based on egocentric RGB-D recordings. In the following paragraphs, we elaborate on each processing step based on RGB-D data from the EgoPAT3D [181] and H2O [162]. Clip Division The EgoPAT3D dataset consists of RGB-D data of hand-object manipulation in 14 indoor scenes. We leverage the provided manual clip divisions and the hand landmarks to obtain more accurate trajectory divisions. 
Specifically, let (𝑠𝑚, 𝑒𝑚) denote the start and end of a manually annotated trajectory, and {𝑡𝑠, . . . , 𝑡𝑒} denote the indices of detected 3D hand landmarks, our trajectory start and end are determined by max(𝑠𝑚, 𝑡𝑠) and min(𝑒𝑚, 𝑡𝑒), respectively. This technique could mitigate the ambiguity of trajectory start and end. Then, we use them to obtain the RGB video clips from the raw recordings. The H2O dataset contains 184 long videos and each video is annotated with 3D poses of the left and right hand as well as the binary validity flag. The trajectory start and end are determined by the validity flag. 68 2D Trajectory For each clip, we found the hand trajectory is not stable if only using the centers of frame-wise hand landmarks as trajectory points. Therefore, for the EgoPAT3D dataset, we propose to leverage the optical flow model RAFT [308] to warp the hand landmark center as the 2D hand trajectory. Specifically, we apply the RAFT to the forward pass starting from the first 2D location p1 of the hand and backward pass starting from the last location p𝑇 of the hand, resulting in the forward trajectory {p( 𝑓 ) }𝑇 𝑡=1. Then, for each frame 𝑡, the ultimate 2D location is determined by a temporally weighted sum ˜p𝑡 = 𝑤𝑡p( 𝑓 ) + (1 − 𝑤𝑡)p(𝑏) 𝑡 where the 𝑡=1 and backward trajectory {p(𝑏) }𝑇 𝑡 𝑡 𝑡 weight 𝑤𝑡 is temporally decreasing from 1.0 to a constant 𝑐 by 𝑤𝑡 = 𝑐 + (1 − 𝑐)/(1 + exp(𝑡 − 𝑇/2)). In practice, we set 𝑐 to 0.3. The rationale of weighing is to mitigate the error accumulation from the RAFT model. It assigns more weight to the earlier locations by forward flow and more weight to the latter locations by backward flow, with the margin 𝑐 between the two passes. Local 3D Trajectory With the 2D hand trajectory, it is straightforward to obtain the 3D hand trajectory by fetching the depth of each trajectory point from the RGB-D clips. However, we noticed that due to the fast motion of the hand and camera, the recorded depth channels in those frames could be missing, i.e., depth values are zeros (see the red dots in Fig. 4.11). To obtain high-quality 3D hand trajectory annotations, we initially attempted to use the state-of-the-art depth estimation model NewCRFs [378] to estimate the missing depths from RGB frames. However, it cannot work well due to the camera motion that results in dynamic scenes in RGB frames. Instead, we found that a simple least-square fitting (LSF) by combining the third-order polynomial and sine functions, i.e., 𝑧𝛼 (𝑡) = 𝛼1𝑡3 + 𝛼2𝑡2 + 𝛼3𝑡 + 𝛼4 + 𝛼5 sin(𝛼6𝑡), could repair the missing depth. For both EgoPAT3D and H2O, we apply the LSF to repair 3D hand trajectory depth. To enable successful depth fitting, we use at least 10 valid trajectory points to fit a multinomial model on each 3D hand trajectory that contains invalid depths. Global 3D Trajectory Note that the 3D trajectory points from the previous step are defined in the local camera coordinate system. When the camera is moving in an egocentric view, using RGB videos to predict the local 3D trajectory will be ambiguous. In other words, distinct visual contents 69 EgoPAT3D H2O Table 4.6 Summary of camera intrinsics resolution focal length principle point resolution focal length principle point 𝐻 = 2160, 𝑊 = 3840 𝑓𝑥 = 1808.203, 𝑜𝑥 = 1942.287, 𝑓𝑦 = 1807.946 𝑜𝑦 = 1123.822 𝐻 = 720, 𝑊 = 1280 𝑓𝑥 = 636.659, 𝑜𝑥 = 635.284, 𝑓𝑦 = 636.252 𝑜𝑦 = 366.874 are forced to learn to predict numerically similar coordinates. 
To eliminate the ambiguity, similar to EgoPAT3D [181], we propose to transform the 3D trajectory targets into a global world coordinate system with reference to the first frame. This is a visual odometry procedure that computes the 3D homogeneous transformation M𝑡 ∈ R4×4 between camera poses at two successive frames 𝑡 − 1 and 𝑡. Eventually, a local 3D trajectory point p𝑙 𝑡 is transformed as a global 3D trajectory point p 𝑔 𝑡 by the accumulative matrix product p 𝑡. In experiments, we use the global 3D trajectory 𝑔 𝑡 = (cid:206)𝑡 𝑘=1 M𝑘 p𝑙 {p 𝑔 𝑡 }𝑇 𝑡=1 as the ground truth for model training, evaluation, and visualization by default. Fig. 4.10 shows three video examples with global 3D trajectory annotations. 4.6.1.2 Camera Intrinsics and Poses For both the EgoPAT3D and H2O, the camera intrinsics are fixed across all samples. Table 4.6 summarizes the camera intrinsics of the dataset we used in this paper. Note that the intrinsics are scaled with the factor 0.25 when we down-scale the RGB videos to the input resolution. For camera poses of EgoPAT3D, we use Open3D [410] library to perform visual odometry1 by using adjacent RGB-D pairs so that the camera motion is obtained. The camera poses of H2O dataset are given for each video frame. 4.6.2 Additional Implementation Details Data Structure To enable efficient parallel training with batches of data input that contain videos of varying lengths, we adopt the mask mechanism in our implementation. Specifically, we set the maximum length of each video to 40 and 64 for EgoPAT3D and H2O, respectively. The lengths of the past observation and future frames are determined by the actual video length. For instance, 1In practice, we followed the EgoPAT3D to use the Open3D API (RGBDOdometryJacobianFromHybridTerm) to compute the 3D camera motion. 70 (a) Bathroom Cabinet (b) Bathroom Counter (c) Bin (d) Kitchen Cupboard (e) Microwave (f) Nightstand Figure 4.11 Examples of comparison between the Least Square Fitting (LSF) and the depth esti- mation model NewCRFs [378] for repairing the noisy depth values from EgoPAT3D RGB-D data. It’s clear that on this video dataset with dynamic background, a simple LSF with a multinomial model could achieve a much better depth repairing effect than the state-of-the-art deep learning model NewCRFs. when the observation ratio is set to 0.6, a sample with 35 frames in total has 21 observed frames, 14 unobserved frames, and 5 zero-padded frames. Since the visual background of RGB videos is relatively clean, we resize videos into the size of 64 × 64 in training and inference. Model Structure For the ResNet-18 backbone, we replace the global pooling layer after the last residual block with torch.flatten, in order to preserve as much visual contextual information as possible. When the visual prompt tuning (VPT) is utilized, the width of the padded learnable pixels is set to 5 as suggested by [125], resulting in 13802 additional parameters to learn. For the ViT backbone, we adopt the vit/b16-224 architecture provided by TIMM, which is pre-trained on the 2For 64 × 64 input, the number of learnable parameters in prompt embeddings is computed by (64 + 5 × 2)2 − 642 = 1380. 
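Returning to the Global 3D Trajectory step above, the following minimal NumPy sketch illustrates the accumulation of per-frame camera motions M_t and their application to the local 3D hand points. The function name and the convention that M_t maps frame t's camera coordinates toward the first frame are assumptions of this sketch; the released annotation code, which relies on Open3D visual odometry, may handle the conventions differently.

```python
import numpy as np

def local_to_global(points_local, motions):
    """Transform per-frame local 3D hand points into the global (first-frame) system.

    points_local: (T, 3) hand point p^l_t in each frame's camera coordinates
    motions:      (T, 4, 4) homogeneous camera motions M_t between successive
                  frames (M_1 is the identity for the first frame)
    Returns (T, 3) global points p^g_t = (M_1 @ ... @ M_t) applied to p^l_t.
    """
    global_points = []
    accum = np.eye(4)
    for p_local, M in zip(points_local, motions):
        accum = accum @ M                 # accumulate camera motion up to frame t
        p_h = np.append(p_local, 1.0)     # homogeneous coordinates
        global_points.append((accum @ p_h)[:3])
    return np.stack(global_points)

# toy usage: identity motions leave the trajectory unchanged
traj = np.random.rand(5, 3)
assert np.allclose(local_to_global(traj, np.tile(np.eye(4), (5, 1, 1))), traj)
```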
71 0102030frame id0.00.20.40.6hand depth (m)Repair by NewCRFsRepair by LSFRGB-D Camera010203040frame id0.00.20.40.6hand depth (m)Repair by NewCRFsRepair by LSFRGB-D Camera02040frame id0.00.20.40.6hand depth (m)Repair by NewCRFsRepair by LSFRGB-D Camera05101520frame id0.00.20.4hand depth (m)Repair by NewCRFsRepair by LSFRGB-D Camera0102030frame id0.00.10.20.30.4hand depth (m)Repair by NewCRFsRepair by LSFRGB-D Camera0102030frame id0.00.20.40.6hand depth (m)Repair by NewCRFsRepair by LSFRGB-D Camera Table 4.7 Results of models training on H2O-DT dataset. We report all results of models trained by annotations from H2O-DT (left) and its version without depth repair (right), and tested on the accurate H2O-PT test set. All models are built with ResNet-18 backbone. Best and secondary results are viewed in bold black and blue colors, respectively. Models ADE (↓) FDE (↓) ADE (↓) FDE (↓) H2O-DT H2O-DT (w/o depth repair) 3D(3𝐷) 2D(3𝐷) 2D(2𝐷) 3D(3𝐷) 2D(3𝐷) 2D(2𝐷) 3D(3𝐷) 2D(3𝐷) 2D(2𝐷) 3D(3𝐷) 2D(3𝐷) 2D(2𝐷) DKF [152] 0.187 0.236 RVAE [168] 0.059 0.125 DSAE [371] 0.077 0.081 STORN [13] 0.121 0.091 VRNN [43] 0.087 0.080 SRNN [75] 0.135 0.087 0.056 AGF [380] 0.108 OCT [205] 0.505 0.360 0.041 ProTran [305] 0.080 0.041 0.033 USST 0.030 0.082 0.059 0.078 0.035 0.072 0.061 0.362 0.031 0.050 0.020 0.047 0.040 0.040 0.039 0.045 0.214 0.520 0.107 0.041 0.138 0.057 0.043 0.245 0.042 0.124 0.171 0.348 0.023 0.052 0.199 0.051 0.072 0.067 0.065 0.055 0.099 0.381 0.070 0.032 0.208 0.059 0.063 0.054 0.054 0.061 0.065 0.403 0.064 0.040 0.186 0.060 0.068 0.061 0.063 0.059 0.075 0.519 0.093 0.041 0.269 0.094 0.078 0.070 0.068 0.076 0.080 0.350 0.099 0.041 0.235 0.209 0.113 0.100 0.092 0.097 0.065 0.473 0.082 0.041 0.181 0.058 0.067 0.135 0.133 0.083 0.186 0.403 0.162 0.053 0.153 0.071 0.047 0.097 0.087 0.089 0.044 0.521 0.146 0.041 ImageNet-21K dataset. For either ResNet-18 or ViT-based frame encoder 𝑓V, the output feature is embedded by a two-layer MLP with 512 and 256 hidden units. For the trajectory encoder 𝑓T , we use a two-layer MLP with 128 and 256 hidden units. For both visual and trajectory transformer encoders, we utilize the standard transformer encoder architecture, which consists of 6 multi-head self-attention blocks where the number of heads is 8 and the MLP ratio is 4. For the decoder, we implement the three prediction branches, i.e., future trajectory prediction, uncertainty prediction, and velocity prediction, using three MLP heads, each of which consists of 128 and 3 hidden units. For trajectory and velocity prediction outputs, we use tanh activation, while for the uncertainty output, we use softplus activation. Besides, for the velocity prediction, layer normalization is applied to each hidden layer. Learning and Inference In training, we set the 𝛿 parameter of Huber loss to 1𝑒 − 5, and set the 𝛾 coefficient of the velocity-based warping loss to 0.1. For the cosine learning rate scheduler, we adopt warm-up training in the first 10 epochs. For 500 training epochs in total, our model training can be completed within 5 hours on a single RTX A6000 GPU. In testing, we evaluate the predicted 3D trajectory in the global coordinate system by referring to the camera at the first time step, while 72 Table 4.8 FDE results of 2D hand trajectory forecasting. Compared models are built with ResNet-18 (R18) backbone. Best and secondary results are in bold black and blue colors, respectively. 
Model DKF [152] RVAE [168] DSAE [371] STORN [13] VRNN [43] SRNN [75] OCT [205] ProTran [305] USST (R18) USST (ViT) Seen (↓) 0.150 0.152 0.144 0.145 0.155 0.157 0.090 0.134 0.075 0.066 Unseen (↓) 0.239 0.201 0.233 0.266 0.237 0.198 0.147 0.049 0.107 0.114 Figure 4.12 Inference speed in milliseconds/video (ms/v) and the number of model parameters in million (M), tested on a single RTX 6000Ada GPU with input video size 64 × 64 × 64. visualizing the 2D trajectory by first projecting the global 3D trajectory into the local 3D trajectory, and then projecting the local 3D coordinates onto a video frame as 2D pixel coordinates. 4.6.3 Additional Evaluation Results Full results on H2O-DT We additionally provide full experimental results by training models on H2O-DT and H2O-DT w/o depth repair in Table 4.7. It shows that our USST method could still achieve the best performance using training data with inaccurate trajectory annotations. FDE results on EgoPAT3D-DT We additionally provide the Final Displacement Error (FDE) results for 2D hand trajectory forecasting as shown in Table 4.8. Our method could achieve the best performance on the seen test data while being competitive on the unseen test data. Besides, ProTran shows the best result on the unseen data, which could be attributed to its extra trajectory supervision from the full observation of the latent Gaussian distributions. 73 AGFOCTProTranUSST025507510012515017568.2124.12.42.8speed (ms/v)01020304050607015.256.632.223.7# params (M) Inference speed In Fig. 4.12, we compared with the Transformer-based methods. It shows the USST achieves competitive speed to ProTran while comparable model size to AGF. With certain improvements, our method could potentially benefit the rendering latency in AR/VR. 74 CHAPTER 5 OPEN-SET ACTION RECOGNITION 5.1 Introduction Video action recognition aims to classify a video that contains a human action into one of the pre-defined action categories (closed set). However, in a real-world scenario, it is essentially an open set problem [291], which requires the classifier to simultaneously recognize actions from known classes and identify actions from unknown ones [278, 86]. In practice, open set recognition (OSR) is more challenging than closed set recognition, while it is important for applications such as face recognition [207], e-commerce product classification [353], autonomous driving [272], and so on. OSR was originally formalized in [278] and many existing approaches have been proposed using image datasets such as MNIST [166] and CIFAR-10 [156]. However, unlike OSR, limited progress has been achieved for open set action recognition (OSAR) which is increasingly valuable in practice. In fact, novel challenges arise in OSAR from the following key aspects. First, the temporal nature of videos may lead to a high diversity of human action patterns. Hence, an OSAR model needs to capture the temporal regularities of closed set actions but also be aware of what it does not know when presented with unknown actions from an open set scenario. Second, the visual appearance of natural videos typically contains static biased cues [183, 42] (e.g., “surfing water" in totally different scenes as shown in Fig. 5.2). Without addressing the temporal dynamics of human actions, the static bias could seriously hamper the capability of an OSAR model to recognize unknown actions from an unbiased open set. Due to these challenges, existing effort on OSAR is quite limited with few exceptions [291, 153, 366]. 
They simply regard each video as a standalone sample and primarily rely on image-based OSR approaches. As a result, they fall short in addressing the inherent video-specific challenges of the open set context outlined above.

This chapter is adapted from the following publication: "Wentao Bao, Qi Yu, and Yu Kong. Evidential deep learning for open set action recognition. In International Conference on Computer Vision (ICCV), Oral, 2021."

Figure 5.1 Open Set Action Recognition Performance. HMDB-51 [159] and MiT-v2 [230] are separately used as small- and large-scale unknown data for models trained on the closed set UCF-101 [296]. Our DEAR method (★) significantly outperforms existing approaches on multiple action recognition models.

In this paper, we propose a Deep Evidential Action Recognition (DEAR) method for the open-set action recognition task. To enable the model to "know the unknown" in an OSAR task, our method formulates it as an uncertainty estimation problem by leveraging evidential deep learning (EDL) [283, 401, 287, 4, 282]. EDL utilizes deep neural networks to predict a Dirichlet distribution of class probabilities, which can be regarded as an evidence-collection process. The learned evidence is informative for quantifying the predictive uncertainty of diverse human actions so that unknown actions incur high uncertainty, i.e., the model knows the unknown. Furthermore, to overcome the potential over-fitting risk of EDL in a closed set, we propose a novel model calibration method to regularize the evidential learning process. Besides, to mitigate the static bias problem for video actions, we propose a plug-and-play module to debias the learned representation through contrastive learning. Benefiting from evidential theory, our DEAR method is practically flexible to implement and provides a principled way to quantify the uncertainty for identifying unknown actions. Experimental results show that the DEAR method boosts the performance of existing powerful action recognition models with both small- and large-scale unknown videos (see Fig. 5.1), while still maintaining high performance in the traditional closed set recognition setting.

Figure 5.2 An Example of Static Bias. (a) Kinetics [27]; (b) Mimetics [339]. To recognize the human action (i.e., "Surfing Water"), a recognition model that is biased to the background of water and sky in the closed set (Kinetics) would be unable to recognize the same action with an indoor scene in the open set (Mimetics as unknown).

Distinct from existing OSR methods [291, 153], the proposed DEAR is the first evidential learning model for large-scale video action recognition. DEAR is superior to existing Bayesian uncertainty-based methods [153] in that model uncertainty can be directly inferred through evidence prediction, which avoids inexact posterior approximation or time-consuming Monte Carlo sampling [4]. Moreover, our proposed model calibration method ensures that DEAR is confident in accurate predictions while being uncertain about inaccurate ones. Compared to [291], which incrementally learns a classifier for unknown classes, our method is more flexible in training without access to unknown actions.
Moreover, our proposed debiasing module could reduce the detrimental static bias of video actions so that the model is robust to out-of-context actions in the open set setting. In summary, the contribution of this paper is threefold: • Our Deep Evidential Action Recognition (DEAR) method performs novel evidential learning to support open-set action recognition with principled and efficient uncertainty evaluation. • The proposed Evidential Uncertainty Calibration (EUC) and Contrastive Evidential Debias- ing (CED) modules effectively mitigate over-confident predictions and static bias problems, respectively. • The DEAR method is extensively validated and consistently boosts the performance of state- of-the-art action recognition models on challenging benchmarks. 77 5.2 Related Work Open Set Recognition. OSR problem originates from face recognition scenario [171] and it is firstly formalized by Scheirer et al. [278]. In [278], to reject the unknown classes, a binary support vector machine (SVM) was introduced by adding an extra hyper-plane for each new class. Based on this work, the Weibull-calibrated SVM (W-SVM) [279] and 𝑃𝐼-SVM [123] are further proposed to calibrate the class confidence scores by leveraging the statistical extreme value theory (EVT). With the recent success of deep learning, deep neural networks (DNN) are widely used in OSR problems. To overcome the drawbacks of softmax in DNN, Bendale et al. [16] proposed OpenMax to upper-bound the open space risk for DNN models. Based on this work, G-OpenMax [83] adopted a generative method to synthesize unknown samples in the training of DNNs. Similarly, recent deep generative adversarial networks (GANs) were used to generate samples of unknown class for OSR task [240, 64]. To reject the unknown, variational auto-encoder (VAE) was recently used to learn the reconstruction error in OSR task [249, 372, 302]. Different from these methods, our method is the first work to introduce evidential deep learning (EDL) for the OSR task and show the advantage over existing approaches. For the open set action recognition (OSAR) problem, it is much more challenging than the OSR problem while only a few existing literature explored it. Shu et al. [291] proposed ODN by incrementally adding new classes to the action recognition head. To capture the uncertainty of unknown classes, Bayesian deep learning is recently introduced to identify the unknown actions in [153, 298, 154]. Busto et al. [21] proposed an open-set domain adaptation method. However, existing methods ignore the importance of uncertainty calibration and static bias of human actions in video data. In a broader context, uncertainty-based OSR is also closely related to out-of- distribution (OOD) [326]. Other less related topics such as anomaly detection [252], generalized zero-shot learning [221], and open world learning [15] are out of the scope in this paper and comprehensively reviewed in [86]. Deep Learning Uncertainty. To distinguish between the unknown and the known samples, an appropriate OOD scoring function is important. A recent line of research works [153, 218, 78 Figure 5.3 Proposed DEAR method. We use 3-class (𝐾 = 3) action recognition (AR) for illustration. On top of the AR backbone, the Evidential Neural Network (ENN) head predicts the evidence e to build the Dirichlet distribution of class probability p. The evidential uncertainty (𝑢) from the Dirichlet is used for rejecting the unknown in open-set testing. 
31, 287, 282] show that the predictive uncertainty learned by deep neural networks (DNN) can be a promising scoring function to identify OOD samples. It is assumed that OOD samples should be highly uncertain during inference. Bayesian neural networks (BNN) has been introduced to model the epistemic and aleatoric uncertainty for multiple computer vision tasks [136, 151, 315]. However, BNN is limited by the intractability of exact posterior inference, the difficulty of choosing suitable weight priors, and the expensive sampling for uncertainty quantification [4]. Recently, evidential deep learning (EDL) has been developed by incorporating the evidential theory into deep neural networks with promising results in both classification [283] and regression [4] tasks. In this paper, to the best of our knowledge, we are the first to incorporate evidential learning for large-scale and uncertainty-aware action recognition. Video Action Recognition. Video action recognition has been widely studied in closed set setting [344, 148, 390]. In this paper, we select several representative and powerful methods, including the 3D convolution method I3D [27], the 2D convolution method TSM [193], the two- stream method SlowFast [74], and the method focusing on neck structure of a recognition model TPN [360]. Note that our method can be easily applied to any existing video action recognition models to enable them for open-set action recognition. 79 VideoTWH⋯AR backboneENN headCED⋮⋮evidence (𝒆)Uncertainty+ 5.3 Approach Overview. The proposed DEAR method is illustrated in Fig. 5.3. Given a video as input, the Evidential Neural Network (ENN) head on top of an Action Recognition (AR) backbone1 predicts the class-wise evidence, which formulates a Dirichlet distribution so that the multi-class probabilities and predictive uncertainty of the input can be determined. For the open set inference, high uncertainty videos can be regarded as unknown actions while low uncertainty videos are classified by the learned categorical probabilities. The model is trained by Evidential Deep Learning (EDL) [283] loss regularized by our proposed Evidential Uncertainty Calibration (EUC) method. In training, we also propose a plug-and-play Contrastive Evidence Debiasing (CED) module to debias the representation of human actions in videos. 5.3.1 Deep Evidential Action Recognition Background of Evidential Deep Learning. Existing deep learning-based models typically use a softmax layer on top of deep neural networks (DNNs) for classification. However, these softmax-based DNNs are not able to estimate the predictive uncertainty for a classification problem because the softmax score is essentially a point estimation of a predictive distribution [78] and the softmax outputs tend to be over-confident in false prediction [99]. Recent evidential deep learning (EDL) [283, 4] was developed to overcome the limitations of softmax-based DNNs by introducing the evidence framework of Dempster-Shafer Theory (DST) [284] and the subjective logic (SL) [130]. EDL provides a principled way to jointly formulate the multi-class classification and uncertainty modeling. 
In particular, given a sample x^(i) for K-class classification, assuming that the class probability follows a prior Dirichlet distribution, the cross-entropy loss to be minimized for learning the evidence e^(i) ∈ R^K_+ eventually reduces to the following form:
$$\mathcal{L}^{(i)}_{EDL}(\mathbf{y}^{(i)}, \mathbf{e}^{(i)}; \theta) = \sum_{k=1}^{K} \mathbf{y}^{(i)}_k \left( \log S^{(i)} - \log(\mathbf{e}^{(i)}_k + 1) \right), \qquad (5.1)$$
where y^(i) is a one-hot K-dimensional label for sample x^(i), and e^(i) can be expressed as e^(i) = g(f(x^(i); θ)). Here, f is the output of a DNN parameterized by θ, and g is the evidence function that keeps the evidence e_k non-negative. S is the total strength of a Dirichlet distribution Dir(p|α), which is parameterized by α ∈ R^K, and S is defined as S = Σ_{k=1}^{K} α_k. Based on DST and SL theory, α_k is linked to the learned evidence e_k by the equality α_k = e_k + 1. In inference, the predicted probability of the k-th class is p̂_k = α_k / S, and the predictive uncertainty u can be deterministically given as u = K / S. More detailed derivations can be found in our supplementary material.

EDL for Action Recognition. In this paper, we propose to formulate action recognition from the EDL perspective. In the training phase, by applying the EDL objective in (5.1) to the action dataset, we are essentially collecting evidence of each action category for an action video. In the testing phase, since the action probability p ∈ R^K is assumed to follow a Dirichlet, i.e., p ∼ Dir(p|α), the categorical probability and uncertainty of a human action can be jointly expressed by a (K−1)-simplex (see the triangular heat map in Fig. 5.3). The EDL uncertainty enables the action recognition model to "know the unknown." (In our experiments, we use four different action recognition models: I3D [27], TSM [193], SlowFast [74], and TPN [360].) However, due to the deterministic nature of EDL, the potential over-fitting issue would hamper the generalization capability needed for good OSAR performance. Besides, the static bias problem in video data is still not addressed by EDL. To this end, we propose a model calibration method and a representation debiasing module below.
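For concreteness, the sketch below implements the EDL objective of Eq. (5.1) and the evidence-to-Dirichlet mapping used for prediction (p̂_k = α_k/S, u = K/S). The softplus evidence function is only one common choice; the chapter merely requires g(·) to keep the evidence non-negative, so this is an assumption of the sketch.

```python
import torch
import torch.nn.functional as F

def edl_loss(logits, labels, num_classes):
    """EDL loss of Eq. (5.1) with a softplus evidence function (one common choice)."""
    evidence = F.softplus(logits)               # e = g(f(x)) >= 0
    alpha = evidence + 1.0                      # Dirichlet parameters alpha_k = e_k + 1
    strength = alpha.sum(dim=-1, keepdim=True)  # S = sum_k alpha_k
    y = F.one_hot(labels, num_classes).float()
    loss = (y * (torch.log(strength) - torch.log(alpha))).sum(dim=-1)
    return loss.mean()

def edl_predict(logits):
    """Class probabilities p_k = alpha_k / S and predictive uncertainty u = K / S."""
    alpha = F.softplus(logits) + 1.0
    strength = alpha.sum(dim=-1, keepdim=True)
    return alpha / strength, alpha.size(-1) / strength.squeeze(-1)

# toy usage: strong evidence for one class -> low uncertainty; flat evidence -> high uncertainty
probs, u = edl_predict(torch.tensor([[8.0, 0.1, 0.1], [0.1, 0.1, 0.1]]))
print(u)
```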
Figure 5.4 Examples of Probability Simplex. We use a 3-class classification as an example and assume the first class is the correct label. (a) AC: α = [10, 1.2, 1.2], u = 0.2; (b) AU: α = [1.8, 1.2, 1.2], u = 0.7; (c) IC: α = [10, 10, 10], u = 0.1; (d) IU: α = [1.2, 1.2, 1.2], u = 0.8. A well-calibrated model should give Accurate and Certain (AC) predictions (Fig. 5.4a) or Inaccurate and Uncertain (IU) predictions (Fig. 5.4d), while the AU (Fig. 5.4b) and IC (Fig. 5.4c) cases need to be reduced.

5.3.2 Evidential Uncertainty Calibration

Though the evidential uncertainty from EDL can be directly learned without sampling, the uncertainty may not be well calibrated to handle the unknown samples in the OSAR setting. As pointed out in the existing model calibration literature [231, 155], a well-calibrated model should be confident in its predictions when it is accurate, and be uncertain about inaccurate ones. Besides, existing DNN models have empirically demonstrated that miscalibration is linked to over-fitting of the negative log-likelihood (NLL) [99, 232]. Since the EDL objective in (5.1) is equivalent to minimizing the NLL [283], the trained model is likely to be over-fitted, with poor generalization for OSAR tasks. To address this issue, we propose to calibrate the EDL model by considering the relationship between accuracy and uncertainty. To this end, we follow the same goal as [231, 155] and maximize the Accuracy versus Uncertainty (AvU) utility function for calibrating the uncertainty:
$$\mathrm{AvU} = \frac{n_{AC} + n_{IU}}{n_{AC} + n_{AU} + n_{IC} + n_{IU}}, \qquad (5.2)$$
where n_AC, n_AU, n_IC, and n_IU represent the numbers of samples in the four predicted cases, i.e., (1) Accurate and Certain (AC), (2) Accurate and Uncertain (AU), (3) Inaccurate and Certain (IC), and (4) Inaccurate and Uncertain (IU). A well-calibrated model should achieve high AvU utility so that the predictive uncertainty is consistent with accuracy. Fig. 5.4 shows a toy example of the four possible EDL outputs. To calibrate the predictive uncertainty, the EDL model is encouraged to learn a skewed and sharp Dirichlet simplex for an accurate prediction (Fig. 5.4a), and to produce an unskewed and flat Dirichlet simplex for an incorrect prediction (Fig. 5.4d). To this end, we propose to regularize EDL training by minimizing the expectations of the AU and IC cases (Fig. 5.4b and Fig. 5.4c) such that the other two cases are encouraged. Therefore, if a video is assigned a high EDL uncertainty, it is more likely to be incorrect, so that an unknown action can be identified. In particular, we propose an Evidential Uncertainty Calibration (EUC) method to minimize the following sum of the AU and IC cases by considering a logarithmic constraint between the confidence p_i and the uncertainty u_i:
$$\mathcal{L}_{EUC} = -\lambda_t \sum_{i \in \{\hat{y}_i = y_i\}} p_i \log(1 - u_i) \;-\; (1 - \lambda_t) \sum_{i \in \{\hat{y}_i \neq y_i\}} (1 - p_i) \log(u_i), \qquad (5.3)$$
where p_i is the maximum class probability of an input sample x^(i) and u_i is the associated evidential uncertainty. The first term aims to give low uncertainty (u_i → 0) when the model makes an accurate prediction (ŷ_i = y_i, p_i → 1), while the second term gives high uncertainty (u_i → 1) when the model makes an inaccurate prediction (ŷ_i ≠ y_i, p_i → 0). Note that the annealing factor λ_t ∈ [λ_0, 1] is defined as λ_t = λ_0 exp{−(ln λ_0 / T) t}. Here, λ_0 is a small positive constant, i.e., λ_0 ≪ 1, such that λ_t is monotonically increasing with respect to the training epoch t, and T is the total number of training epochs. As the training epoch t increases to T, the factor λ_t increases exponentially from λ_0 to 1.0. The motivation behind the annealing weighting is that the dominant periods of accurate and inaccurate predictions during training are different. In the early training stages, inaccurate predictions are the dominant cases, so the IC loss (second term) should be penalized more, while in the late training stages, accurate predictions are dominant, so the AU loss (first term) should be penalized more. Therefore, the annealing weighting factor λ_t dynamically balances the two terms in training.

Discussion. Our EUC method is advantageous over the existing approach [231] and AvUC [155] in the following aspects. First, compared with [231], our EUC method shares the same merit as AvUC in that it is a fully differentiable regularization term. Second, compared with both [231] and AvUC, the EUC loss does not rely on a distribution-shifted validation set during training, since it is not reasonable for an OSAR model to access OOD samples. Therefore, our method provides better flexibility for calibrating deep learning models on large-scale datasets, such as the real-world videos of human actions addressed in this paper. Our experimental results (Table 5.3) show that the model calibration benefit of the EUC method is more significant for open-set recognition than for closed-set recognition.
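A compact sketch of the EUC regularizer in Eq. (5.3), including the annealing factor λ_t, is shown below. The default λ_0 and the numerical-stability epsilon are illustrative choices (the chapter only states λ_0 ≪ 1), and the per-batch summation mirrors the written form of the loss.

```python
import math
import torch

def euc_loss(probs, uncertainty, labels, epoch, total_epochs, lambda0=0.01, eps=1e-10):
    """EUC regularizer of Eq. (5.3); lambda0 and eps are illustrative defaults."""
    # annealing factor: lambda_t = lambda0 * exp(-(ln lambda0 / T) * t), rising from lambda0 to 1
    lam = lambda0 * math.exp(-(math.log(lambda0) / total_epochs) * epoch)
    p_max, y_hat = probs.max(dim=-1)          # confidence and predicted class
    acc = (y_hat == labels)                   # accurate vs. inaccurate predictions
    u = uncertainty.clamp(eps, 1.0 - eps)
    loss_au = -(p_max[acc] * torch.log(1.0 - u[acc])).sum()        # penalize accurate-but-uncertain
    loss_ic = -((1.0 - p_max[~acc]) * torch.log(u[~acc])).sum()    # penalize inaccurate-but-certain
    return lam * loss_au + (1.0 - lam) * loss_ic
```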
The module consists of three branches with similar structures. In contrast to the middle branch, the top and bottom ones aim to learn biased evidence by temporally shuffled feature input and 2D convolution (Conv2D), respectively. The generated feature f is contrastively pushed to be independent of biased feature h. 5.3.3 Contrastive Evidence Debiasing For OSAR task, static bias (see example in Fig. 5.2) in a video dataset is one of the most challenging problems that limit the generalization capability of a model in an open set setting. According to [183], static bias can be categorized into scene bias, object bias, and human bias. Existing research work [42, 183, 139, 6] has empirically shown that debiasing the model by input data or learned representation can significantly improve the action recognition performance. As pointed out in [183], it is intrinsically nothing wrong about the bias if it can be “over-fitted" by an action recognition model for achieving a “good" performance in a traditional closed-set setting. However, in an open set setting, the static bias could result in a vulnerable model that falsely recognizes an action video containing similar static features but out-of-contextual temporal dynamics. In this paper, we propose a Contrastive Evidence Debiasing (CED) module to mitigate the static bias problem. As shown in Fig. 5.5, the CED consists of three branches. The middle branch is a commonly-used 3D convolutional structure (Conv3D) to predict unbiased evidence (e) while the top and bottom branches predict biased evidence (˜e and ¯e). In particular, the top branch keeps the same network structure as the middle one but takes temporally shuffled features (˜x) as input. The bottom branch keeps the same input feature (x) as the middle one but replaces the Conv3D with a 2D convolutional structure (Conv2D). Finally, with the HSIC-based min-max optimization, the 84 videobackboneshufflingǁ𝑒𝑒EDL(෤𝑥)TWH⋯Conv3DConv3DConv2DFCsFCsFCs𝑥ҧ𝑒EDLEDLreplicateunbiasedbiasedmin HSICmax HSIC𝑥(෨ℎ)𝑓ℎ feature f for predicting unbiased evidence is encouraged to be contrastive to the features h and ˜h for predicting biased evidence. In particular, motivated by the recent method ReBias [6], the min-max optimization is defined by using the Hilbert-Schmidt Independence Criterion (HSIC). The HSIC function measures the degree of independence between two continuous random variables. With radial basis function (RBF) kernel 𝑘1 and 𝑘2, HSIC𝑘1,𝑘2 (f, h) = 0 if and only if f ⊥⊥ h. The detailed mathematical form of HSIC can be found in [97, 295] (or see the Section 5.6.1.3 of the supplementary material) . For the middle branch, the goal is to learn a discriminative and unbiased feature f by minimizing L (𝜃 𝑓 , 𝜙 𝑓 ) = L𝐸 𝐷 𝐿 (y, e; 𝜃 𝑓 , 𝜙 𝑓 ) + 𝜆 ∑︁ HSIC(f, h; 𝜃 𝑓 ), (5.4) h∈Ω where 𝜃 𝑓 and 𝜙 𝑓 are parameters of neural networks to produce unbiased feature f and to predict evidence e. y is the multi-class label. The second term encourages feature f to be independent of the biased feature h from the set of features generated by top branch ℎ3𝐷 ( ˜x) and the bottom branch ℎ2𝐷 (x), i.e., Ω = {ℎ3𝐷 ( ˜x), ℎ2𝐷 (x)}. For the top and bottom branches, the goal is to learn the above two types of biased feature h by L (𝜃ℎ, 𝜙ℎ) = ∑︁ h∈Ω {L𝐸 𝐷 𝐿 (y, eℎ; 𝜃ℎ, 𝜙ℎ) − 𝜆HSIC(f, h; 𝜃ℎ)} (5.5) where 𝜃ℎ denotes the network parameters of ℎ3𝐷 ( ˜x) and ℎ2𝐷 (x) to generate biased features h, and the 𝜙ℎ denotes the parameters of neural networks to predict corresponding evidence eℎ ∈ {ˆe, ¯e}. 
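Before turning to the role of each term in (5.4) and (5.5), the following is a minimal NumPy sketch of the HSIC measure that both objectives rely on. It uses the simple biased estimator with RBF kernels for illustration only; the unbiased estimator actually used in our implementation is given in Section 5.6.1.3, and the function names here are illustrative.

```python
import numpy as np

def rbf_kernel(x, sigma=1.0):
    # Pairwise RBF (Gaussian) kernel matrix for the rows of x.
    sq = np.sum(x**2, 1)[:, None] + np.sum(x**2, 1)[None, :] - 2 * x @ x.T
    return np.exp(-sq / (2 * sigma**2))

def hsic_biased(f, h, sigma=1.0):
    # Biased HSIC estimator: tr(K H L H) / (m - 1)^2, with centering matrix H.
    # Values near zero suggest the two feature sets are (close to) independent.
    m = f.shape[0]
    K, L = rbf_kernel(f, sigma), rbf_kernel(h, sigma)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

# Toy check: independent features give a small HSIC, correlated features a larger one.
rng = np.random.default_rng(0)
f = rng.normal(size=(64, 16))
h_indep = rng.normal(size=(64, 16))
h_dep = f + 0.1 * rng.normal(size=(64, 16))
print(hsic_biased(f, h_indep), hsic_biased(f, h_dep))
```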
The first term in (5.5) aims to avoid the biased feature h to predict arbitrary evidence, while the second term guarantees that h is similar enough to f so that f has to be pushed faraway from h by (5.4). The two objectives in (5.4) and (5.5) are alternatively optimized so that feature h is learned to be biased to guide the debiasing of feature f. In practice, we also implemented a joint training strategy that aims to optimize the objective of (5.4) and (5.5) jointly and we empirically found it can achieve better performance. Discussion. Compared with recent work [42] that leverages adversarial learning to remove scene bias, our method does not rely on object bounding boxes and pseudo-scene labels as auxiliary 85 training input. The representation bias addressed in our paper implicitly encompasses all sources of biases, not just the scene bias. Compared with ReBias [6], our CED module shares a similar idea of removing bias with bias. However, the HSIC in our CED module considers not only the bias-characterizing model (i.e., ℎ2𝐷 (x)) as in [6], but also the biased feature input by temporal shuffling. This consideration will further encourage the backbone to focus more on temporal dynamics. Besides, our CED is a plug-and-play module and can be flexibly inserted into any state-of-the-art deep learning-based action recognition model with little coding effort. 5.4 Experiments Dataset. We evaluate the proposed DEAR method on three commonly used real-world video action datasets, including UCF-101 [296], HMDB-51 [159], and MiT-v2 [230]. All models are trained on UCF-101 training split. MiT-v2 has 305 classes and its testing split contains 30,500 video samples, which are about 20 times larger than the HMDB-51 testing set. In testing, we use the UCF-101 testing set as known samples, and the testing splits of HMDB-51 and MiT-v2 datasets as two sources of unknown. Note that there could be a few overlapping classes between UCF-101 and the other two datasets, but for standardizing the evaluation and reproducibility, we do not manually clean the data. Evaluation Protocol. To evaluate the classification performance on both closed and open set settings, we separately report the Closed Set Accuracy for 𝐾-class classification and the Open Set area under ROC curve (AUC) for distinguishing known and unknown (2 classes). Furthermore, to comprehensively evaluate the (𝐾 + 1)-class classification performance, i.e., the unknown as the (𝐾 + 1)-th class, we plot the curve of macro-F1 scores by gradually increasing the openness similar to existing literature [291, 372, 302]. For each openness point, 𝑖 new classes are randomly selected from HMDB-51 (where 𝑖 ≤ 51) or MiT-v2 (where 𝑖 ≤ 305) test set and we compute the macro-F1 score for each of 10 randomized selections. Since there is no existing quantitative metric to summarize the performance of the F1 curve, in this paper we propose an Open maF1 score: Open maF1 = (cid:205)𝑖 𝜔(𝑖) 𝑂 · 𝐹 (𝑖) 1 (cid:205)𝑖 𝜔(𝑖) 𝑂 86 (5.6) Table 5.1 Comparison with state-of-the-art methods. Models are trained on the closed set UCF-101 [296] and tested on two different open sets where the samples of unknown class are from HMDB-51 [159] and MiT-v2 [230], respectively. For Open maF1 scores, both the mean and standard deviation of 10 random trials of unknown class selection are reported. Closed set accuracy (CS-Acc) is for reference only. 
Models OSAR HMDB-51 [159] MiT-v2 [230] Open maF1 (%)OS-AUC (%)Open maF1 (%)OS-AUC (%) CS-Acc (%) I3D [27] TSM [193] SlowFast [74] TPN [360] 67.85 ± 0.12 OpenMax [16] 71.13 ± 0.15 MC Dropout BNN SVI [153] 71.57 ± 0.17 73.19 ± 0.17 SoftMax 71.48 ± 0.15 RPL [34] 77.24 ± 0.18 DEAR (ours) 74.17 ± 0.17 OpenMax [16] 71.52 ± 0.18 MC Dropout BNN SVI [153] 69.11 ± 0.16 78.27 ± 0.20 SoftMax 69.34 ± 0.17 RPL [34] 84.69 ± 0.20 DEAR (ours) 73.57 ± 0.10 OpenMax [16] 70.55 ± 0.14 MC Dropout BNN SVI [153] 69.19 ± 0.13 78.04 ± 0.16 SoftMax 68.32 ± 0.13 RPL [34] 85.48 ± 0.19 DEAR (ours) 65.27 ± 0.09 OpenMax [16] 68.45 ± 0.12 MC Dropout BNN SVI [153] 63.81 ± 0.11 76.23 ± 0.14 SoftMax 70.31 ± 0.13 RPL [34] 81.79 ± 0.15 DEAR (ours) 74.34 75.07 74.66 75.68 75.20 77.08 77.07 73.85 73.42 77.99 73.62 78.65 78.76 75.41 74.78 79.16 74.23 82.94 74.12 74.13 72.68 77.97 75.32 79.23 66.22 ± 0.16 68.11 ± 0.20 68.65 ± 0.21 68.84 ± 0.23 68.11 ± 0.20 69.98 ± 0.23 71.81 ± 0.20 65.32 ± 0.25 64.28 ± 0.23 71.68 ± 0.27 63.92 ± 0.25 70.15 ± 0.30 72.48 ± 0.12 67.53 ± 0.17 65.22 ± 0.21 74.42 ± 0.22 66.33 ± 0.17 77.28 ± 0.26 64.80 ± 0.10 65.77 ± 0.17 61.40 ± 0.15 70.82 ± 0.21 66.21 ± 0.21 71.18 ± 0.23 77.76 79.14 79.50 79.94 79.16 81.54 83.05 78.35 77.39 82.38 77.28 83.92 80.62 78.49 77.39 82.88 77.42 86.99 76.26 77.76 75.32 81.35 78.21 81.80 56.60 94.11 93.89 94.11 94.26 93.89 65.48 95.06 94.71 95.03 95.59 94.48 62.09 96.75 96.43 96.70 96.93 96.48 53.24 95.43 94.61 95.51 95.48 96.30 𝑂 denotes the openness when 𝑖 new classes are introduced and it is defined as 𝜔(𝑖) 𝑂 = where 𝜔(𝑖) 1−√︁2𝐾/(2𝐾 + 𝑖) according to [278]. 𝐹 (𝑖) 1 is the macro-F1 score by considering the samples from all new classes as unknown. The basic idea of weighting 𝐹1 by 𝜔𝑂 is that the result is essentially the normalized area under the curve of macro-F1 vs. openness. The Open maF1 quantitatively evaluates the performance of (𝐾 + 1)-class classification in an open set setting. 87 (a) HMDB-51 as Unknown (b) MiT-v2 as Unknown Figure 5.6 Open macro-F1 scores against varying Openness. The maximum openness is deter- mined by the number of unknown classes, i.e., in 𝜔(𝑖) 𝑂 , 𝑖 = 51 for HMDB-51 and 𝑖 = 305 for MiT-v2. Implementation Details. Our method is implemented with the PyTorch codebase MMAc- tion2 [45]. The adopted action models are experimented with ResNet-50 backbone pre-trained on Kinetics-400 [27] dataset and fine-tuned on UCF-101 training set. Our proposed EDL loss L𝐸 𝐷 𝐿 is used to replace the original cross-entropy loss, and our proposed CED module is inserted into the layer before the classification heads of recognition models. During training, we use a base learning rate of 0.001 and it is step-wisely decayed for every 20 epochs with a total of 50 epochs. We set the batch size as 8 during training. The rest of the hyperparameters are kept the same as the default configuration provided by MMAction2. During inference, our CED module is removed. Other implementation details are provided in the supplementary material. 5.4.1 Comparison with State-of-the-art The proposed DEAR method is compared with baselines as shown in the second column of Table 5.1. The open set performances are also summarized in Fig. 5.1. For these baselines, SoftMax, OpenMax, and MC Dropout share the same trained model since they are only different in the testing phase. For the MC Dropout and BNN SVI which incorporate stochastic sampling in testing, we set the 10 forward passes through the model and adopt the BALD [114] method to quantify the model uncertainty as suggested by [153]. 
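As a concrete reference for the Open maF1 metric in Eq. (5.6), the following sketch computes the score from per-openness macro-F1 values using the openness weighting 𝜔(𝑖)_𝑂 = 1 − √(2𝐾/(2𝐾 + 𝑖)). The function names and the toy numbers are illustrative only and do not reproduce our evaluation code.

```python
import numpy as np

def openness(K, i):
    # Openness when i unknown classes are added to K known classes.
    return 1.0 - np.sqrt(2.0 * K / (2.0 * K + np.asarray(i, dtype=float)))

def open_maf1(macro_f1_per_openness, K):
    # macro_f1_per_openness: dict {i: macro-F1 with i unknown classes as the (K+1)-th class}.
    idx = np.array(sorted(macro_f1_per_openness))
    w = openness(K, idx)                           # weights omega_O^(i)
    f1 = np.array([macro_f1_per_openness[i] for i in idx])
    return float((w * f1).sum() / w.sum())         # normalized area under the F1-vs-openness curve

# Toy example with K = 101 known classes and a few openness points.
scores = {5: 0.78, 15: 0.74, 30: 0.71, 51: 0.69}
print(open_maf1(scores, K=101))
```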
Following [302], the threshold of the scoring function is determined by ensuring 95% training data to be recognized as known. Open Set Action Recognition. In Table 5.1, we report the results of both closed-set and open 88 0246810Openness (%)6065707580Open maF1 (%)DEAR (full)SoftMaxRPLBNN SVIMC DropoutOpenMax05101520253035Openness (%)6065707580Open maF1 (%)DEAR (full)SoftMaxRPLBNN SVIMC DropoutOpenMax set performance. It shows that with different action recognition models, our method consistently and significantly outperforms baselines on Open maF1 score for (𝐾+1)-class classification and Open Set AUC score for rejecting the unknowns, while only sacrificing less than 1% performance decrease on Closed Set Accuracy. When equipped with the SlowFast model, our method could improve the MC Dropout method by almost 8% of open set AUC and 15% of Open maF1 score. OpenMax and RPL are the recent state-of-the-art OSR methods, however, we find that their performances are far behind our DEAR method on the OSAR task. Note that the closed set accuracy of OpenMax is dramatically lower than other baselines, this is because OpenMax directly modifies the activation layer before softmax and appends the unknown class as output, which could destroy the accurate predictions of known samples. Besides, we also note that with TSM model, the Open maF1 score of DEAR method is slightly inferior to OpenMax on the MiT-v2 dataset. This indicates that for large-scale unknown testing data such as MiT-v2, the 2D convolution-based TSM is not a good choice for the DEAR method as compared to those 3D convolution-based architectures such as I3D, SlowFast, and TPN. Based on I3D model, as depicted in Fig. 5.6, we plot the average Open maF1 scores against varying openness by incrementally introducing HMDB-51 and MiT-v2 testing sets as unknown. It clearly shows that the proposed DEAR method achieves the best performance. Note that for the large-scale MiT-v2 dataset, as the openness increases, the performances of different methods converge to be close to each other. This is because the macro-F1 is sensitive to class imbalance and it will be gradually dominated by the increasing unknown classes from a total of 305 categories in MiT-v2. Nevertheless, our method DEAR still keeps better than all other baselines. Out-of-distribution Detection. This task aims to distinguish between the in-distribution samples (known) and out-of-distribution (OOD) samples (unknown). Similar to the baseline MC Dropout and BNN SVI [153], which use uncertainty as a scoring function to identify the unknown, the OOD detection performance can be evaluated by showing the Open Set AUC in Table 5.1 and the histogram statistics in Fig. 5.7. The AUC numbers and figures clearly show that our DEAR method with EDL uncertainty can better detect the OOD samples. Compared with the vanilla 89 (a) MC Dropout (b) BNN SVI [153] (c) DEAR (vanilla) (d) DEAR (full) Figure 5.7 Out-of-distribution Detection by Uncertainty. The DEAR (vanilla) is the variant of DEAR (full) that only L𝐸 𝐷 𝐿 is used for model training. We use MiT-v2 as unknown and I3D as the recognition model. Uncertainty values are normalized to [0,1] within each distribution. DEAR which only uses L𝐸 𝐷 𝐿 for model training, the estimated uncertainties of OOD samples skew closer to 1.0. 5.4.2 Ablation Study Contribution of Each Component. In Table 5.2, it shows the OSAR performance of each DEAR variant. The experiments are conducted with the TPN model and evaluated using the HMDB-51 testing set as unknown. 
The results demonstrate that all the proposed components could contribute to the OSAR performance gain. In particular, the ℎ2𝐷 (x) of our CED module contributes the most. Besides, the joint training of CED module shows slightly better than the alternative training. Therefore, by default joint training is adopted throughout other experiments. Model Calibration. Though the proposed EUC module can improve the performance on 90 0.00.20.40.60.81.0BALD uncertainty0246810DensityAUC = 79.14in-distribution (UCF-101)out-of-distribution (MiT-v2)0.00.20.40.60.81.0BALD uncertainty0246810DensityAUC = 79.50in-distribution (UCF-101)out-of-distribution (MiT-v2)0.00.20.40.60.81.0EDL uncertainty0246810DensityAUC = 81.43in-distribution (UCF-101)out-of-distribution (MiT-v2)0.00.20.40.60.81.0EDL uncertainty0246810DensityAUC = 81.54in-distribution (UCF-101)out-of-distribution (MiT-v2) Table 5.2 Ablation studies. Based on TPN [360] model, HMDB-51 [159] is used as the unknown. The best results are shown in bold. L𝐸𝑈𝐶 CED Joint Train Open maF1 (%) OS-AUC (%) ✗ ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✓ 74.95 ± 0.18 75.88 ± 0.16 81.18 ± 0.15 81.79 ± 0.15 77.12 77.49 79.02 79.23 Table 5.3 Expected Calibration Error (ECE) results. Small ECE indicates the model is better calibrated. The numbers in brackets indicate the number of classes involved in the evaluation. Model variants Open Set (K+1) Open Set (2) Closed Set (K) DEAR (w/o L𝐸𝑈𝐶) DEAR (full) 0.284 0.268 0.256 0.239 0.030 0.029 Table 5.4 Accuracy (%) on Biased and Unbiased dataset. Methods DEAR (w/o CED) DEAR (full) Biased (Kinetics) top-1 91.18 91.18 top-5 99.30 99.54 Unbiased (Mimetics) top-5 top-1 26.56 34.38 69.53 75.00 OSAR tasks (as shown in Table 5.2), we further dig into the question of if the performance gain of EUC results from better calibrating a classification model. To this end, we adopt the widely used Expected Calibration Error (ECE) [99] to evaluate the model calibration performance of our full method DEAR (full) and its variant without EUC loss L𝐸𝑈𝐶. Quantitative results are reported in Table 5.3. It shows that L𝐸𝑈𝐶 can reduce the ECE values with both open-set and closed-set recognition settings. In particular, the calibration capability is more significant in an open set setting than in a closed set setting. This validates our claim that the proposed L𝐸𝑈𝐶 could calibrate an OSAR model. Representation Debiasing. To further validate if the performance gain of our CED module is rooted in the representation debiasing, we use Kinetics [27] as a biased dataset and Mimetics [339] as an unbiased dataset. Similar to [6], we select 10 human action categories from Kinetics for training and biased testing, and select the same categories from Mimetics for unbiased testing. 91 Figure 5.8 Confusion Matrix for Known and Unknown. The 𝑥-axis shows the ground truth classes of both UCF-101 (known) and HMD-51 (unknown), and the 𝑦-axis represents the predicted classes defined by UCF-101. This figure highlights the top 5 unknown classes (blue text) that are misclassified as the known (red text). Without the pre-trained model from the Kinetics dataset, we apply our DEAR method with and without CED on the TSM model. The top-1 and top-5 accuracy results are reported in Table 5.4. It shows that models trained on biased datasets (Kinetics) are vulnerable to unbiased datasets (Mimetics). However, when equipped with the proposed CED module, the performance on the unbiased dataset can be significantly improved while performance on the biased dataset still keeps minor changes. 
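For reference, the binned ECE reported in Table 5.3 can be computed as in the following sketch of Eq. (5.19), detailed in Section 5.6.1.4. The function name and the number of bins are illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    # confidences: max predicted probability per sample; correct: 1 if the prediction is right.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()           # acc(B_m)
            conf = confidences[in_bin].mean()      # conf(B_m)
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece

# Toy example: an over-confident model yields a non-trivial ECE.
conf = np.array([0.99, 0.95, 0.90, 0.85, 0.80])
hit = np.array([1, 0, 1, 0, 1])
print(expected_calibration_error(conf, hit, n_bins=10))
```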
What Types of Unknown are Mis-classified? As shown in Fig. 5.8, the confusion matrix is visualized by considering both the known classes from UCF-101 and unknown classes from HMDB-51 datasets. It shows that in spite of high closed set accuracy (the diagonal line), the actions from unknown classes could be easily classified as known categories. For example, shoot ball is the top-1 mis-classified unknown class in HMDB-51, which is the most frequently mis- classified as the known class Archery in UCF-101. It is convincing that the misclassification is caused by their similar background scene, i.e., a large area of grassland, which is static bias as addressed in this paper. 92 020406080100120140UCF-101 (known) + HMDB-51 (unknown)020406080100Predicted Classesshoot_ballArcherypourPullUpspullupBoxingPunchingBagpushPushUpsflic_flacGolfSwing 5.5 Conclusion In this paper, we proposed a Deep Evidential Action Recognition (DEAR) method for the open set action recognition (OSAR) problem. OSAR is more challenging than image OSR problems due to the uncertain nature of temporal action dynamics and the static bias of background scenes. To this end, we conduct Evidential Deep Learning (EDL) to learn a discriminative action classifier with quantified predictive uncertainty, where the uncertainty is used to distinguish between the known and unknown samples. As novel extensions of EDL, an Evidential Uncertainty Calibration (EUC) method and a contrastive evidential debiasing (CED) module are proposed to address the unique challenges in OSAR. Extensive experimental results demonstrate that our DEAR method works for most existing action recognition models in an open set setting. 5.6 Supplementary Material In this document, additional materials are provided to supplement our main paper. In sec- tion 5.6.1, the preliminary knowledge about the evidential deep learning and model calibration are described in detail, which are helpful to understand the methodology of our main paper. In section 5.6.2, additional implementation details are provided, which are useful to reproduce our proposed method. 5.6.1 Detailed Methodology 5.6.1.1 Preliminaries of Evidential Deep Learning Existing video action recognition models typically use softmax on top of deep neural networks (DNN) for classification. However, the softmax function is heavily limited in the following aspects. First, the predicted categorical probabilities have been squashed by the denominator of softmax. This is known to result in an over-confident prediction for the unknown data, which is even more detrimental to open set recognition problem than the closed set recognition. Second, the softmax output is essentially a point estimate of the multinomial distribution over the categorical probabilities so that softmax cannot capture the uncertainty of categorical probabilities, i.e., second- order uncertainty. To overcome these limitations, recent evidential deep learning (EDL) [283] is developed from the 93 evidence framework of Dempster-Shafer Theory (DST) [284] and the subjective logic (SL) [130]. For a 𝐾-class classification problem, the EDL treats the input x as a proposition and regards the classification task as to give a multinomial subjective opinion in a 𝐾-dimensional domain {1, . . . , 𝐾 }. The subjective opinion is expressed as a triplet 𝜔 = (b, 𝑢, a), where b = {𝑏1, . . . , 𝑏𝐾 } is the belief mass, 𝑢 represents the uncertainty, and a = {𝑎1, . . . , 𝑎𝐾 } is the base rate distribution. For any 𝑘 ∈ [1, 2, . . . 
, 𝐾], the probability mass of a multinomial opinion is defined as 𝑝𝑘 = 𝑏𝑘 + 𝑎𝑘𝑢, ∀𝑦 ∈ Y (5.7) To enable the probability meaning of 𝑝𝑘 , i.e., (cid:205)𝑘 𝑝𝑘 = 1, the base rate 𝑎𝑘 is typically set to 1/𝐾 and the subjective opinion is constrained by 𝑢 + 𝐾 ∑︁ 𝑘=1 𝑏𝑘 = 1 (5.8) Besides, for a 𝐾-class setting, the probability mass p = [ 𝑝1, 𝑝2, . . . , 𝑝𝐾] is assumed to fol- low a Dirichlet distribution parameterised by a 𝐾-dimensional Dirichlet strength vector 𝜶 = {𝛼1, . . . , 𝛼𝐾 }: Dir(p|𝜶) = 1 𝐵(𝜶) 𝐾 (cid:214) 𝑘=1 0, 𝑝𝛼𝑘−1 𝑘 , for p ∈ S𝐾, otherwise, (5.9)    where 𝐵(𝜶) is a 𝐾-dimensional Beta function, S𝐾 is a 𝐾-dimensional unit simplex. The total strength of the Dirichlet is defined as 𝑆 = (cid:205)𝐾 𝛼𝑘 . Note that for the special case when 𝐾 = 2, 𝑘=1 the Dirichlet distribution reduces to a Beta distribution and a binomial subjective opinion will be formulated in this case. According to the evidence theory, the term evidence is introduced to describe the amount of supporting observations for classifying the data x into a class. Let e = {𝑒1, . . . , 𝑒𝐾 } be the evidence for 𝐾 classes. Each entry 𝑒𝑘 ≥ 0 and the Dirichlet strength 𝜶 are linked according to the evidence theory by the following identity: 𝜶 = e + a𝑊 (5.10) 94 where 𝑊 is the weight of uncertain evidence. With the Dirichlet assumption, the expectation of the multinomial probability p is given by E( 𝑝𝑘 ) = 𝛼𝑘 (cid:205)𝐾 𝑘=1 𝛼𝑘 = 𝑒𝑘 + 𝑎𝑘𝑊 𝑊 + (cid:205)𝐾 𝑘=1 𝑒𝑘 (5.11) With loss of generality, the weight 𝑊 is set to 𝐾 and considering the assumption of the subjective opinion constraint in Eq. (5.8) that 𝑎𝑘 = 1/𝐾, we have the Dirichlet strength 𝛼𝑘 = 𝑒𝑘 + 1 according to Eq. (5.10). In this way, the Dirichlet evidence can be mapped to the subjective opinion by setting the following equality’s: 𝑏𝑘 = 𝑒𝑘 𝑆 and 𝑢 = 𝐾 𝑆 (5.12) Therefore, we can see that if the evidence 𝑒𝑘 for the 𝑘-th class is predicted, the corresponding expected class probability in Eq. (5.7) (or Eq. (5.11)) can be rewritten as 𝑝𝑘 = 𝛼𝑘 /𝑆. From Eq. (5.12), it is clear that the predictive uncertainty 𝑢 can be determined after 𝛼𝑘 is obtained. Inspired by this idea, the EDL leverages deep neural networks (DNN) to directly predict the evidence e from the given data x for a 𝐾-class classification problem. In particular, the output of the DNN is activated by a non-negative evidence function. Considering the Dirichlet prior, the DNN is trained by minimizing the negative log-likelihood: L (𝑖) 𝐸 𝐷 𝐿 (y, e; 𝜃) = − log (cid:32)∫ 𝐾 (cid:214) 𝑘=1 𝑝𝑦𝑖𝑘 𝑖𝑘 1 𝐵(𝜶𝑖) 𝐾 (cid:214) 𝑘=1 (cid:33) 𝑝𝛼𝑖𝑘−1 𝑖𝑘 𝑑p𝑖 = 𝐾 ∑︁ 𝑘=1 𝑦𝑖𝑘 (log(𝑆𝑖) − log(𝑒𝑖𝑘 + 1)) (5.13) where y𝑖 = {𝑦𝑖1, . . . , 𝑦𝑖𝐾 } is an one-hot 𝐾-dimensional label for sample 𝑖 and e𝑖 can be expressed as e𝑖 = 𝑔 ( 𝑓 (x𝑖; 𝜃)). Here, 𝑓 is the DNN parameterized by 𝜃 and 𝑔 is the evidence function such as exp, softplus, or ReLU. Note that in [283], there are two other forms of EDL loss function. In our main paper, we found the Eq. (5.13) achieves better training empirical performance. 5.6.1.2 EDL for Open Set Action Recognition To implement the EDL method on video action recognition tasks, we removed the Kull- back–Leibler (KL) divergence regularizer term defined in [283], because the digamma function involved in the KL divergence is not numerically stable for large-scale video data. Instead, to 95 compensate for the over-fitting risk, we propose the Evidential Uncertainty Calibration (EUC) as a new regularization. 
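To make these pieces concrete, the following PyTorch sketch implements the EDL negative log-likelihood of Eq. (5.13) with an exp evidence function, the annealing factor 𝜆𝑡, and the EUC regularizer of Eq. (5.3). It is a simplified illustration, assuming batch means rather than the exact sums of Eq. (5.3) and adding a logit clamp for numerical stability; the function names are illustrative.

```python
import math
import torch
import torch.nn.functional as F

def edl_nll_loss(logits, labels, num_classes):
    # Eq. (5.13): evidence e = exp(logits), alpha = e + 1, S = sum_k alpha_k.
    evidence = torch.exp(torch.clamp(logits, max=10))     # clamp added for stability
    alpha = evidence + 1.0
    S = alpha.sum(dim=1, keepdim=True)
    y = F.one_hot(labels, num_classes).float()
    loss = (y * (torch.log(S) - torch.log(alpha))).sum(dim=1)
    return loss.mean(), alpha, S

def euc_loss(alpha, S, labels, lambda_t, num_classes):
    # Eq. (5.3): penalize Accurate-but-Uncertain and Inaccurate-but-Certain predictions.
    prob = alpha / S                                       # expected class probabilities
    p_max, y_hat = prob.max(dim=1)
    u = num_classes / S.squeeze(1)                         # evidential uncertainty u = K / S
    acc = (y_hat == labels).float()
    term_au = -p_max * torch.log(1.0 - u + 1e-8)           # accurate cases: push u -> 0
    term_ic = -(1.0 - p_max) * torch.log(u + 1e-8)         # inaccurate cases: push u -> 1
    return (lambda_t * acc * term_au + (1.0 - lambda_t) * (1.0 - acc) * term_ic).mean()

def annealing_lambda(t, T, lam0=0.01):
    # lambda_t = lambda_0 * exp(-(ln lambda_0 / T) * t): grows from lambda_0 at t=0 to 1.0 at t=T.
    return lam0 * math.exp(-(math.log(lam0) / T) * t)

# Toy usage with K = 4 classes.
K = 4
logits = torch.randn(8, K)
labels = torch.randint(0, K, (8,))
nll, alpha, S = edl_nll_loss(logits, labels, K)
reg = euc_loss(alpha, S, labels, lambda_t=annealing_lambda(t=5, T=50), num_classes=K)
print(float(nll), float(reg))
```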
Together with the Contrastive Evidence Debiasing module, the complete training objective of our DEAR method can be expressed as L = ∑︁ 𝑖 L (𝑖) 𝐸 𝐷 𝐿 + 𝑤1L𝐸𝑈𝐶 + 𝑤2L𝐶𝐸 𝐷 (5.14) where L𝐸𝑈𝐶 is defined in Eq. (5.3) in our main paper, and L𝐶𝐸 𝐷 is the sum of (or one of for alternative training) L (𝜃 𝑓 , 𝜙 𝑓 ) and L (𝜃ℎ, 𝜙ℎ) defined in Eq. (5.4) and Eq. (5.5) respectively in our main paper. The hyperparameters 𝑤1 and 𝑤2 are set to 1.0 and 0.1, respectively. During the training process, the DEAR model aims to accurately construct the Dirichlet param- eters 𝜶 by collecting the evidence from human action video training set. In the inference phase, the probability of each action class is predicted as ˆ𝑝𝑘 = 𝛼𝑘 /𝑆 while the predictive uncertainty is simultaneously computed as 𝑢 = 𝐾/𝑆. If an input action video is assigned with high uncertainty, which means a vacuity of evidence to support for closed-set classification, the action is likely to be unknown from the open testing set. Compared with existing DNN-based uncertainty estimation method such as Bayesian neural networks (BNN) or deep Gaussian process (DGP), the advantage of EDL is that the predictive un- certainty is deterministically learned without inexact posterior approximation and computationally expensive sampling. These merits enable the EDL method to be efficient for training recognition models from large-scale vision data such as the human action videos. 5.6.1.3 Hilbert-Schmidt Independence Criterion Hilbert-Schmidt Independence Criterion (HSIC) is a commonly-used dependency measurement of two high-dimensional variables. In practice, we used the unbiased HSIC estimator in [295] with 𝑚 samples: HSIC𝑘,𝑙 (𝑈, 𝑉) = (cid:20) 1 𝑚(𝑚 − 3) tr( ˜𝑈 ˜𝑉𝑇 ) + 1𝑇 ˜𝑈11𝑇 ˜𝑉1 (𝑚 − 1) (𝑚 − 2) − 2 𝑚 − 2 1𝑇 ˜𝑈 ˜𝑉𝑇 1 (cid:21) , (5.15) where ˜𝑈 is the kernelized matrix of 𝑈 with RBF kernel 𝑘 by ˜𝑈𝑖 𝑗 = (1 − 𝛿𝑖 𝑗 )𝑘 (𝑢𝑖, 𝑢 𝑗 ), {𝑢𝑖} ∼𝑈 and the (1 − 𝛿𝑖 𝑗 ) sets the diagonal of ˜𝑈 to zeros. ˜𝑉 is defined similarly with kernel 𝑙, and 1 is an all-one vector. The HSIC value is equal to zero if and only if the two variables are independent. 96 5.6.1.4 Evaluation of Model Calibration In our main paper, we used the expected calibration error (ECE) to quantitatively evaluate the model calibration performance of our proposed EUC method. According to [235, 99], the basic idea of model calibration is that, if the confidence estimation ˆ𝑝 (probability of correctness) is well calibrated, we hope ˆ𝑝 represent the true probability of the case when the predicted label ˆ𝑦 is correct. Formally, this can be expressed as P( ˆ𝑦 = 𝑦| ˆ𝑝 = 𝑝) = 𝑝 (5.16) Since perfect calibration is infeasible due to the finite sample space, a practical way is to group all predicted confidence ˆ𝑝 into 𝑀 bins in the range of [0,1] such that the width of each bin is 1/𝑀. Therefore, for the 𝑚-th bin, the accuracy can be estimated by acc(𝐵𝑚) = 1 |𝐵𝑚 | ∑︁ 𝑖∈𝐵𝑚 I( ˆ𝑦𝑖 = 𝑦𝑖) (5.17) where 𝐵𝑚 is the set of indices of prediction ˆ𝑝 when it falls into the 𝑚-th bin. ˆ𝑦𝑖 and 𝑦𝑖 are predicted and ground truth labels. Besides, the average confidence for the 𝑚-th bin can be expressed as conf(𝐵𝑚) = 1 |𝐵𝑚 | ∑︁ 𝑖∈𝐵𝑚 ˆ𝑝𝑖 (5.18) To evaluate the mis-calibration error, the ECE is defined as the expectation of the gap between the accuracy and confidence in 𝑀 bins for all 𝑁 samples: ECE = 𝑀 ∑︁ 𝑚=1 |𝐵𝑚 | 𝑁 |acc(𝐵𝑚) − conf(𝐵𝑚)| (5.19) A perfect calibrated model means that ECE=0 and higher ECE value indicates that the model is less calibrated. 5.6.2 Implementation Details Network Architecture. 
As presented in our main paper, the proposed DEAR method as well as all other baselines are implemented on top of the four recent video action recognition models, i.e., I3D, TSM, SlowFast, and TPN. For simplicity, these models use ResNet-50 as the backbone 97 architecture and the network weights are initialized with the pre-trained model from the Kinetics- 400 benchmark. To avoid the impact of the validation experiments on the Kinetics and Mimetics datasets, the pre-trained model is not used and we train the model from scratch using the same hyperparameters. Specifically, for the I3D model, it is straightforward to implement our method by replacing the cross-entropy loss with the proposed EUC regularized EDL loss, and inserting the proposed CED module before the recognition head (fully-connected layers). For the TSM model, since the architecture of TSM is based on 2D convolution where the output feature embedding is with the size (𝐵, 𝑀𝐶, 𝐻, 𝑊), we recover the number of video segments 𝑀 as the temporal dimension such that the 5-dimensional tensor with size (𝐵, 𝐶, 𝑀, 𝐻, 𝑊) could be compatible with our proposed CED module for contrastive debiasing. For the SlowFast model, our CED module is inserted after the slow pathway because the feature embedding of slow pathway is more likely to be biased since it captures the static cues of video content. For the TPN model, we used the ResNet-50-like SlowOnly model as the recognition backbone and the auxiliary cross-entropy loss in the TPN head is kept unchanged. Training and Inference. In the training phase, we choose the exp function as the evidence function because we empirically found exp is numerically more stable when using the proposed EDL loss L𝐸 𝐷 𝐿. We set the hyperparameter 𝜆0 to 0.01 in EUC loss L𝐸𝑈𝐶 and set 𝜆 to 1.0 in the two CED losses. The weight of L𝐸𝑈𝐶 is set to 1.0 and the weight of the sum of the two CED losses is empirically set to 0.1. In practice, we found the model performance is robust to these hyperparameters. We used mini-batch SGD with nesterov strategy to train all the 3D convolution models. For all models, weight decay is set to 0.0001 and momentum factor is set to 0.9 by default. Our experiments are supported by two GeoForce RTX 3090 and two Tesla A100 GPUs. Since no additional parameters are introduced during inference, the inference speed of existing action recognition models is not affected. Dataset Information. For the UCF-101 and HMDB-51 datasets, we used the split1 for all experiments. For the MiT-v2 dataset, we only use the testing set for evaluation. To validate the 98 proposed CED module, we refer to [6] and select 10 action categories which are included in both Kinetics and Mimetics dataset. These categories are canoeing or kayaking, climbing a rope, driving car, golf driving, opening bottle, playing piano, playing volleyball, shooting goal (soccer), surfing water, and writing. The recognition model is trained from scratch on the 10 categories of Kinetics training set, and tested on these categories of both Kinetics and Mimetics testing set. 99 CHAPTER 6 OPEN-SET TEMPORAL ACTION LOCALIZATION 6.1 Introduction Temporal Action Localization (TAL) aims to temporally localize and recognize human actions in an untrimmed video. With the success of deep learning in video understanding [27, 74, 148, 316, 35] and object detection [268, 24, 314], TAL has experienced remarkable advance in recent years [30, 388, 356, 192]. 
However, these works are rooted in the closed-set assumption that testing videos are assumed to contain only the pre-defined action categories, which is impractical in an open world where unknown human actions are inevitable to appear. In this paper, we for the first time step forward the Open Set Temporal Action Localization (OSTAL) problem. OSTAL aims to not only temporally localize and recognize the known actions but also reject the localized unknown actions. As shown in Fig. 6.1, given an untrimmed video (the top row) from open world, traditional TAL (the middle row) could falsely accept the unknown action clip HammerThrow as one of the known actions such as the LongJump, while the proposed OSTAL (the bottom row) could correctly reject the clip as the Unknown. Besides, both tasks need to differentiate between foreground actions and the Backgrounds which are purely background frames. The proposed OSTAL task is fundamentally more challenging than both the TAL and the closely relevant open set recognition (OSR) [278] problems. On one hand, the recognition and localization of known actions become harder due to the mixture of background frames and unknown foreground actions. Existing TAL methods typically assign the mixture with a non-informative Background label or a wrong action label, which are unable to differentiate between them. On the other hand, different from the OSR problem, rejecting an unknown action is conditioned on positively localizing a foreground action so that the localization quality is critical to the OSTAL. To tackle these challenges, we propose a general framework OpenTAL by decoupling the overall This chapter is adapted from the following publication: "Wentao Bao, Qi Yu, and Yu Kong. OpenTAL: Towards open set temporal action localization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Oral, 2022." 100 Figure 6.1 OSTAL and TAL Tasks. The OSTAL task is different from the TAL in that, there exist unknown actions in untrimmed open-world videos and the OSTAL models need to reject the positively localized action (e.g., HammerThrow) as the Unknown, rather than falsely assign it a known label such as the LongJump. OSTAL objective into three interconnected components: uncertainty-aware action classification, actionness prediction, and temporal location regression. In essence, the foreground actions are distinguished from the background by the actionness prediction and localized by the temporal localization, while the known and unknown foreground actions are discriminated by the learned evidential uncertainty from the classification module. To achieve these goals, we propose three novel technical approaches as follows. First, action classification is developed to recognize known actions and quantify the classification uncertainty by recent evidential deep learning (EDL) [283, 401, 4, 316]. To enable this module to learn from important samples, we propose an importance-balanced EDL method by leveraging the magnitude of EDL gradient and evidential features. Second, actionness prediction is to differentiate between foreground actions (positives) and background frames (negatives). In the open set setting, due to the mixture of unknown foreground actions (unlabeled) and background frames, learning from the labeled known actions and the mixture intrinsically reduces to a positive-unlabeled (PU) learning problem [14]. To this end, we propose a PU learning method by selecting the top negative samples from the mixture as the true negatives. 
Third, the temporal localization module is trained to not only localize the known actions but also calibrate the classification uncertainty. We propose an IoU-aware uncertainty calibration (IoUC) method by using the temporal Intersection-over-Union (IoU) as the localization quality to calibrate the uncertainty. 101 LongJumpHighJumpHammerThrowKnown ActionsUnknown ActionsOpen World VideoLongJumpHighJumpLongJumpTAL TaskBackgroundLongJumpHighJumpUnknownOSTAL TaskBackground Based on the existing TAL datasets THUMOS14 [128] and ActivityNet1.3 [22], we set up a new benchmark to evaluate baselines and the proposed OpenTAL method for the OSTAL task, where the Open Set Detection Rate is introduced to comprehensively evaluate the OSTAL perfor- mance. Experimental results show significant superiority of our method and indicate large room for improvement in this direction. Our main contribution is threefold: • To the best of our knowledge, this work is the first attempt on open set temporal action localization (OSTAL), which is fundamentally more challenging but highly valuable in open-world settings. • We propose a general OpenTAL framework to address the unique challenges of OSTAL as compared with existing TAL and OSR problems. It is flexible to enable existing TAL models for open-set scenarios. • The proposed importance-balanced EDL, PU learning, and IoUC methods are found effective for OSTAL tasks based on the OpenTAL framework. 6.2 Related Work Temporal Action Localization The goal of Temporal Action Localization (TAL) is to recognize and temporally localize all the action instances in an untrimmed video. Existing TAL methods fall into two dominant paradigms: one-stage and two-stage approaches. The two-stage approaches [356, 388, 297, 354, 210] generate class-agnostic temporal proposals[7, 109, 195, 194] at first and then perform the classification and boundary refinement of each proposal. The heuristic anchor design and the closed-set definition of the pre-trained proposal generation limit their applicability to the open-set problem. One-stage methods [370, 20, 210, 192] do not rely on the action proposal generation and can be typically trained in an end-to-end manner. These methods obtain the temporal boundaries first based on frame-level features and then perform global reasoning by multi-stage refinement or modeling the temporal transitions. Recently, AFSD [192] is proposed following the anchor-free design without actionness and proposals, which is a lightweight and flexible framework. 102 Figure 6.2 Proposed OpenTAL. Given untrimmed videos as input, the OpenTAL method is developed on existing TAL models (such as AFSD [192]) toward the OSTAL scenario. It consists of action classification, actionness prediction, and location regression, which are learned by the proposed MIB-EDL loss (Eq. (6.5)), PU learning (Eq. (6.6)), and localization loss (Eq. (6.7)), respectively. Furthermore, the IoU-aware uncertainty calibration is proposed to calibrate the uncertainty estimation by considering localization quality (Eq. (6.8)). In inference, with a two-step decision procedure by leveraging the uncertainty and actionness, video actions from the known and unknown classes, as well as background frames can be distinguished in the OSTAL setting (see Algorithm 6.1). While a lot of recent methods focus on improving the proposal generation [7, 109, 195, 194, 399] or boundary refinement [192], a few focus on boosting the classification accuracy [402, 288, 413]. 
The above approaches assume that all of the action instances in untrimmed videos belong to pre-defined categories, which impedes their application to open-world scenarios. Though the open set is considered in [414], their method is designed for efficient annotation in few-shot learning tasks. In this paper, an OSTAL problem is formulated to handle the unknown actions in TAL applications. Open Set Recognition Open set recognition (OSR) aims to recognize known classes and reject the unknown. The pioneering work by Scheirer et al. [278] formalized the definition of OSR and introduced an “one-vs-set" machine based on binary SVM, which inspired a line of SVM-based OSR methods [279, 123, 133]. Benefited by the deep neural networks (DNNs), Bendale et al. [16] proposed the first DNN-based OSR method OpenMax, which leverages Extreme Value Theory (EVT) to expand the 𝐾-class softmax classifier. Recently, Fang et al. [69] theoretically proved the learnability of OSR classifier and the generalization bound. Existing generative OSR methods [83, 64, 257, 147, 33, 407, 381] utilize GAN [92], generative causal model, or mixup augmentation to 103 TAL Model⋮Location RegressionActionnessPredictionevidencesegment featuresuntrimmed videouncertainty⋮⋮𝜶𝐷𝑖𝑟(𝒑|𝜶)⋮⋮⋮𝐿*𝐿action proposals⋮⋮𝒫𝒰𝑎actionnessAction ClassificationIoU-aware Uncertainty CalibrationuncertaintytIoUℒ!"#$%&’ℒ()ℒ’*+ℒ",)+ground truth locations knownunknownbackgroundknownunknownbackgroundeasyhardeasyhard generate the samples of the unknown. From the reconstruction perspective, some literature [372, 249, 302] leverage VAE [143] or self-supervised learning to reconstruct the representation of known class data to identify the unknown. Prototype learning and metric learning methods [361, 290, 362, 34, 33, 391, 28] aim to identify the unknown by producing large distance to the prototype of known class data. Recently, uncertainty estimation methods [233, 316, 335] by probabilistic and evidential deep learning show promising results on OSR problems. In this paper, we step further toward the OSTAL problem. We are aware of analogous extensions from OSR to open set object detection [224, 61, 131] and segmentation [258, 246, 333, 120]. However, it is the uniqueness of the localization in an open world that makes the OSTAL problem even more challenging and valuable in practice. 6.3 Approach Setup Given an untrimmed video, the OSTAL task requires a model to localize all actions with temporal locations 𝑙𝑖 = (𝑠𝑖, 𝑒𝑖), assign the actions with labels 𝑦𝑖 ∈ {0, 1, . . . , 𝐾 } where 𝑦𝑖 = 0 indicates the action consisting of background frames, and reject the actions from novel classes as the unknown. In the training, the model only has access to the video data and the annotations of known actions, while the annotations of unknown actions are not given. This setting is different from the OSR problem where both annotations and data of unknown classes are not given because it is impractical in the TAL task to discard video segments of unknown actions. Overview Fig. 6.2 shows an overview of the proposed OpenTAL. Given an untrimmed video, the features of action proposals are obtained from an existing TAL model such as the AFSD [192]. To fulfill OSTAL, we decouple the objective into three sub-tasks by a trident head, including action classification, actionness prediction, and location regression. The three branches are learned by multi-task loss functions, which will be introduced in detail. 
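As a structural illustration of this trident head, the following PyTorch sketch maps per-proposal features to 𝐾-class evidence (and the derived uncertainty), an actionness score, and start/end offsets. The layer sizes, feature dimension, and class count are assumptions for illustration and do not reproduce the exact AFSD head.

```python
import torch
import torch.nn as nn

class TridentHead(nn.Module):
    # Three light branches on top of per-proposal TAL features:
    # (1) K-class evidence for uncertainty-aware classification,
    # (2) a scalar actionness score, (3) start/end offsets for location refinement.
    def __init__(self, feat_dim=512, num_known=15):
        super().__init__()
        self.cls_branch = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                        nn.Linear(256, num_known))
        self.act_branch = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                        nn.Linear(256, 1))
        self.loc_branch = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                        nn.Linear(256, 2))

    def forward(self, proposal_feats):
        evidence = torch.exp(torch.clamp(self.cls_branch(proposal_feats), max=10))
        alpha = evidence + 1.0
        uncertainty = alpha.shape[1] / alpha.sum(dim=1)        # u = K / S
        actionness = torch.sigmoid(self.act_branch(proposal_feats)).squeeze(-1)
        offsets = self.loc_branch(proposal_feats)              # (delta_start, delta_end)
        return alpha, uncertainty, actionness, offsets

# Toy usage on 32 proposals with 512-d features and 15 known classes.
head = TridentHead()
alpha, u, a, d = head(torch.randn(32, 512))
print(alpha.shape, u.shape, a.shape, d.shape)
```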
Motivations Existing TAL models typically adopt a (𝐾 +1)-way action classification by assigning the background video frames with the (𝐾 + 1)-th class Background. However, this paradigm is unable to handle the OSTAL case when unknown actions exist in the Background class. 104 To solve this problem, on one hand, one would attempt to append the 𝐾 known classes with an additional Unknown category in an existing TAL system. However, this solution is practically infeasible under the OSTAL setting, because finding the video segments to train a classifier with the class Unknown relies on the temporal boundary annotations of unknown actions, which are not available under our OSTAL setting. Though one could relax the OSTAL setting by providing temporal annotations of the unknowns in training, learning a (𝐾 + 1)-way classifier is nontrivial due to the vague semantics of the Unknown, and this relaxation has little practical significance in an open-world where we have nothing about the prior knowledge of unknown actions. On the other hand, one may remove the Unknown or the Background class from training data, which are both infeasible under the OSTAL setting because (i) we have no temporal annotations of the unknown actions to remove them, and (ii) the pure background frames provide indispensable temporal context for action localization. Therefore, in contrast to the OSR problem, a unique technical challenge of OSTAL lies in distinguishing between actions of known and unknown classes, as well as the background frames. Moreover, since the unknown actions are mixed with background frames without annotations, learning to distinguish foreground actions essentially reduces to a semi-supervised OSR prob- lem [376, 276], that the model is trained with the labeled “known known" actions and the unlabeled “known unknown" actions while testing with data containing the “unknown unknown" actions1. To tackle these unique challenges, we propose to decouple the (𝐾 + 1)-way action classification into 𝐾-way uncertainty-aware classification (Sec. 6.3.1) and actionness prediction (Sec. 6.3.2). Thus, we could address the first challenge above by jointly leveraging the uncertainty and actionness in a two-level decision-making (see Table 6.1) and the second challenge by the PU learning (Sec. 6.3.2). 6.3.1 Action Classification 𝐾-way Uncertainty-aware Classification Following the existing Evidential Deep Learning (EDL) [283, 316], which is efficient to quantify the classification uncertainty, we assume a Dirichlet distribution 1Refer to [86, 61] for more detailed discussions on these terminologies. 105 Table 6.1 Our technical motivations for the OSTAL task. The notations ↓ and ↑ denote small and large values, repsectively. Known Action Unknown Action Background uncertainty (𝑢) actionness (𝑎) ↓ ↑ ↑ ↑ ↑ ↓ Dir(p|𝜶) over the categorical probability p ∈ R𝐾, where 𝜶 ∈ R𝐾 is the Dirichlet strength. The EDL aims to directly predict 𝜶 by deep neural networks (DNNs). The model is trained by minimizing the following negative log-likelihood of data {𝑥𝑖, 𝑦𝑖}: L (𝑖) EDL (𝜶𝑖) = 𝐾 ∑︁ 𝑗=1 𝑡𝑖 𝑗 (log(𝑆𝑖) − log(𝛼𝑖 𝑗 )), (6.1) where 𝑡𝑖 𝑗 is a binary element of the one-hot form of label 𝑦𝑖, and 𝑡𝑖 𝑗 = 1 only when 𝑦𝑖 = 𝑗, and 𝑆𝑖 = (cid:205) 𝑗 𝛼𝑖 𝑗 is the total strength over 𝐾 classes. In testing, given the sample 𝑥∗ 𝑖 , the action classification branch (DNN) produces non-negative evidence output e𝑖 ∈ R𝐾 + . 
Then, the expectation of the classification probability is obtained by E[p𝑖] = 𝜶𝑖/𝑆𝑖 where 𝜶𝑖 = e𝑖 + 1 according to the evidence theory [284] and subjective logic [130]. The classification uncertainty is estimated by 𝑢𝑖 = 𝐾/𝑆𝑖. However, the above EDL method is empirically found ineffective in the OSTAL task since Eq. (6.1) gives equal consideration to each sample, which is practically not the case in OSTAL. In this paper, we propose to improve the generalization capability of EDL by encouraging the model to focus more on important samples in a principled way. Momentum Importance-Balanced EDL Inspired by the recent advances in imbalanced visual classification [146, 254], the sample importance can be measured by the influence function which is determined by the gradient norm. Specifically, let h𝑖 ∈ R𝐷 be the feature input of the last DNN layer, an exponential evidence function is applied to predict the evidence, i.e., e𝑖 ≜ exp(w𝑇 h𝑖) where w ∈ R𝐷×𝐾 are the learnable weights of the DNN layer. The gradient g𝑖 of the EDL loss L (𝑖) EDL w.r.t. the logits z𝑖 ≜ w𝑇 h𝑖 is derived: 𝑔𝑖 𝑗 = 𝜕L (𝑖 𝑗) EDL 𝜕𝑧𝑖 𝑗 = 𝑡𝑖 𝑗 (cid:21) (cid:20) 𝑆𝑖 − 𝐾𝛼𝑖 𝑗 𝑆𝑖𝛼𝑖 𝑗 = 𝑡𝑖 𝑗 (cid:21) , − 𝑢𝑖 (cid:20) 1 𝛼𝑖 𝑗 (6.2) 106 where the chain rule and the equality 𝑢𝑖 = 𝐾/𝑆𝑖 are used. Since 𝑡𝑖 𝑗 = 0 when 𝑗 ≠ 𝑦𝑖, it is interesting to see a simple but meaningful gradient form, i.e., 𝑔𝑖𝑘 = 1/𝛼𝑖𝑘 − 𝑢𝑖 where 𝑘 = 𝑦𝑖, and in our supplement, we proved that |𝑔𝑖𝑘 | ∈ [0, 1). Furthermore, inspired by [254], we consider the influence function given by the gradient norm of EDL loss w.r.t. the network parameters w. According to the chain rule of z𝑖 = w𝑇 h𝑖, the influence value 𝜔𝑖 can be derived: 𝜔𝑖 = (cid:32) 𝐾 ∑︁ 𝑘=1 |𝑔𝑖𝑘 | (cid:33) (cid:32) 𝐷 ∑︁ 𝑑=1 (cid:33) |ℎ𝑖𝑑 | = ∥g𝑖 ∥1 · ∥h𝑖 ∥1. (6.3) Detailed proof can be found in the supplement. We define the loss weight of sample 𝑥𝑖 as the moving mean of influence values within the neighboring region of ∥g𝑖 ∥1: 𝑖 = 𝜖 · ˜𝜔(𝑡−1) ˜𝜔(𝑡) 𝑖 + (1 − 𝜖) · 1 |Ω𝑚 | ∑︁ Ω𝑚, (6.4) where Ω𝑚 is a subset of 𝜔𝑖 whose gradient norm ∥g𝑖 ∥1 falls into the 𝑚-th bin out of total 𝑀 bins in the region [0, 1], i.e., Ω𝑚 = {𝜔𝑖 |∥g𝑖 ∥1 ∈ [ 𝑚−1 𝑀 ], 𝑚 = 1, . . . , 𝑀 }. The 𝜖 is a momentum factor within [0, 1], 𝑀 is a constant, and 𝑡 is the training iteration. We set the initial weight ˜𝜔(0) 𝑀 , 𝑚 𝑖 as the 1.0. A larger 𝜖 means the set of influence values 𝜔𝑖 are less considered, while 𝑀 controls the granularity of the neighborhood of the gradient norm. Eventually, the proposed Momentum Importance-Balanced (MIB) EDL loss is defined as: LMIB-EDL = 1 𝑁 𝑁 ∑︁ 𝑖=1 𝑖 L (𝑖) ˜𝜔(𝑡) EDL (𝜶𝑖). (6.5) The proposed MIB-EDL loss encourages the model to smoothly focus on important samples as the training iteration increases. In practice, to stabilize the training, the re-weighting is applied after 𝑇0 training iterations. Different to [254] that uses the inverse of 𝜔𝑖 to down-weight the influential samples for a balanced closed-set recognition, we use Eq. (6.3) to up-weight these samples for open-set recognition, and (6.4) to achieve a smooth update of the sample weight. 6.3.2 Actionness Prediction Due to the mixture of unknown actions and pure background frames, it is not sufficient to distinguish between them by the evidential uncertainty over 𝐾 known classes. Therefore, predicting 107 the actionness that indicates how likely a sample is a foreground action is critical. We notice the fact that data from known classes are positive data while the samples from the “background" mixture are unlabeled. 
This intrinsically reduces to a semi-supervised learning problem called positive- unlabeled (PU) learning [14]. In this paper, we propose a simple yet effective PU learning method to predict the actionness. Let ˆ𝑎𝑖 ∈ [0, 1] be the predicted actionness score of the sample 𝑥𝑖, the actionness in a training batch ˆA = { ˆ𝑎𝑖} can be splitted into the positive set ˆP = { ˆ𝑎𝑖 |𝑦𝑖 ≥ 1} and the unlabeled background set ˆU = { ˆ𝑎𝑖 |𝑦𝑖 = 0}. In this paper, we propose to ascendingly sort the ˆU and select top-𝑀 samples to form the most likely negative set ˆN = { ˆ𝑎𝑖 | ˆ𝑎𝑖 ∈ 𝑠𝑜𝑟𝑡 ( ˆU)1,...,𝑀}. Then, a binary cross-entropy (BCE) loss could be applied to the ˆP and ˆN : LACT( ˆP, ˆN ) = − 1 | ˆP | ∑︁ ˆ𝑎𝑖 ∈ ˆP log ˆ𝑎𝑖 − 1 | ˆN | ∑︁ ˆ𝑎𝑖 ∈ ˆN log(1 − ˆ𝑎𝑖). (6.6) Here, to achieve a balanced BCE training, we set the size of negative set to 𝑀 = | ˆN | := min(| ˆP |, | ˆU|) considering that in most training batches we have | ˆU| ≫ | ˆP |. This BCE loss will push the probably pure background samples far away from positive actions. Though this method is straightforward, the learned actionness scores are found discriminative enough to distinguish between the foreground actions and background frames in the OSTAL setting (see Fig. 6.4a). 6.3.3 Location Regression To maintain the flexibility of our method on existing TAL models, the temporal location regression follows the design of the TAL models. Take the state-of-the-art TAL model AFSD [192] as an example, it consists of a coarse stage to predict the location proposals ˆ𝑙𝑖 = [ ˆ𝑠𝑖, ˆ𝑒𝑖] and a refined stage to predict the temporal offset ˆ𝛿𝑖 = [ ˆ𝛿(𝑠) 𝑖 ] with respect to the ˆ𝑙𝑖. The coarse stage , ˆ𝛿(𝑒) 𝑖 is learned by temporal Intersection-over-Union (tIoU) loss, while the refined stage is learned by an 𝐿1 loss: LLOC({ ˆ𝑙𝑖}) = LLOC({ ˆ𝛿𝑖}) = 1 𝑁𝐶 1 𝑁𝑅    𝑖 ∑︁ 𝑖 ∑︁ I[𝑦𝑖 ≥ 1] (cid:18) 1 − (cid:19) | ˆ𝑙𝑖 ∩ 𝑙𝑖 | | ˆ𝑙𝑖 ∪ 𝑙𝑖 | (6.7) I[𝑦𝑖 ≥ 1] (| ˆ𝛿𝑖 − 𝛿𝑖 |), 108 where 𝑁𝐶 and 𝑁𝑅 are the corresponding number of samples that are matched with the ground truth action locations by an IoU threshold. The indicator function I[𝑦𝑖 ≥ 1] filters out the unmatched samples which are treated as the “background" data. In testing, the predicted location is recovered by 𝑙∗ 𝑖 = [ ˆ𝑠𝑖 + 0.5( ˆ𝑒𝑖 − ˆ𝑠𝑖) ˆ𝛿(𝑠) 𝑖 , ˆ𝑒𝑖 + 0.5( ˆ𝑒𝑖 − ˆ𝑠𝑖) ˆ𝛿(𝑒) 𝑖 ]. Note that our OpenTAL framework is not limited to specific TAL models but is general in design. 6.3.4 IoU-aware Uncertainty Calibration Though the loss functions defined by Eqs. (6.5)(6.6)(6.7) is sufficient for a complete OSTAL task, the learned uncertainty in the classification module is not calibrated by considering the localization performance. Intuitively, an action proposal of high temporal overlap with the ground truth location should contain more evidence and thus low uncertainty. To this end, we propose a novel IoU-aware uncertainty calibration method: L (𝑖) IoUC ( ˆ𝑙𝑖, 𝑢𝑖) = −𝑤 ˆ𝑙𝑖,𝑙𝑖 log(1 − 𝑢𝑖) − (1 − 𝑤 ˆ𝑙𝑖,𝑙𝑖 ) log(𝑢𝑖) (6.8) where the weight 𝑤 is a clipped form of the temporal IoU between the predicted and ground truth locations: 𝑤 ˆ𝑙𝑖,𝑙𝑖 = max (cid:16) 𝛾, IoU( ˆ𝑙𝑖, 𝑙𝑖) (cid:17) (6.9) where the 𝛾 is a small non-negative constant. The cross-entropy form in Eq. (6.8) and (6.9) will encourage the model to produce high uncertainty (𝑢𝑖 → 1) for action proposals with low localization quality (𝑤 → 𝛾). 
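Before turning to the motivation for the clipping in Eq. (6.9), the following PyTorch sketch illustrates the PU actionness loss of Eq. (6.6), where only the lowest-scoring unlabeled proposals are treated as true negatives. The function name and the toy inputs are illustrative.

```python
import torch

def pu_actionness_loss(act_scores, labels, eps=1e-8):
    # Eq. (6.6): positives are proposals matched to known actions (label >= 1);
    # the unlabeled "background" mixture is sorted by actionness and only the
    # lowest-scoring M = min(|P|, |U|) samples are treated as true negatives.
    pos = act_scores[labels >= 1]
    unlabeled = act_scores[labels == 0]
    if pos.numel() == 0 or unlabeled.numel() == 0:
        return act_scores.new_zeros(())
    m = min(pos.numel(), unlabeled.numel())
    neg = torch.sort(unlabeled).values[:m]            # most likely pure background
    loss_pos = -torch.log(pos + eps).mean()
    loss_neg = -torch.log(1.0 - neg + eps).mean()
    return loss_pos + loss_neg

# Toy usage: 6 proposals; label 0 marks the unlabeled background mixture.
scores = torch.tensor([0.9, 0.8, 0.2, 0.6, 0.1, 0.3])
labels = torch.tensor([2, 1, 0, 0, 0, 0])
print(float(pu_actionness_loss(scores, labels)))
```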
The motivation behind the clipping by max() is that given the ground truth of known actions, both the proposals of background frames and unknown actions are not overlapped with the ground truth such that IoU( ˆ𝑙𝑖, 𝑙𝑖) ≤ 0, the clipping could avoid reversing the loss value from positive to negative, while still maintaining a low localization quality 𝛾. Besides, it is reasonable to encourage high uncertainty 𝑢𝑖 by small 𝛾 for the location proposals of the background and the unknown actions in the OSTAL setting. 109 Algorithm 6.1 Inference Procedure 𝑖=1 by OpenTAL. Require: Untrimmed test video. Require: Trained OpenTAL model. Require: Threshold 𝜏 from training data by Eq. (6.11) 1: Data pre-processing (if applicable). 𝑖 , ˆ𝑦𝑖, 𝑢𝑖, ˆ𝑎𝑖}|𝑁 2: Predict proposals G = {𝑙∗ 3: Post-processing (if applicable). 4: for each proposal G𝑖 ∈ G do if ˆ𝑎𝑖 < 0.5 then 5: 6: 7: 8: 9: 10: 11: 12: 13: end for G𝑖 is Known by ˆ𝑦𝑖 = arg max 𝑗 E[p𝑖 𝑗 ]. G𝑖 is a Background; continue. end if if 𝑢𝑖 > 𝜏 then G𝑖 is Unknown. end if else ⊲ Background ⊲ Unknown Action ⊲ Known Action 6.3.5 Training and Inference The training procedure is to minimize the weighted sum of losses defined by Eqs. (6.5)(6.6)(6.7)(6.8): L = 𝜇LMIB-EDL + LACT + LLOC + E[L (𝑖) IoUC ], (6.10) where 𝜇 is a hyperparameter, and E[·] is to take the mean loss values over the input samples. In inference, the untrimmed video input is fed into a TAL model, and our OpenTAL method trained on the TAL model could produce multiple action locations {𝑙∗ 𝑖 }, classification labels ˆ𝑦𝑖 = arg max 𝑗 ∈[1,...,𝐾] E[p𝑖 𝑗 ], classification uncertainty 𝑢𝑖, and actionness score ˆ𝑎𝑖. Together with the 𝑢𝑖 and ˆ𝑎𝑖, a positively localized foreground action 𝑥𝑖, i.e., 𝑎𝑖 > 0.5, can be accepted as known class ˆ𝑦𝑖, or rejected as the unknown by the following simple scoring function: 𝑃(𝑥𝑖 |𝑎𝑖 > 0.5) = 𝑢𝑛𝑘𝑛𝑜𝑤𝑛, if 𝑢𝑖 > 𝜏, ˆ𝑦𝑖, otherwise.    (6.11) The complete inference procedure is shown in Algorithm 6.1. In addition to this two-level decision, one-level decisions by the functional formulas of 𝑃(𝑥𝑖) w.r.t. to 𝑢𝑖 and 𝑎𝑖 are also plausible (see Table 6.5). However, we empirically found that Eq. (6.11) is the most effective formula while maintaining the explainable nature of decision-making. 110 6.4 Experiments 6.4.1 Implementation Details Our method is implemented on the AFSD [192] model2, which is a state-of-the-art TAL model. Pre-trained I3D [27] backbone is used in AFSD. The proposed OpenTAL is applied to both the coarse and refined stages of AFSD. Specifically, the proposed MIB re-weighting is applied after 10 training epochs. We empirically set the momentum 𝜖 to 0.99 and the number of bins 𝑀 to 50. The small constant 𝛾 in Eq. (6.9) is set to 0.001. The loss weight 𝜇 in Eq. (6.10) is set to 10. We trained the model 25 epochs to ensure full convergence. The rest settings in AFSD are kept unchanged. 6.4.2 Datasets THUMOS14 [128] and ActivityNet1.3 [22] are two commonly-used datasets for TAL evalua- tion. The THUMOS14 dataset contains 200 training videos and 212 testing videos. ActivityNet1.3 dataset contains about 20K videos with 200 human activity categories. Since our method is not limited by data modality, we use RGB videos for training and testing by default. To enable OSTAL evaluation, we randomly select 3/4 THUMOS14 categories of the training videos as the known data. This random selection is repeated to generate three THUMOS14 open set splits between the known and the unknown. 
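Returning to the inference procedure of Algorithm 6.1, the two-level decision of Eq. (6.11) reduces to the short sketch below. The 0.5 actionness threshold follows the algorithm, the uncertainty threshold 𝜏 is obtained from training data, and the function name is illustrative.

```python
def classify_proposal(actionness, uncertainty, known_probs, tau):
    # Two-level decision of Eq. (6.11): actionness separates foreground from background,
    # then evidential uncertainty separates known from unknown foreground actions.
    if actionness < 0.5:
        return "background"
    if uncertainty > tau:
        return "unknown"
    # Known action: take the arg-max over the K known classes.
    return max(range(len(known_probs)), key=lambda k: known_probs[k])

# Toy usage; in practice tau is selected from training statistics.
print(classify_proposal(0.9, 0.2, [0.1, 0.7, 0.2], tau=0.6))   # -> class index 1
print(classify_proposal(0.9, 0.8, [0.4, 0.3, 0.3], tau=0.6))   # -> "unknown"
print(classify_proposal(0.2, 0.1, [0.5, 0.3, 0.2], tau=0.6))   # -> "background"
```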
Considering that ActivityNet1.3 is newer and covers most THU- MOS14 categories, ActivityNet1.3 is not suitable to be the closed-set training data when the model is tested on THUMOS14. Therefore, we train models on the THUMOS14 known split and use the THUMOS14 unknown split and the disjoint categories of ActivityNet1.3 as two sources of open set testing data. To get the disjoint categories from ActivityNet1.3, we manually removed 14 semantically overlapping categories by referring to the THUMOS14 categories. Detailed dataset information could be found in our supplement. 6.4.3 Evaluation Protocols The mean Average Precision (mAP) is typically used for the evaluation of closed-set TAL performance. To enable OSTAL performance evaluation, the Area Under the Receiver Operating Characteristic (AUROC) curve and the Area Under the Precision-Recall (AUPR) are introduced to 2AFSD: https://github.com/TencentYoutuResearch/ActionDetection-AFSD 111 evaluate the performance of detecting the unknown from the known actions for positively localized actions. To address the operational meaning in practice, we additionally report the False Alarm Rate at a True Positive Rate of 95% (FAR@95), by which a smaller value indicates better performance. However, we noticed the metrics above are insufficient for the OSTAL task because the multi-class classification performance of the known classes in the OSTAL setting is ignored. Inspired by the Open Set Classification Rate [62, 240, 33], we propose the Open Set Detection Rate (OSDR), which is defined as the area under the curve of Correct Detection Rate (CDR) and False Positive Rate (FPR). Given an operation point 𝜏 of the scoring function 𝑃(𝑥) for detecting the unknown and an operation point 𝑡0 of the tIoU for localizing the foreground actions, CDR and FPR are defined as:    CDR(𝜏, 𝑡0) = |{𝑥|(𝑥 ∈ F𝑘 ) ∧ ( ˆ𝑓𝑥|𝑦 = 𝑦) ∧ 𝑃(𝑥) < 𝜏}| |F𝑘 | |{𝑥|(𝑥 ∈ F𝑢) ∧ 𝑃(𝑥) < 𝜏}| |F𝑢 | where F𝑘 is the set of positively localized known actions, i.e., F𝑘 = {𝑥|(tIoU > 𝑡0) ∧ (𝑦 ∈ FPR(𝜏, 𝑡0) = (6.12) [1, . . . , 𝐾])}, and F𝑢 is the set of positively localized unknown actions, i.e., F𝑢 = {𝑥|(tIoU > 𝑡0) ∧ (𝑦 = 0)}. The CDR indicates the fraction of known actions that are positively localized and correctly classified into their known classes, while the FPR denotes the fraction of unknown actions that are positively localized but falsely accepted as an arbitrary known class. Higher OSDR indicates better performance for the OSTAL task. For stable evaluation, all results are reported by averaging the results of each evaluation metric over the three THUMOS14 splits. Results are reported at the tIoU threshold 0.3 for THUMOS14 and 0.5 for ActivityNet1.4, and results by other thresholds are in the supplement. 6.4.4 Comparison with State-of-the-arts The OpenTAL method is compared with the following baselines based on the AFSD: (1) SoftMax: use the softmax confidence score to identify the unknown. (2) OpenMax: use Open- Max [16] in testing to append the softmax scores with unknown class. (3) EDL: similar to [316], the vanilla EDL method is used to replace the traditional cross-entropy loss for uncertainty quan- tification. Models are tested using both the THUMOS14 unknown spits and the ActivityNet1.3 112 Table 6.2 OSTAL Results (%). Models trained on the THUMOS14 closed set are tested on the open sets by including the unknown classes from THUMOS14 and ActivityNet1.3, respectively. The mAP is provided as the reference of the TAL results on the THUMOS14 closed set. 
6.4.4 Comparison with State-of-the-Arts
The OpenTAL method is compared with the following baselines, all built on AFSD: (1) SoftMax: use the softmax confidence score to identify the unknown. (2) OpenMax: use OpenMax [16] in testing to append an unknown class to the softmax scores. (3) EDL: similar to [316], the vanilla EDL method is used to replace the traditional cross-entropy loss for uncertainty quantification. Models are tested using both the THUMOS14 unknown splits and the ActivityNet1.3 disjoint subset. Results are reported in Table 6.2.

Table 6.2 OSTAL Results (%). Models trained on the THUMOS14 closed set are tested on the open sets by including the unknown classes from THUMOS14 and ActivityNet1.3, respectively. The mAP is provided as the reference of the TAL results on the THUMOS14 closed set.
                      THUMOS14 as the Unknown                ActivityNet1.3 as the Unknown
Methods          FAR@95 (↓)  AUROC  AUPR   OSDR     FAR@95 (↓)  AUROC  AUPR   OSDR      mAP
SoftMax             85.58    54.70  31.85  23.40       85.05    56.97  53.54  27.63    55.81
OpenMax [16]        90.34    53.26  33.17  13.66       91.36    51.24  54.88  15.73    36.36
EDL [316]           81.42    64.05  40.05  36.26       84.01    62.82  53.97  38.56    52.24
OpenTAL             70.96    78.33  58.62  42.91       63.11    82.97  80.41  50.49    55.02

The results show that OpenTAL outperforms the baselines by large margins on all OSTAL metrics, while still keeping comparable closed-set TAL performance (less than 1% mAP decrease). The results also show that OpenMax does not work well on the OSTAL task, especially when the large-scale ActivityNet1.3 dataset is used as the unknown. The EDL works well but is still far behind the proposed OpenTAL. Fig. 6.3 shows the detailed evaluation by the curves of AUROC and OSDR on one THUMOS14 split. They clearly show that the proposed OpenTAL is consistently better than the baselines at different operation points of the scoring values and on different open set splits.

Figure 6.3 ROC and OSDR curves on one THUMOS14 split. Numbers in the brackets are AUROC or OSDR values. (a) ROC Curves: OpenTAL (79.22), EDL (68.65), OpenMax (49.56), SoftMax (55.96). (b) OSDR Curves: OpenTAL (38.54), EDL (32.73), OpenMax (3.15), SoftMax (21.63).

6.4.5 Ablation Study
Component Ablation. By individually removing the major components of OpenTAL, three model variants are compared. (1) Without MIB: the proposed MIB re-weighting is removed so that the vanilla EDL loss (Eq. (6.1)) is used. (2) Without ACT: the actionness prediction is removed so that the (K + 1)-way classification in L_MIB-EDL (Eq. (6.5)) is adopted. (3) Without IoUC: the loss L_IoUC (Eq. (6.8)) is removed from the training. Results are reported in Table 6.3. They show that OpenTAL achieves the best performance. Specifically, the MIB re-weighting strategy contributes the most to the OSDR performance gain, by around 30%. The actionness prediction (ACT) contributes the most to the FAR@95, AUROC, and AUPR metrics. Besides, the proposed IoUC loss also leads to significant performance gains on all metrics. These observations demonstrate the effectiveness of the three components for the OSTAL task.

Table 6.3 Ablation Results (%). The proposed EDL re-weighting method (MIB), the actionness prediction (ACT), and the IoUC loss are individually ablated from the OpenTAL.
Variants    MIB   ACT   IoUC    FAR@95 (↓)  AUROC  AUPR   OSDR
(1)                ✓     ✓         77.20     76.41  56.65  12.10
(2)          ✓           ✓         82.85     58.12  31.80  37.89
(3)          ✓     ✓               79.64     62.73  37.86  39.39
OpenTAL      ✓     ✓     ✓         70.96     78.33  58.62  42.91

Choices of Re-weighting Methods. We compare the proposed MIB re-weighting method (MIB (soft)) with MIB (hard) and existing sample re-weighting methods from the literature in Table 6.4. The results show that the focal loss (Focal) [197] does not work well with the OpenTAL framework. The GHM [170] and IB [254] methods achieve comparable FAR@95, AUROC, and AUPR performance, but their OSDR results are still far behind ours. Note that these methods are all designed for closed-set recognition, so the proposed MIB is more suitable for open-set scenarios. Besides, the hard version of MIB, i.e., with the momentum mechanism removed by setting ε to 0, improves FAR@95 by about 4% while sacrificing the AUROC, AUPR, and OSDR.
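Because Eq. (6.4) itself appears earlier in the chapter, the snippet below is only a schematic sketch of how the momentum factor ε interpolates between the variants compared here (ε = 0 hard, ε = 0.99 soft, ε = 1.0 no re-weighting); the binning of samples by their gradient-magnitude weights into M = 50 bins follows the settings quoted in Sec. 6.4.1, but the exact per-bin statistic and the inverse-density weighting are assumptions rather than the actual MIB definition.

import numpy as np

def momentum_bin_reweight(omega, running_counts, epsilon=0.99, num_bins=50):
    """Schematic momentum-updated, bin-balanced sample re-weighting (an assumption, not Eq. (6.4)).

    omega:          per-sample gradient-magnitude weights for the current batch
    running_counts: per-bin statistics carried across training iterations
    epsilon:        momentum; 0 = hard update, 0.99 = soft update, 1.0 = frozen (no re-weighting)
    """
    omega = np.asarray(omega, dtype=float)
    edges = np.linspace(omega.min(), omega.max() + 1e-8, num_bins + 1)
    bin_idx = np.clip(np.digitize(omega, edges) - 1, 0, num_bins - 1)
    batch_counts = np.bincount(bin_idx, minlength=num_bins).astype(float)
    running_counts = epsilon * running_counts + (1.0 - epsilon) * batch_counts
    weights = 1.0 / (running_counts[bin_idx] + 1e-8)   # down-weight samples in crowded bins
    weights /= weights.mean()                          # keep the overall loss scale unchanged
    return weights, running_counts

running = np.zeros(50)
batch_weights, running = momentum_bin_reweight(np.random.rand(128), running, epsilon=0.99)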
Table 6.4 Results of Different Re-weightings (%). MIB (hard) means the momentum factor ε = 0 in Eq. (6.4) such that the sample weight is updated in a hard manner, MIB (soft) sets ε to 0.99 to enable a soft update, and wo. Re-weight means ε = 1.0.
Methods          FAR@95 (↓)  AUROC  AUPR   OSDR
wo. Re-weight       77.20     76.41  56.65  12.10
Focal [197]         91.05     56.67  35.55   2.04
GHM [170]           78.33     73.52  54.03   1.41
IB [254]            80.23     75.91  58.00   2.18
MIB (hard)          66.34     78.16  57.66  38.90
MIB (soft)          70.96     78.33  58.62  42.91

Choices of Scoring Function. The scoring function is critical to identify the known and unknown actions, as well as the background frames, in model inference. In addition to the proposed two-level decision of Eq. (6.11), we compare it with four reasonable one-level decision methods utilizing the actionness a_i and the uncertainty u_i. The results in Table 6.5 show that using the maximum classification confidence (the 1st row) or other compositions of u_i and a_i (the 2nd and 3rd rows) cannot achieve favorable performance. The proposed method (the last row) is slightly better than the product of u_i and a_i (the 4th row) with comparable FAR@95 performance. Though there are certainly other alternatives, our scoring function achieves the best performance while maintaining a good decision-making explanation: the foreground actions are identified first by a_i, based on which the known and unknown actions are further distinguished by u_i.

Table 6.5 Scoring Functions. It shows that, when conditioned on a_i > 0.5, the uncertainty u_i is the best scoring function for the OSTAL task.
Scoring Functions                       FAR@95 (↓)  AUROC  AUPR   OSDR
P(x_i) = 1 − max_j (α_ij / S_i)            77.90     59.50  35.82  31.38
P(x_i) = u_i / (1 − a_i)                   79.16     61.94  38.52  30.64
P(x_i) = a_i / (1 − u_i)                   90.39     72.71  56.19  38.24
P(x_i) = u_i · a_i                         70.64     77.52  58.17  42.44
P(x_i | a_i > 0.5) = u_i                   70.96     78.33  58.62  42.91

Distributions of Actionness and Uncertainty. To show the quality of the learned actionness and uncertainty, we visualize their distributions on the test set in Fig. 6.4. Specifically, the dominant modes in Fig. 6.4a show that foreground actions are mostly assigned high actionness while the background frames receive low actionness, and the dominant modes in Fig. 6.4b show that the actions of known classes are mostly assigned low uncertainty while those of the unknowns receive high uncertainty. These observations align well with the expectations of our OpenTAL method.

Figure 6.4 Distributions of Actionness and Uncertainty. (a) Actionness; (b) Uncertainty. The two figures show significant separation between the foreground actions and background frames by the actionness score, as well as the separation between the known and unknown actions by the uncertainty.

Qualitative Results. Fig. 6.5 shows the qualitative results of the proposed OpenTAL and the baseline approaches. The three video samples are from the THUMOS14 dataset. The results clearly show that OpenTAL is superior to the baselines in terms of both recognizing the known actions (colored segments in the 1st video) and rejecting the unknown actions (black segments in the 2nd and 3rd videos).
Limitations. We note that none of these methods shows remarkably high OSDR performance, which indicates the challenging nature of the OSTAL task and that large room for improvement remains in OpenTAL.

6.5 Conclusion
In this paper, we introduce the Open Set Temporal Action Localization (OSTAL) task.
It aims to simultaneously localize and recognize human actions, and to reject the unknown actions, from untrimmed videos in an open world. The unique challenge lies in discriminating between known and unknown actions as well as background video frames. To this end, we propose a general OpenTAL framework to enable existing TAL models for the OSTAL task. OpenTAL predicts the locations, the classifications with uncertainties, and the actionness to jointly achieve this goal. For comprehensive OSTAL evaluation, the Open Set Detection Rate is introduced. OpenTAL is empirically demonstrated to be effective and significantly outperforms existing baselines. We believe the generality of the OpenTAL design could inspire relevant research fields, such as spatiotemporal action detection, video object detection, and video grounding, toward open set scenarios.

Figure 6.5 Qualitative Results. We show the actions of unknown classes in black, while the remaining colors are actions of known classes. The x-axis represents the timestamps (seconds).

6.6 Supplementary Material
In this section, we provide the detailed proof of the gradient of the EDL loss (Sec. 6.6.1), the dataset description of the open set setting (Sec. 6.6.2), implementation details (Sec. 6.6.3), and additional results and discussions (Sec. 6.6.4).
6.6.1 Gradient of EDL
Given the DNN logits z_i ∈ R^K of sample x_i, an evidence function defined by exp is applied to the logits to get the class-wise evidence prediction, i.e., e_i = exp(z_i). Following the maximum likelihood loss form of Evidential Deep Learning (EDL) [283], we have the EDL loss:
\mathcal{L}^{(i)}_{\text{EDL}}(\boldsymbol{\alpha}_i) = \sum_{j=1}^{K} t_{ij}\big(\log(S_i) - \log(\alpha_{ij})\big),   (6.13)
where t_{ij} = 1 iff the class label y_i = j, and otherwise t_{ij} = 0. The total Dirichlet strength is S_i = \sum_j \alpha_{ij} and the class-wise strength is \boldsymbol{\alpha}_i = \mathbf{e}_i + 1. Therefore, according to the simple chain rule, we have the partial derivative
\frac{\partial \alpha_{ij}}{\partial z_{ij}} = \frac{\partial \alpha_{ij}}{\partial e_{ij}} \cdot \frac{\partial e_{ij}}{\partial z_{ij}} = e_{ij}.   (6.14)
Then, the gradient of the j-th entry in Eq. (6.13), i.e., \mathcal{L}^{(ij)}_{\text{EDL}}, w.r.t. the logit z_{ij} can be derived as follows:
g_{ij} = \frac{\partial \mathcal{L}^{(ij)}_{\text{EDL}}}{\partial z_{ij}} = t_{ij}\left[\frac{1}{S_i}\frac{\partial S_i}{\partial z_{ij}} - \frac{1}{\alpha_{ij}}\frac{\partial \alpha_{ij}}{\partial z_{ij}}\right] = t_{ij}\left[\frac{1}{S_i}\sum_{k=1}^{K}\frac{\partial \alpha_{ik}}{\partial z_{ij}} - \frac{1}{\alpha_{ij}}\frac{\partial \alpha_{ij}}{\partial z_{ij}}\right] = t_{ij}\left[\frac{1}{S_i}\sum_{k=1}^{K} e_{ik} - \frac{e_{ij}}{\alpha_{ij}}\right].   (6.15)
Considering that S_i = \sum_k \alpha_{ik} = \sum_k e_{ik} + K and the evidential uncertainty u_i = K/S_i, we further simplify g_{ij} as follows:
g_{ij} = t_{ij}\left[\frac{S_i - K}{S_i} - \frac{\alpha_{ij} - 1}{\alpha_{ij}}\right] = t_{ij}\left[\frac{1}{\alpha_{ij}} - u_i\right] = t_{ij}\left[\frac{S_i - K\alpha_{ij}}{S_i\,\alpha_{ij}}\right],   (6.16)
which proves the equation of g_{ij} in our main paper. From this conclusion, considering that \alpha_{ij} \in (1, \infty) and u_i \in (0, 1), we have the property |g_{ij}| \in [0, 1). Furthermore, considering the last DNN layer parameters \mathbf{w} \in \mathbb{R}^{D \times K} such that \mathbf{z}_i = \mathbf{w}^{T}\mathbf{h}_i, where \mathbf{h}_i \in \mathbb{R}^{D} is the high-dimensional feature of x_i, we can derive the gradient of the EDL loss w.r.t. the parameters \mathbf{w}:
\nabla_{\mathbf{w}}\mathcal{L} = \frac{\partial \mathcal{L}^{(ik)}_{\text{EDL}}}{\partial w_{dk}} = \frac{\partial \mathcal{L}^{(ik)}_{\text{EDL}}}{\partial z_{ik}} \cdot \frac{\partial z_{ik}}{\partial w_{dk}} = g_{ik} \cdot h_{id},   (6.17)
where w_{dk} and h_{id} are elements of the matrix \mathbf{w} and the vector \mathbf{h}_i.
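For a quick sanity check of the closed form, the snippet below evaluates g_ij from the last expression of Eq. (6.16) on a toy batch and confirms the stated property |g_ij| ∈ [0, 1); the tensor shapes and random inputs are assumptions for illustration only.

import torch
import torch.nn.functional as F

def edl_gradient_closed_form(logits, labels):
    """g_ij = t_ij * (S_i - K * alpha_ij) / (S_i * alpha_ij), the last form of Eq. (6.16)."""
    evidence = torch.exp(logits)                    # e_i = exp(z_i)
    alpha = evidence + 1.0                          # class-wise Dirichlet strength
    S = alpha.sum(dim=1, keepdim=True)              # total strength S_i
    K = logits.shape[1]
    t = F.one_hot(labels, num_classes=K).float()    # one-hot targets t_ij
    g = t * (S - K * alpha) / (S * alpha)
    u = (K / S).squeeze(1)                          # evidential uncertainty u_i = K / S_i
    return g, u

logits = torch.randn(4, 10)                         # toy batch with K = 10 classes
labels = torch.randint(0, 10, (4,))
g, u = edl_gradient_closed_form(logits, labels)
assert g.abs().max().item() < 1.0                   # the property |g_ij| in [0, 1)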
Similar to [254], we consider the influence function [146] by ignoring the inverse of the Hessian and using the magnitude (L1 norm) of the gradient:
\omega_i = \|\nabla_{\mathbf{w}}\mathcal{L}\|_1 = \sum_{k=1}^{K}\sum_{d=1}^{D} |g_{ik} \cdot h_{id}| = \Big(\sum_{k=1}^{K} |g_{ik}|\Big)\Big(\sum_{d=1}^{D} |h_{id}|\Big) = \|\mathbf{g}_i\|_1 \cdot \|\mathbf{h}_i\|_1,   (6.18)
which proves the equation of ω_i in our main paper.
6.6.2 Dataset Details
To adapt the existing Temporal Action Localization (TAL) datasets such as THUMOS14 [128] and ActivityNet1.3 [22] to the open set TAL setting, a subset of action categories has to be reserved as the unknown used in open set testing. In practice, we randomly split THUMOS14 three times into known and unknown subsets of categories. For each split, a model is trained on the closed set (which only contains known categories) and tested on the open set that contains both known and unknown categories. Table 6.6 shows the detailed information of the three dataset splits from THUMOS14.

Table 6.6 THUMOS14 Splits for Open Set TAL. For each split, five out of twenty action categories are randomly selected as the unknown (U) used in open set testing, while the remaining fifteen categories are the known (K) used in model training.
Category            Split 1   Split 2   Split 3
BaseballPitch          K         K         K
BasketballDunk         K         K         K
Billiards              K         K         K
CricketBowling         K         U         K
CricketShot            K         K         U
FrisbeeCatch           K         K         K
GolfSwing              K         K         K
HammerThrow            K         U         K
HighJump               K         K         K
JavelinThrow           K         U         U
PoleVault              K         K         U
Shotput                K         K         U
TennisSwing            K         K         K
ThrowDiscus            K         K         K
VolleyballSpiking      K         K         K
CleanAndJerk           U         K         K
CliffDiving            U         U         K
Diving                 U         U         K
LongJump               U         K         U
SoccerPenalty          U         K         K

To further increase the openness in testing, we incorporate activity categories from ActivityNet1.3 that do not overlap with THUMOS14 into the open set testing. Specifically, the following 14 overlapping activity categories are removed: Table soccer, Javelin throw, Clean and jerk, Springboard diving, Pole vault, Cricket, High jump, Shot put, Long jump, Hammer throw, Snatch, Volleyball, Plataform diving, Discus throw. Note that we did not use ActivityNet1.3 for model training in the same manner as THUMOS14, e.g., training a model on multiple random splits of ActivityNet1.3, due to limited computational resources.

Table 6.7 AUROC Results (%) vs. Different tIoU Thresholds. Models trained on the THUMOS14 closed set are tested by including the unknown classes from THUMOS14 and ActivityNet1.3, respectively. Results are averaged over the three dataset splits.
                        THUMOS14 as the Unknown                 ActivityNet1.3 as the Unknown
Methods          0.3     0.4     0.5     0.6     0.7    Avg.     0.5     0.75    0.95    Avg.
SoftMax         54.70   55.46   56.41   57.12   57.11   56.16   56.97   58.41   55.97   57.77
OpenMax [16]    53.26   52.13   51.89   52.53   52.38   52.39   51.24   52.1    49.13   51.59
EDL [316]       64.05   64.27   65.13   66.21   66.81   65.29   62.82   66.23   67.92   65.69
OpenTAL         78.33   79.04   79.30   79.40   79.82   79.18   82.97   83.21   83.38   83.22

Table 6.8 AUPR Results (%) vs. Different tIoU Thresholds. Models trained on the THUMOS14 closed set are tested by including the unknown classes from THUMOS14 and ActivityNet1.3, respectively. Results are averaged over the three dataset splits.
                        THUMOS14 as the Unknown                 ActivityNet1.3 as the Unknown
Methods          0.3     0.4     0.5     0.6     0.7    Avg.     0.5     0.75    0.95    Avg.
SoftMax         31.85   31.81   31.11   29.78   27.99   30.51   53.54   44.15   34.54   44.77
OpenMax [16]    33.17   31.61   30.59   29.15   28.45   30.60   54.88   48.37   40.07   48.48
EDL [316]       40.05   39.45   38.05   37.58   36.35   38.30   53.97   47.22   45.59   48.46
OpenTAL         58.62   59.40   58.78   57.54   55.88   58.04   80.41   74.20   73.92   75.54
Table 6.9 OSDR Results (%) vs. Different tIoU Thresholds. Models trained on the THUMOS14 closed set are tested by including the unknown classes from THUMOS14 and ActivityNet1.3, respectively. Results are averaged over the three dataset splits.
                        THUMOS14 as the Unknown                 ActivityNet1.3 as the Unknown
Methods          0.3     0.4     0.5     0.6     0.7    Avg.     0.5     0.75    0.95    Avg.
SoftMax         23.40   25.19   27.43   29.97   32.08   27.61   27.63   33.73   31.59   32.01
OpenMax [16]    13.66   14.58   15.91   17.71   20.41   16.45   15.73   21.49   18.07   19.35
EDL [316]       36.26   37.58   39.16   41.18   42.99   39.43   38.56   43.72   42.20   42.18
OpenTAL         42.91   46.19   49.50   52.50   56.78   49.57   50.49   59.87   62.17   57.89

6.6.3 Implementation Details
Detailed Architecture. The proposed OpenTAL is primarily implemented on the AFSD [192] framework. It uses a pre-trained I3D [27] model as the feature extraction backbone, and a 6-layer temporal FPN architecture is applied to the I3D features for action classification and localization. Each level consists of a coarse stage, a saliency-based proposal refinement module, and a refined stage. The first two pyramid levels use 3D convolutional (Conv3D) blocks, while the remaining four levels use 1D convolutional (Conv1D) blocks. Group Normalization and ReLU activation are utilized in each block. The temporal localization head and the action classification head are implemented by a Conv1D block shared across all six levels. To implement the OpenTAL method, the (K + 1)-way classification head is replaced with a K-way evidential neural network head, while the localization head is kept unchanged. We additionally add an actionness prediction branch, which consists of a Conv1D block, for both the coarse and the refined stages.
Training and Testing. In training, the proposed classification loss L_MIB-EDL and actionness prediction loss L_ACT are applied to both the coarse and refined stages of AFSD, while the IoU-aware uncertainty calibration loss L_IoUC is only applied to the refined stage, because this loss function depends on the temporal IoU pre-computed from the action locations predicted in the coarse stage. Similar to AFSD, we use a temporal IoU threshold of 0.5 in training to identify the foreground actions from the proposals. Besides, we reduce the weight of the triplet loss in AFSD to 0.001, since this contrastive learning loss does not work well when there are unknown action clips in the background. The whole model is trained with the Adam optimizer, with a base learning rate of 1e-5 and a weight decay of 1e-3. All models are trained for 25 epochs to ensure full convergence, and the model snapshot of the last epoch is used for testing and evaluation. In testing, the actionness score is multiplied with the confidence score before the soft-NMS post-processing module. The σ and top-N hyperparameters are set to 0.5 and 5000, as recommended by AFSD.
6.6.4 Additional Results
Impact of tIoU Thresholds. Since the proposed OSTAL task concerns not only classification but also temporal localization, we present the experimental results under different temporal IoU (tIoU) thresholds. Following the existing TAL literature, we set five tIoU thresholds [0.3 : 0.1 : 0.7] when the unknown classes are from THUMOS14, and ten tIoU thresholds [0.5 : 0.05 : 0.95] when the unknown classes are from ActivityNet1.3. Evaluation results by AUROC, AUPR, and OSDR are reported in Tables 6.7, 6.8, and 6.9, respectively. The results show that the AUROC performance is stable across different tIoU thresholds, while the AUPR and OSDR performance varies significantly as the tIoU threshold changes. Besides, as the tIoU threshold increases, the AUROC and OSDR values increase accordingly. For all these tIoU thresholds and evaluation metrics, the proposed OpenTAL consistently outperforms the baselines.
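The testing pipeline described in Sec. 6.6.3, which multiplies the actionness score into the classification confidence before soft-NMS, can be sketched as below; the Gaussian decay form and the toy segments are assumptions for illustration and do not reproduce the exact AFSD post-processing code.

import numpy as np

def temporal_iou(seg, segs):
    """tIoU between one segment and an array of segments, each given as (start, end)."""
    inter = np.maximum(0.0, np.minimum(seg[1], segs[:, 1]) - np.maximum(seg[0], segs[:, 0]))
    union = (seg[1] - seg[0]) + (segs[:, 1] - segs[:, 0]) - inter
    return inter / np.maximum(union, 1e-8)

def soft_nms_with_actionness(segments, confidence, actionness, sigma=0.5, top_n=5000):
    """Rescale confidence by actionness, then apply a Gaussian soft-NMS decay."""
    scores = confidence * actionness               # fuse the actionness into the score
    segments = segments.copy()
    kept_segments, kept_scores = [], []
    while scores.size > 0 and len(kept_segments) < top_n:
        i = int(np.argmax(scores))
        kept_segments.append(segments[i])
        kept_scores.append(scores[i])
        segments = np.delete(segments, i, axis=0)
        scores = np.delete(scores, i)
        if scores.size > 0:
            decay = np.exp(-(temporal_iou(kept_segments[-1], segments) ** 2) / sigma)
            scores = scores * decay                # softly suppress overlapping proposals
    return np.array(kept_segments), np.array(kept_scores)

segs = np.array([[0.0, 5.0], [0.5, 5.5], [10.0, 14.0]])
out_segs, out_scores = soft_nms_with_actionness(
    segs, confidence=np.array([0.9, 0.8, 0.7]), actionness=np.array([0.95, 0.6, 0.9]))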
CHAPTER 7
COMPOSITIONAL ZERO-SHOT LEARNING
7.1 Introduction
Compositional visual recognition is a fundamental characteristic of human intelligence [164], but it is challenging for modern deep learning systems. For example, humans can easily recognize unseen sliced tomatoes after seeing sliced potatoes and red tomatoes. Such a compositional zero-shot learning (CZSL) capability is valuable in that novel visual concepts from a huge combinatorial semantic space can be recognized without "seeing" any of their training data. For example, the C-GQA [234] dataset contains 413 states and 674 objects. This implies a total of at least 278K compositional classes in an open world, while only 2% of them are accessible in training. Therefore, CZSL can significantly reduce the need for large-scale training data.
Traditional vision-based methods either directly learn the visual features of compositions, or try to first decompose the visual data into representations of simple primitives, i.e., states and objects, and then learn to re-compose the compositions [226, 5, 415, 119, 135, 321, 234, 396, 220, 174]. Thanks to the recent large pre-trained vision-language models (VLM) such as CLIP [261], state-of-the-art CZSL methods have been developed [239, 214, 352, 116]. For instance, CSP [239] inherits the hard prompt template of CLIP, i.e., a photo of [state] [object], where only the embeddings of the states and objects are trained. The following methods [214, 352, 116] use the soft prompt introduced in CoOp [409], where the embeddings of the prompt template are jointly optimized, leading to better CZSL performance. The impressive performance of CLIP-based CZSL methods benefits from the sufficiently good feature alignment between the image and text modalities, and from the prompting techniques for adapting the aligned features to recognizing compositional classes.
This chapter is adapted from the following publication: "Wentao Bao, Lichang Chen, Heng Huang, and Yu Kong. Prompting language-informed distribution for compositional zero-shot learning. In Proceedings of the European Conference on Computer Vision (ECCV), 2024."
Figure 7.1 Challenges of compositional recognition. (a) Images of the same compositional class appear differently due to diverse visual backgrounds or foregrounds. (b) red tomatoes and sliced tomatoes are visually correlated because 1) both share the tomatoes object, and 2) the object tomatoes is inherently entangled with the state red, resulting in the need for primitive decomposition.
Despite the success of existing CLIP-based methods, we find several key considerations to prompt the pre-trained CLIP for better CZSL modeling. First, the diversity and informativeness of prompts are both important to distinguish between compositional classes. CZSL can be treated as zero-shot learning on fine-grained categories, which requires a fine-grained context to prompt the CLIP model [261, 215]. However, to contextualize a class with fine granularity, the hard prompt in [261] suffers from the heuristic design of prompt templates, and a single prompt for each class lacks the diversity to capture the intra-class variance of the visual data (Fig. 7.1a).
Though the ProDA [215] proposes to learn a collection of prompts that formulate class-specific distribution to address the diversity, the lack of language informativeness in their prompts limits their performance on fine- grained compositional categories. Second, the entanglement between visual primitives, e.g., red and tomatoes in Fig. 7.1b, incurs difficulty in learning decomposable visual representations that are useful for compositional generalization [209, 135], while such a capability is missing in [239, 352]. Though the more recent work [214, 116] learn to decompose the primitives and considers the re-composed compositional predictions, their language-only decomposition and probability-level mixup potentially limit the generalizability in the open-world. In this paper, we propose a novel CLIP-based method for the CZSL task by prompting the language-informed distributions (PLID) over both the compositional and primitive categories. To learn the diverse and informative textual class representations, the PLID leverages off-the-shelf large language models (LLM) to build the class-specific distributions and to enhance the class embeddings. Furthermore, we propose a visual language primitive decomposition (VLPD) module 125 red tomatoesred apple (on the tree)red apple (on the plate)⋯(a pair of black) polyester sandals⋯(a brown) polyester sandaldiverse backgrounddiverse foregroundsliced tomatoes⋯(a) intra-class variety(b) inter-class correlation to decompose the image data into simple primitives for recognition of state and objects. Eventually, the compositional classification is performed by fusing the decisions from both the compositional and primitive spaces. The proposed PLID shows state-of-the-art performance on CZSL benchmarks such as MIT-States [121], UT-Zappos [374], and C-GQA [234]. Note that our method is orthogonal to the existing hard prompt [261], soft prompt tuning [409], and prompt distribution learning [215, 161, 208, 60]. We advocate prompting the distribution of informative LLM-based class descriptions. From a classification perspective, this is grounded on the classification-by-description [223, 222, 358, 107], that LLM-generated text enables more infor- mative class representations. Compared to the deterministic soft or hard prompt aforementioned, our distribution modeling could capture the intra-class diversity and inter-class correlation for better zero-shot generalization. Compared to the existing prompt distribution learning approaches, the class context is more linguistically interpretable and provides fine-grained descriptive information about the class. Our method is also parameter-efficient without the need to optimize a large collec- tion of prompts. Specific to the CZSL task, the enhanced class embeddings by LLM descriptions enable visual language primitive decomposition and decision fusion in both compositional and primitive space, which eventually benefits the generalization to the unseen. In summary, the contributions are as follows. (i) We develop a PLID method that advo- cates prompting the language-informed distribution for compositional zero-shot learning, which is orthogonal to existing soft or hard prompting and distributional prompt learning. (ii) We pro- pose primitive decomposition with stochastic logit mixup to fuse the classification decision from compositional and primitive predictions. 
(iii) We empirically show that PLID could achieve su- perior performance to prior arts in both the closed-world and open-world settings on MIT-States, UT-Zappos, and C-GQA datasets. 7.2 Related Work Prompt Learning in VLM. Vision-Language Models (VLM) such as the CLIP [261] pre- trained on web-scale datasets recently gained substantial attention for their strong zero-shot recog- nition capability on various downstream tasks. Such a capability is typically achieved by performing 126 prompt engineering to adapt pre-trained VLMs. Early prompting technique such as the hard prompt in CLIP uses the heuristic template “a photo of [CLS]” as the textual input. Recently, the soft prompt tuning method in CoOp [409], CoCoOp [408], and ResPT [266] that uses learnable embed- ding as the textual context of class names significantly improved the model adaptation performance. This technique is further utilized in MaPLe [138] that enables multi-modal prompt learning for both image and text. However, the prompts of these methods are deterministic and lack the diversity to capture the appearance variety in fine-grained visual data, so they are prone to overfitting the training data. To handle this issue, ProDA [215] explicitly introduces a collection of soft prompts to construct the class-specific Gaussian distribution, which results in better zero-shot performance and inspires the recent success of PPL [161] in the dense prediction task. Similarly, the PBPrompt [208] uses neural networks to predict the class-specific prompt distribution and utilizes optimal transport to align the stochastically sampled soft prompts and image patch tokens. The recent work [60] assumes the latent embedding of prompt input follows a Gaussian prior and adopts variational infer- ence to learn the latent distribution. In this paper, in order to take the merits of the informativeness of hard prompt and the diversity of distributional modeling, we adopt the soft prompt to adapt the distributions supported by LLM-generated class descriptions. Compositional Zero-Shot Learning (CZSL). For a long period, the CZSL task has been studied from a vision-based perspective in literature. They either directly learn the compositional visual features or disentangle the visual features into simple primitives, i.e., , states and objects. For example, [237, 185, 234] performs a direct classification by projecting the compositional visual features into a common feature space, and [213, 226, 5, 119, 415, 135, 209] decompose the visual feature into simple primitives so that the compositional recognition can be achieved by learning to recompose from the primitives. Though the recent large-scale pre-trained CLIP model shows impressive zero-shot capability, it is found to struggle to work well for compositional reasoning [217, 382, 169]. Thanks to the recent prompt learning [409], the CZSL task has been dominated by CLIP-based approaches [239, 214, 352, 116, 186, 404]. The common idea is to prompt the frozen CLIP model to separately learn the textual embeddings of simple primitives, 127 which empirically show strong compositionality for zero-shot generalization. Different to [404, 186] that develop primitive adapters and [214, 352, 116] that use learnable prompts for deterministic vision-language alignment, our method takes the benefit of learnable prompt and LLM-generated text for distributional alignment, addressing the importance of diversity and informativeness for zero-shot generalization. 7.3 Preliminaries CZSL Task Formulation. 
The CZSL task aims to recognize images of a compositional category 𝑦 ∈ C, where the semantic space C is a Cartesian product between the state space S = {𝑠1, . . . , 𝑠|S|} and object space O = {𝑜1, . . . , 𝑜|O|}, i.e., , C = S × O. For example, as shown in Fig. 7.1, a model trained on images of red apple and sliced tomatoes needs to additionally recognize an image of sliced apple. In training, only a set of seen compositions is available. In closed-world testing, the model needs to recognize images from both the seen compositions in C (𝑠) and the unseen compositions in C (𝑢) that are assumed to be feasible, where the cardinality |C (𝑠) ∪ C (𝑢) | ≪ |C| since most of the compositions in C are practically not feasible. In open-world testing, the model needs to recognize images given any composition in C. VLMs for CZSL. Large pre-trained VLMs such as CLIP [261] have recently been utilized by CSP [239] for the CZSL task. The core idea of CSP is to represent the text embeddings of states in S and objects in O as learnable parameters and contextualize them with the hard prompt template “a photo of [s] [o]” as the input of the CLIP text encoder, where [s] ∈ S and [o] ∈ O. Given an image x, by using the cosine similarity (cos) as the logit, the class probability of the composition 𝑦 is defined as 𝑝𝜽 (𝑦|x) = softmax(cos(v, t𝑦)), where 𝜽 are the |S| + |O| learnable parameters, v and t𝑦 are the image feature and class text embedding, respectively. In training, the prediction 𝑝𝜽 ( ˆ𝑦|x) is supervised by multi-class cross-entropy loss. In CZSL testing, a test image is recognized by finding the compositional class 𝑐 ∈ C which has the maximum cos(v, t𝑐). The CSP method is simple, parameter-efficient, and it largely outperforms traditional approaches. However, due to the lack of diversity and informativeness in prompting, the zero-shot capability of CLIP is not fully exploited by CSP for the CZSL task. 128 Figure 7.2 Overview of PLID. The model is developed for the CZSL task by aligning the semantics of image x (e.g., , image on the right) and compositional class 𝑦 = (𝑠, 𝑜) (e.g., , “red apple”) via a frozen CLIP [261]. It constructs language-informed text distributions in both compositional and primitive (attribute and object) spaces (middle part) by soft prompting and LLM-generated class descriptions (left part). The features of the image and text are enhanced by text and visual feature enhancement (TFE and VFE). Eventually, the compositional decisions from the two spaces are fused as the prediction. 7.4 Proposed Method Overview. Fig. 7.2 shows an overview of the PLID. The basic idea is to use LLMs to generate sentence-level descriptions for each compositional class, and learn to prompt the class-wise text distributions (supported by the descriptions) to be aligned with image data. Besides, we introduce visual language primitive decomposition (VLPD) and stochastic logit mixup (SLM) to enable recognition at both compositional and primitive levels. In testing, an image is recognized by fusing the decisions from the directly predicted and the recomposed compositions. 7.4.1 Prompting Language-Informed Distribution Motivation. To adapt the large pre-trained CLIP [261] to downstream tasks, recent distributional prompt learning [215, 161, 208, 60] shows the importance of context diversity by distribution modeling for strong generalization. 
Motivated by the inherent fine-granularity of compositional recognition in the CZSL task, we argue that not only the context diversity but also the context informativeness by language modeling, are both important factors to adapt CLIP to the zero-shot learning task. The insight behind this is that the sentence-level descriptions could contextualize compositional classes in a more fine-grained manner than the prior arts. Therefore, we propose to address the two factors by learning to Prompt the Language-Informed Distributions (PLID) for the 129 compositional space[1]The photo shows a redapple.[2]A redapple is pictured.…[M]An apple in the photo is red.TFEsoft-prompted compositional embeddingtext encoder (frozen)compositional classesred appleTextEncoderImageEncoderoldwetcatdogmudblackstateobjectDec. TextredappleVFEDec. ImageAugmentcomposition fusion⋯(red,apple)(red,wine) ⋯(sliced,tomatoes) ⋯LLM compositional descriptionsprimitive space“red apple”Large Language Modelfrozenlearnable[red]#!#"##⋯[apple][red]&!&"&#⋯[apple]⋯(red,apple)(red,wine)⋯(sliced,tomatoes)⋯list of classes CZSL task. Compositional Class Description. To generate diverse and informative text descriptions for each compositional class, we adopt a similar way as [223] by prompting an LLM that shows instruction-following capability. An example below shows the format of the LLM instruction. Keywords: sliced, potato, picture Output: The picture features a beautifully arranged plate of thinly sliced potatoes. For each composition 𝑦 = (𝑠, 𝑜), we generate 𝑀 descriptions denoted as 𝑆(𝑦) = {𝑆(𝑦) 1 where 𝑆(𝑦) 𝑚 is a linguistically complete sentence. Different to [223] that aims to interpret the , . . . , 𝑆(𝑦) 𝑀 } zero-shot recognition by attribute phrases from LLMs, we utilize the LLM-based sentence-level descriptions in the CZSL task for two benefits: (i) provide diverse and informative textual context for modeling the class distributions, and (ii) enhance the class embedding with fine-grained descriptive information. Language-Informed Distribution (LID). For both the image and text modalities, we use the frozen CLIP model and learnable feature enhancement modules to represent the visual and language features, which are also adopted in existing CZSL literature [214, 116]. Specifically, for the text modality, each composition 𝑦 is tokenized and embedded by CLIP embedding layer and further prompted by concatenating with learnable context vectors, i.e., , “[p1] . . . [p𝐿] [s] [o]”, where p1:𝐿 is initialized by “a photo of” and shared with all classes. Fol- lowed by the frozen CLIP text encoder E𝑇 , the embedding of class 𝑦 is q𝑦 = E𝑇 ( [p1] . . . [p𝐿] [s] [o]) where q𝑦 ∈ R𝑑. Following the CZSL literature [352, 214], here the soft prompt p1:𝐿 and primitive embeddings [s] [o] are learnable while E𝑇 is frozen in training. To simultaneously address the lack of diversity and informativeness of the soft prompts, we propose to formulate the class-specific distributions supported by the texts 𝑆(𝑦) and learn to prompt these distributions. Specifically, we encode 𝑆(𝑦) by the frozen CLIP text encoder: D(𝑦) = E𝑇 (𝑆(𝑦)), where D(𝑦) ∈ R𝑀×𝑑. Then, we use D(𝑦) to enhance q𝑦 by t𝑦 = ΨTFE(q𝑦, D(𝑦)) where ΨTFE is the text feature enhancement (TFE) implemented by a single-layer cross attention Transformer [328]. 130 Figure 7.3 Prompting for intra- and inter-class covariance optimization. Similarly, given an image x, to mitigate the loss of fine-grained cues, we augment it with 𝑁 views to be X = {x(1), . . . , x(𝑁) }. 
The image x and its augmented views X are then encoded by the frozen CLIP visual encoder E_V, and the feature of x is enhanced by v = Ψ_VFE(E_V(x), E_V(X)), where Ψ_VFE is the visual feature enhancement (VFE) by cross attention [328], implemented with the same structure as TFE for simplicity. We treat the enhanced text feature t_y of class y as the class mean and t_y + D^(y) as the distribution support points (DSP) that follow the Gaussian N(t_y, Σ_y), where Σ_y is the text variance of class y. The motivation of t_y + D^(y) is to give the DSP the flexibility to traverse the d-dimensional space during training, since t_y is trainable while D^(y) is pre-computed. For all |C^(s)| (denoted as C) seen compositional classes, we build joint Gaussians N(μ_{1:C}, Σ_{1:C}) similar to ProDA [215], where the means μ_{1:C} ∈ R^{C×d} are given by t_y over the C classes, and the covariance Σ_{1:C} ∈ R^{d×C×C} is defined across the C classes for each feature dimension from the DSP.
Discussions. Compared to ProDA [215], which learns a collection of non-informative prompts, our DSPs are language-informed by D^(y), which provides more fine-grained descriptive information to help recognition and decomposition. Besides, our method is more parameter-efficient than ProDA since we only have a single soft prompt to learn. This is especially important for the CZSL task, where there is a huge number of compositional classes. Lastly, we highlight the benefit of the intra- and inter-class covariance optimization induced by the learning objective of distribution modeling, which is introduced below.
Learning Objective. Given the visual feature v ∈ R^d of image x and the text embeddings t_{1:C} from the class-wise joint distributions N(μ_{1:C}, Σ_{1:C}), minimizing the cross-entropy loss is equivalent to minimizing the upper bound of the negative log-likelihood (NLL):
-\log \mathbb{E}_{\mathbf{t}_{1:C}}\, p(y \mid \mathbf{v}, \mathbf{t}_{1:C}) \;\leq\; -\log \frac{\exp(h_y/\tau)}{\sum_{k=1}^{C}\exp\big((h_k + h^{(m)}_{k,y})/\tau\big)} := \mathcal{L}_y(\mathbf{x}, y),   (7.1)
where the compositional logit h_y = cos(v, t_y), the pairwise margin h^{(m)}_{k,y} = v^T A_{k,y} v / (2τ), and A ∈ R^{d×C×C} is given by A_{k,y} = Σ_{kk} + Σ_{yy} − Σ_{ky} − Σ_{yk}. The covariance A_{k,y} indicates the correlation between the k-th of the C classes and the target class y on each of the d feature dimensions. The insight of minimizing L_y(x, y) is illustrated in Fig. 7.3: it encourages minimizing the intra-class variance captured by Σ_{yy} and Σ_{kk}, and maximizing the inter-class separability indicated by Σ_{ky} and Σ_{yk}.
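To make the covariance-induced margin of Eq. (7.1) concrete, the sketch below builds A_{k,y} = Σ_kk + Σ_yy − Σ_ky − Σ_yk per feature dimension from a set of distribution support points, adds the resulting margin h^(m)_{k,y} = v^T A_{k,y} v / (2τ) to the non-target logits, and evaluates the cross-entropy; the shapes, the toy inputs, the plain sample-covariance estimate, and the temperature value are assumptions for illustration rather than the exact PLID implementation.

import torch
import torch.nn.functional as F

def margin_adjusted_loss(v, class_means, dsp, target, tau=0.01):
    """Upper-bound NLL in the spirit of Eq. (7.1) with the pairwise covariance margin.

    v:           (d,) enhanced image feature
    class_means: (C, d) class text embeddings t_y
    dsp:         (C, M, d) distribution support points t_y + D^(y)
    target:      index of the ground-truth composition y
    """
    C, M, d = dsp.shape
    centered = dsp - dsp.mean(dim=1, keepdim=True)                          # (C, M, d)
    # per-dimension cross-class covariance Sigma[d, k, y]
    sigma = torch.einsum('kmd,ymd->dky', centered, centered) / (M - 1)
    diag = torch.diagonal(sigma, dim1=1, dim2=2)                            # (d, C): Sigma_kk per dim
    A = diag.unsqueeze(2) + diag.unsqueeze(1) - sigma - sigma.transpose(1, 2)   # (d, C, C)
    margins = torch.einsum('d,dky,d->ky', v, A, v) / (2.0 * tau)            # h^(m)_{k,y}
    logits = F.cosine_similarity(v.unsqueeze(0), class_means, dim=1)        # h_k
    adjusted = logits + margins[:, target]
    adjusted[target] = logits[target]            # A_{y,y} = 0 by construction; kept for clarity
    return F.cross_entropy((adjusted / tau).unsqueeze(0), torch.tensor([target]))

v = F.normalize(torch.randn(16), dim=0)                   # toy image feature, d = 16
means = F.normalize(torch.randn(8, 16), dim=1)            # C = 8 toy class means
dsp = means.unsqueeze(1) + 0.05 * torch.randn(8, 4, 16)   # M = 4 support points per class
loss = margin_adjusted_loss(v, means, dsp, target=3)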
7.4.2 Primitives Decomposition for Fused Recognition
Motivation. Considering the fundamental challenge in the CZSL task that the visual primitives are inherently entangled in an image, an unseen composition in testing can hardly be identified if its object (or state) embedding is overfitted to the visual data of seen compositions. To this end, it is better to inherit the benefits of the decompose-recompose paradigm [415, 135, 209] by decomposing visual features into simple primitives, i.e., states and objects, from which a recomposed decision can be leveraged for zero-shot recognition. Thanks to the compositionality of CLIP [342, 325], this can be achieved by the visual-language primitive decomposition (VLPD); see Fig. 7.4 and the explanation below. Based on VLPD, we propose the stochastic logit mixup to fuse the directly learned compositions and the recomposed ones.
Figure 7.4 VLPD for recomposing. The image feature v is decomposed by f_s and f_o into the state and object logits h^(s) and h^(o), which are recomposed into the logit matrix H^(rc) against the text embeddings of the seen composition classes.
VLPD. Specifically, we use two parallel neural networks f_s and f_o to decompose v into the state visual feature f_s(v) and the object visual feature f_o(v), respectively. To obtain primitive-level supervision, given the training compositions C^(s) (the circle dots in Fig. 7.4), we group their enhanced embeddings {t_y} over the subset Y_o, in which all compositions share the same given object o (the vertical ellipses in Fig. 7.4), and over the subset Y_s, in which all compositions share the same given state s (the horizontal ellipses in Fig. 7.4). Thus, given a state s and an object o, the predicted state logit h_s and object logit h_o are computed by
h_s = \cos\Big(f_s(\mathbf{v}),\ \frac{1}{|\mathcal{Y}_s|}\sum_{y\in\mathcal{Y}_s} \mathbf{t}_y\Big), \qquad h_o = \cos\Big(f_o(\mathbf{v}),\ \frac{1}{|\mathcal{Y}_o|}\sum_{y\in\mathcal{Y}_o} \mathbf{t}_y\Big).   (7.2)
Different from DFSP [214], which only decomposes text features, we additionally use f_s and f_o to decompose the visual feature v, and we empirically show the superiority of performing both visual and language decomposition (see Table 7.6). Following the spirit of distribution modeling, we also introduce distributions over the state and object categories, where the corresponding DSP, denoted as D^(s) and D^(o), are obtained by grouping D^(y) over Y_s and Y_o, respectively. This leads to the following upper-bounded cross-entropy losses:
\mathcal{L}_s(x, s) = -\log \frac{\exp(h_s/\tau)}{\sum_{k=1}^{|\mathcal{S}|}\exp\big((h_k + h^{(m)}_{k,s})/\tau\big)}, \qquad \mathcal{L}_o(x, o) = -\log \frac{\exp(h_o/\tau)}{\sum_{k=1}^{|\mathcal{O}|}\exp\big((h_k + h^{(m)}_{k,o})/\tau\big)},   (7.3)
where h^{(m)}_{k,s} and h^{(m)}_{k,o} are determined in the same way as h^{(m)}_{k,y} in Eq. (7.1). In this way, the merits of language-informed distribution modeling, i.e., the inter- and intra-class covariance optimization constraints, are introduced into the primitive space for the fused recognition described below.
Composition Fusion. With the individually supervised f_s and f_o, we have p(y|v) = p(s|v) · p(o|v) by conditional independence, which induces p(y|v) ∝ exp((h_s + h_o)/τ). Therefore, the recomposed logit matrix H^(rc) ∈ R^{|S|×|O|} is a Cartesian (element-wise combinatorial) sum between h^(s) ∈ R^{|S|} and h^(o) ∈ R^{|O|}, i.e., H^(rc) = h^(s) ⊕ h^(o)^T, where h^(s) contains all state logits and h^(o) contains all object logits (the red and blue squares in Fig. 7.4). Given the recomposed logit h^(rc)_y ∈ H^(rc) and the directly learned compositional logit h_y from Eq. (7.1), we propose a stochastic fusion method in training by sampling a coefficient λ from a Beta
Moreover, compared to softmax probability mixup [116], our logit mixup avoids the limitation of softmax normalization over a huge number of compositional classes, that rich information of class relationship is lost after softmax normalization according to [8]. Such class relationships are even more important in the CZSL problem as indicated in [234]. 7.5 Experiments Datasets. We perform experiments on three CZSL datasets, i.e., , MIT-States [121], UT- Zappos [374], and C-GQA [234], following the standard splitting protocols in CZSL litera- ture [259, 239, 214]. MIT-States consists of 115 states and 245 objects, with 53,753 images in total. Following [259, 239, 214], it is split into 1,262 seen and 300/400 unseen compositions for training and validation/testing, respectively. UT-Zappos contains 16 states and 12 objects for 50,025 images in total, and it is split into 83 seen and 15/18 unseen compositions for training and validation/testing. C-GQA contains 453 states and 870 objects for 39,298 images, and it is split into 5,592 seen and 1,040/923 unseen compositions for training and validation/testing, respectively, resulting in 7,555 and 278,362 target compositions in closed- and open-world settings. Evaluation. We report the metrics in both closed-world (CW) and open-world (OW) settings, including the best seen accuracy (S), the best unseen accuracy (U), the best harmonic mean (H) between the seen and unseen accuracy, and the area under the curve (AUC) of unseen versus seen accuracy. For OW evaluation, following the CSP [239], we adopt the feasibility calibration by 134 Table 7.1 CZSL results of Closed- and Open-World settings on three datasets. Baseline results are from published literature except for ProDA. Note that “–” indicates no results reported by the PCVL paper or not applicable by ProDA for more than 278K compositional classes on the C-GQA dataset. Method MIT-States UT-Zappos C-GQA S U H AUC S U H AUC S U H AUC Closed Open 30.2 46.0 26.1 11.0 CLIP [261] CoOp [409] 34.4 47.6 29.8 13.5 ProDA1 [215] 37.4 51.7 32.7 16.1 46.6 49.9 36.3 19.4 CSP [239] 48.5 47.2 35.3 18.3 PCVL [352] 47.5 50.6 37.3 20.2 HPL [330] 46.9 52.0 37.3 20.6 DFSP [214] PLID 49.7 52.4 39.0 22.1 30.1 14.3 12.8 3.0 CLIP [261] CoOp [409] 34.6 9.3 12.3 2.8 ProDA1 [215] 37.5 18.3 17.3 5.1 46.3 15.7 17.4 5.7 CSP [239] 48.5 16.0 17.7 6.1 PCVL [352] 46.4 18.9 19.8 6.9 HPL [330] 47.5 18.5 19.3 6.8 DFSP [214] PLID 49.1 18.7 20.4 7.3 15.8 49.1 15.6 5.0 52.1 49.3 34.6 18.8 63.7 60.7 47.6 32.7 64.2 66.2 46.6 33.0 64.4 64.0 46.1 32.2 63.0 68.8 48.2 35.0 66.7 71.7 47.2 36.0 67.3 68.8 52.4 38.7 15.7 20.6 11.2 2.2 52.1 31.5 28.9 13.2 63.9 34.6 34.3 18.4 64.1 44.1 38.9 22.7 64.6 44.0 37.1 21.6 63.4 48.1 40.2 24.6 66.8 60.0 44.0 30.3 67.6 55.5 46.6 30.8 – – – – – – 7.5 25.0 8.6 1.4 20.5 26.8 17.1 4.4 – 28.8 26.8 20.5 6.2 – 30.8 28.4 22.4 7.2 38.2 32.0 27.1 10.5 38.8 33.0 27.9 11.0 0.3 7.5 4.6 4.0 0.7 21.0 4.6 5.5 – – – 1.2 28.7 5.2 6.9 – – – 30.1 5.8 7.5 1.4 38.3 7.2 10.4 2.4 39.1 7.5 10.6 2.5 – – GloVe [256] to filter out infeasible compositions. Implementation Details. We implement the PLID based on the CSP codebase in PyTorch. The CLIP architecture ViT-L/14 is used by default. On the MIT-States, we generate 𝑀 = 64 texts and augment an image with 𝑁 = 8 views, and adopt Beta(1, 9) as prior. The dropout rates of cross-attention layers in TFE and VFE are set at 0.5, and the dropout rate of 0.3 for the learnable state and object embeddings. 
For the soft prompt embeddings, we set the context length of text encoder to 8 for all datasets. Following [214], we use Adam optimizer with base learning rate 5e-5 and weight decay 2e-5, and step-wise decay it with the factor of 0.5 every 5 training epochs for a total of 20 epochs. 135 Table 7.2 Ablation study. the baseline that uses mean pooling of text embeddings from T5-generated sentences. (b): add language-informed distribution (LID). (c): use text and visual feature enhancement module (FE). (d): change the LLM from T5-base to the OPT-1.3B. (e): apply primitive decomposition for fused decisions (PDF). (a): LID FE OPT PDF (a) (b) (c) (d) (e) ✓ ✓ ✓ ✓ ✓ ✓ ✓ 7.5.1 Main Results ✓ ✓ ✓ Hcw 35.41 37.06 37.87 38.80 38.97 AUCcw 18.56 20.43 21.09 21.67 22.12 How 17.37 18.65 19.70 19.61 20.41 AUCow 5.56 6.50 6.95 7.01 7.34 The results are reported in Table 7.1. We compare with the CZSL baselines that are developed on the same frozen CLIP model. The table shows that under both the closed-world and open-world test settings, our proposed PLID method achieves the best performance in most metrics on the three datasets. Note that ProDA [215] also formulates the class-wise Gaussian distributions to address the intra-class diversity, but it can only outperform CLIP and CoOp on all metrics. This indicates the importance of both diversity and informativeness for the CZSL task. On the UT-Zappos dataset, the PLID outperforms the DFSP in terms of S, H, and AUC by 0.6%, 5.2%, and 2.7% respectively, while inferior to the DFSP on the best unseen metric. The potential reason is that DFSP fuses the text features into the image images, which better preserves the generalizability of CLIP for the small downstream UT-Zappos dataset. Note that the HPL method uses prompt learning and recognition at both compositional and primitive levels, but it performs only slightly better than CSP and way worse than our method, indicating that traditional prompt learning helps but is not enough to adapt the CLIP model to the CZSL task. 7.5.2 Model Analysis To comprehensively analyze the proposed PLID, we perform extensive ablation study and design analysis on the middle-sized MIT-States dataset in this section. Major Components. In Table 7.2, we show the contribution of the major components in the PLID model. It is clear that they are all beneficial. We highlight some important observations: (1) The LID method in row (b) significantly improves the performance compared to the baseline (a) that 136 Table 7.3 Effect of LID on states (N𝑠), ob- jects (N𝑜), and compositions (N𝑦). The first row: the model w/o LID. Table 7.4 Effect of LLMs. Note that GPT- 3.5 is not open-sourced so that we use its API call to get text descriptions. N𝑠 N𝑜 N𝑦 Hcw AUCcw How AUCow LLMs Hcw AUCcw How AUCow ✓ ✓ 38.44 21.67 19.53 6.99 38.30 21.62 19.49 6.95 ✓ 38.49 21.90 19.93 7.20 ✓ ✓ ✓ 38.97 22.12 20.41 7.34 Mistral-7B 37.22 20.78 19.22 6.74 37.38 20.61 19.38 6.80 GPT-3.5 38.41 21.53 20.46 7.34 T5-base OPT-1.3B 38.97 22.12 20.41 7.34 does not formulate Gaussian distribution in training, and they are much better than ProDA (20.43% vs 16.1% of AUCcw) when referring to Table 7.1. This implies that addressing the context diversity by modeling the Gaussian distribution like the ProDA is not sufficient, but context informativeness is critical and preferred for the CZSL task. (2) Rows (c)(d) show that feature enhancement (FE) and the better LLM OPT-1.3B can also brings performance gains. 
(3) Rows (e) show that the primitive decomposition for fused decision (PDF) could further improve the CZSL performance in both closed- and open-world settings. In the following paragraphs, we further validate the effect or design choices of these components in detail. Effect of LID. In Table 7.3, we investigate at which semantic level the language-informed distribution (LID) should be applied. Denote the Gaussian distribution on state, object, and composition as N𝑠, N𝑜, and N𝑦, respectively. The Table 7.3 results clearly show the superiority of applying LID on all three semantic levels. This indicates the generality of LID towards many potential zero-shot or open-vocabulary recognition problems. Effect of LLM. In Table 7.4, we analyze the choice of LLMs by comparing PLID variants using different LLMs, including the T5-base [262], OPT-1.3B [395], GPT-3.5 [247], and Mistral- 7B [126]. It shows the performance varies across different LLMs. Note that the capacity of GPT-3.5 and Mistral-7B on general language processing tasks is much better than T5-base and OPT-1.3B. However, we do not see improvements by using these generally larger and better LLMs, but a small OPT-1.3B is sufficient to achieve the best performance. TFE and VFE. In Table 7.5, we explore the design choices of the text and visual feature enhancement (TFE and VFE) modules. The results show that using one layer of randomly initialized 137 Table 7.5 Design choices of feature en- hancement (FE). We explore the use of text or visual feature enhancement (TFE/VFE) and the number of their cross- attention layers. TFEVFElayers Hcw AUCcw How AUCow ✓ ✓ ✓ ✓ ✓ ✓ 1 1 3 1 37.89 21.07 19.37 6.78 37.48 21.04 19.43 6.72 37.46 20.65 19.15 6.70 38.97 22.12 20.41 7.34 Table 7.6 Effect of VLPD and fusion strategies. We explore the modalities (text or image) of the decomposition, and whether deterministic (det.) or stochastic (stoc.) compositional fusion. VLPD fusion text image det. stoc. Hcw AUCcw How AUCow ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ 37.94 20.98 19.67 6.98 ✓ 38.40 21.31 19.99 7.13 38.42 21.69 20.24 7.31 38.67 21.90 19.99 7.15 ✓ 38.97 22.12 20.41 7.34 (a) AUC vs. 𝑀 (b) AUC vs. 𝑁 (a) best HM (b) best AUC Figure 7.5 Impact of 𝑀 and 𝑁. We set 𝑁 = 8 for the Fig. 7.5a, while we set 𝑀 = 64 for the Fig. 7.5b. Figure 7.6 Impact of (𝑎, 𝑏). Here (1, 1) im- plies random sampling while (5, 5) implies equally trusted. cross-attention for both TFE and VFE performs the best. Using more cross-attention layers will causes significant performance drop (see the 3rd row). We attribute the cause to the overfitting issue when more learnable parameters introduced. VLPD and Fusion. In Table 7.6, we validate the design choices of visual language primitive decomposition (VLPD) and the stochastic compositional fusion. Comparing with the results of the first two rows, it show clear advantages of primitive decomposition over both image and text modalities. Note that DFSP [214] also has primitive decomposition but only on text modality. Our better performance than DFSP and the results in Table 7.6 thus tell the need for decomposition on both visual and image. Besides, to validate our stochastic compositional fusion, we compare with the model without fusion in the 3rd row and the model with only deterministic fusion (weighted average without Beta sampling) in the 4th row. They also show the benefit of fusion with stochasticity. Hyperparameters. In Fig. 
7.5, we show the impact of the number of generated text descriptions 138 20.521.021.522.0481632646.56.87.17.4OWCW20.521.021.522.0248166.56.87.17.4OWCW(1,1)(1,9)(9,1)(5,5)152025303540CWOW(1,1)(1,9)(9,1)(5,5)510152025CWOW (a) LLM compositions (b) LLM states (c) LLM objects (d) Learned composition DSP (e) Learned state DSP (f) Learned object DSP Figure 7.7 tSNE visualization of the text embeddings with (the 2nd row) and without (the 1st row) learnable distribution modeling over compositions (the 1st column), states (the 2nd column), and objects (the 3rd column). This figure clearly shows that our method achieves good performance by distribution modeling. 𝑀 and the number of augmented image views 𝑁. It shows that the best performance is achieved when 𝑀 = 64 and 𝑁 = 8. We note that more augmented image views slightly decrease the performance, which could be attributed to the overfitting of the seen compositions. In Fig. 7.6, we show the impact of the Beta prior parameters (𝑎, 𝑏). We set them to (1, 1) for random sampling, (1, 9) for preference to the composition, (9, 1) for preference to re-composition, and (5, 5) for equal preference, respectively. It reveals that trusting more of the directly learned composition by Beta(1, 9) achieves the best results. Class Distributions. We use the tSNE to visualize the generated text embeddings D and the learned DSP from or PLID model in Fig. 7.7, where the same set of 10 compositional (or state/object) classes are randomly selected from MIT-States dataset. It shows that by learning the distribution of each composition, state, and object from LLM-generated texts using Eq. (7.1) and (7.3) and TFE module, class embeddings can be distributed more compactly in each class (small 139 100101001001234567891001010010012345678910010100100123456789100101001001234567891001010010012345678910010100100123456789 (a) Success and failure cases. (b) Comparison with model without LID. In Fig. 7.8a, we show the cases from the MIT-States dataset that our Figure 7.8 Case studies. method succeeds or fails. In Fig. 7.8b, we compare the proposed method with and without language-informed distribution (LID) modeling. Correct predictions are in green color, while incorrect predictions on state or object part are marked in red. intra-class variance), and better separated among multiple classes (large inter-class distance). This clearly show why our proposed language-informed distribution modeling works in the CZSL task. Case Study. In Fig. 7.8a, we show some success and failure cases of our PLID model. They reveal PLID still could make mistakes on the state prediction (cooked pasta) and object prediction (engraved floor), which indicates there are still rooms for improvement. In Fig. 7.8b, we show that PLID could work much better than the model without LID. For example, the sunny creek and frayed wire are incorrect potentially due to the lack of handling (i) intra-class variety, as the dry creek images can be sunny and the frayed hose class could contain wire images, and (ii) inter-class correlation, as the sunny is correlated to both the dry creek images and other sunny images. 7.6 Conclusion In this work, we propose a novel CLIP-based compositional zero-shot learning (CZSL) method named PLID. It leverages the generated text description of each class from large language models to formulate the class-specific Gaussian distributions. 
By softly prompting these language-informed 140 imagesground truthpredictionsmall laptopheavy gearsplintered palmsmall dogsmall laptopheavy gearsplintered palmsmall dogengraved computerengraved floorheavy waterhuge waveruffled pastacooked pastasuccess casesfailure casesimagesOurs (w/o LID)Ours (PLID)ruffled ribbonthawed meatdry foamengraved swordcrinkled flowerbrowned beefcored clayblunt swordfrayed wirefrayed hosediced cheesediced salmonsunny creekdry creekblunt swordbrowned beefcored claycrinkled flowerdiced salmondry creekfrayed hoseclassesimagesOurs (w/o LID)Ours (PLID)ruffled ribbonthawed meatdry foamengraved swordcrinkled flowerbrowned beefcored clayblunt swordfrayed wirefrayed hosediced cheesediced salmonsunny creekdry creekblunt swordbrowned beefcored claycrinkled flowerdiced salmondry creekfrayed hoseclasses distributions, PLID could achieve diversified and informative class embeddings for fine-grained compositional classes. Besides, we decompose the visual embeddings of image data into states and objects, from which the re-composed predictions are derived to calibrate the prediction by our proposed stochastic logit mixup strategy. Experimental results show the superiority of the PLID on multiple CZSL datasets. 7.7 Supplementary Material 7.7.1 Broader Impact and Limitations Broader Impact. The method in this work can be broadly extended to more multi-modality applications, such as general zero-shot learning, cross-modality compositional retrieval and gener- ation, etc. Besides, the central idea of LLM-based modality alignment is not limited to text and image, but any modality that could reveal the semantic categories in practice is promising to explore in the future. The potential negative societal impact is that, the developers should be cautious by carefully examining the societal biases indicated by the generated class descriptions, though the LLMs we used are publicly accessible. Limitations. One limitation is that the primitive decomposition could be difficult to learn when the states are non-visual concepts like smelly, hot, etc., even by the pre-trained CLIP model. Another limitation is that the generated descriptions by LLMs are not grounded to the image such that some distraction from generated descriptions could be introduced. 7.7.2 Generating Compositional Class Descriptions In this work, we choose T5-base, OPT-1.3B, GPT-3.5, and Mistral-7B models as the LLMs for compositional class description generation. For the T5 model, we follow the same setting as [107] that uses the T5-base model for word-to-sentence generation. The T5-base model was pre-trained on the Colossal Clean Crawled Corpus dataset [262] and finetuned on the CommonGen dataset [190]. Take the painted ceiling as an example, the results from T5-base model are: - A very old but beautifully decorated ceiling. - A remodeled interior with a painted ceiling. 141 - A painted ceiling at a restaurant. - Stained glass windows and a carved pattern on the ceiling. - Painted ceilings and a fireplace. - This apartment has a painted ceiling. - A painted ceiling was a huge hit. - A chandelier is painted in the middle of the ceiling. - A stained glass window in a bathroom with a white painted ceiling. - The ceiling of a hotel is painted. For the OPT model, we adopt the pre-trained OPT-1.3B model which is small but still could generate sentences of sufficiently good quality. 
Considering that it is a decoder-only LLM that relies on a few examples as the prompt for word-to-sentence generation, we design the prompt template with 4 examples as shown below.

Keywords: sliced, potato, picture
Output: The picture features a beautifully arranged plate of thinly sliced potatoes.
###
Keywords: red, apple, photo
Output: In the photo, a bright red apple is the central focus, captured in stunning detail.
###
Keywords: leather, shoes, image
Output: The image showcases a sleek pair of leather shoes, meticulously designed to impress.
###
Keywords: broken, car, photo
Output: The photo captures the aftermath of a car accident with a broken vehicle lying on the road.
###

Here, Keywords is followed by the words of the state, the object, and a word randomly picked from the set {photo, image, picture}. Using the same example painted ceiling as for the T5-base model, the generated sentences are:

- The painting of the ceiling features an intricate pattern of intricate gold-colored paint and is framed by a white background.
- The ceiling has been painted with the pictures of these three characters, all arranged together.
- In the picture, the ceiling is covered in bright, colorful paintwork that has been painted on by hands painted white. The colors have been selected carefully.
- In the picture, the ceiling features painted decoration. The decoration resembles the surface of the sea, and has been painted in shades of blue.
- The photograph captures both the bright colors of the painting atop the ceiling and the subtle shades of light reflecting off of it.
- The large picture shows a large pattern painted onto the ceiling. The blue line shows paint dripping down.
- The wall behind the picture shows three different painted ceilings, in bright contrasting colors. A vibrant sky and blue skies are depicted against the dark brick wall.
- The ceiling of the room depicted in the painting could very well be painted in a few hours. The details of each object are clearly defined in its placement and position.
- Another photo of the same scene, this time featuring a ceiling painted in a stunning, white color.
- A painted ceiling is shown, painted according to a specific design. This is a typical design that can also include decorative or functional elements.

It is clear that these generated class descriptions are much more diverse and informative than those of the T5 model.
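For reference, the following is a minimal sketch of this few-shot word-to-sentence generation with the publicly available OPT-1.3B checkpoint on Hugging Face. The helper name, the shortened two-example prompt, and the sampling settings are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Minimal sketch: few-shot word-to-sentence generation with OPT-1.3B
# (assumed Hugging Face checkpoint "facebook/opt-1.3b").
import random
from transformers import AutoModelForCausalLM, AutoTokenizer

FEW_SHOT = (
    "Keywords: sliced, potato, picture\n"
    "Output: The picture features a beautifully arranged plate of thinly sliced potatoes.\n###\n"
    "Keywords: red, apple, photo\n"
    "Output: In the photo, a bright red apple is the central focus, captured in stunning detail.\n###\n"
)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

def describe(state: str, obj: str, num_sentences: int = 10):
    """Generate descriptive sentences for a (state, object) composition."""
    sentences = []
    for _ in range(num_sentences):
        medium = random.choice(["photo", "image", "picture"])
        prompt = FEW_SHOT + f"Keywords: {state}, {obj}, {medium}\nOutput:"
        inputs = tokenizer(prompt, return_tensors="pt")
        out = model.generate(**inputs, do_sample=True, top_p=0.9,
                             max_new_tokens=40, pad_token_id=tokenizer.eos_token_id)
        # keep only the newly generated text, cut at the few-shot separator
        text = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)
        sentences.append(text.split("###")[0].strip())
    return sentences

# e.g., describe("painted", "ceiling")
```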
7.7.3 Covariance Sharing

For the CZSL task, the spatial complexity of computing the covariance matrices Σ_{1:C} is O(|C^{(s)}|² d), which could be too heavy to compute if the number of compositions is too large. For example, the C-GQA dataset contains 278K seen compositions, which results in around 6 × 10¹³ floating-point elements of Σ_{1:C} for 768-dim text features. To handle this issue, we instead implement Σ_{1:C} by sharing the covariance across attributes given the same object. This implies that the model is encouraged to learn the object-level distributions. Specifically, similar to the VLPD module of the main paper, we compute the mean μ_{1:|O|} and covariance Σ_{1:|O|} over the objects by grouping t_y and D^{(y)} with object labels:

\mathbf{t}_o = \frac{1}{|\mathcal{Y}_o|} \sum_{y \in \mathcal{Y}_o} \mathbf{t}_y, \qquad \mathcal{D}^{(o)} = \frac{1}{|\mathcal{Y}_o|} \sum_{y \in \mathcal{Y}_o} \mathcal{D}^{(y)}, \qquad (7.5)

where Y_o is the subset of compositions in Y that contain the same object as 𝑦. Then, all the pairwise margins H_o^{(m)} ∈ R^{|O|×|O|} in the object space can be mapped back to H^{(m)} ∈ R^{C×C} in the compositional space by sharing them with all compositions in Y_o. This could significantly reduce the computational load of the covariance while compromising the accuracy of distribution modeling. Since the distribution modeling of both our PLID and ProDA is not applicable to the C-GQA dataset, we use the MIT-States dataset to show the negative impact of sharing the covariance (see Table 7.7). It shows that covariance sharing can significantly save GPU memory (17.6 vs. 32.5 GB) while still performing much better than ProDA.

Table 7.7 Effect of covariance sharing on the MIT-States dataset. All methods use the same batch size of 64 for a fair comparison of GPU memory.
Variants | Mem. (GB) | Hcw | AUCcw | How | AUCow
ProDA [215] | 32.5 | 32.71 | 16.11 | 17.30 | 5.11
PLID (w. ShareCov) | 17.6 | 38.50 (-0.47%) | 21.69 (-0.43%) | 19.81 (-0.60%) | 7.04 (-0.30%)
PLID (full) | 22.2 | 38.97 | 22.12 | 20.41 | 7.34

7.7.4 Primitive-level Gaussian Modeling

To formulate the Gaussian distributions over the state classes and the object classes, we group the text embeddings of the composition descriptions D by Eq. (7.5), resulting in the distribution support points (DSP) t_o + D^{(o)} and t_s + D^{(s)} for a given object class 𝑜 and state class 𝑠, respectively. The DSPs are assumed to follow the state distribution N(t_s, Σ_s) or the object distribution N(t_o, Σ_o), where the covariances Σ_s and Σ_o are determined by D^{(s)} and D^{(o)}, respectively. Eventually, given the decomposed state visual features f_s(v) and object visual features f_o(v), the logit margin terms are defined as

h^{(m)}_{k,s} = f_s(\mathbf{v})^{\top} \mathbf{A}_{k,s} f_s(\mathbf{v}), \qquad h^{(m)}_{k,o} = f_o(\mathbf{v})^{\top} \mathbf{A}_{k,o} f_o(\mathbf{v}), \qquad (7.6)

where the index 𝑘 ranges within [1, |S|] for computing the state classification loss L_s, and within [1, |O|] for computing the object classification loss L_o, respectively.

7.7.5 More Implementation Details and Results

Implementation. The training hyperparameters of our final model on each dataset are listed in Table 7.8.

Table 7.8 Hyperparameters of the model implementation.
Hyperparameters | MIT-States | UT-Zappos | C-GQA
max epochs | 20 | 25 | 20
base learning rate | 0.00005 | 0.0001 | 0.00001
weight decay | 0.00002 | 0.00001 | 0.00001
number of text descriptions | 64 | 32 | 64
number of image views | 8 | 8 | 8
attention dropout | 0.5 | 0.1 | 0.1
weights of primitive loss | 0.1 | 0.01 | 0.01

More Ablation Analysis. In Table 7.9, we show more ablation study results on the design choices of our model. The first is to answer: should we learn both the compositional and the primitive feature space? This is interesting because, if the primitive space can be learned by the proposed VLPD, the original compositional space is intuitively redundant. In the first line of Table 7.9, we show that if we remove the compositional space and only learn the primitive space to recompose, the performance experiences a large drop in all metrics. This can be explained by the intuition that, without direct compositional recognition, the merits of the explicitly learned separability and the implicitly learned compositionality are totally lost. These are the keys to the success of the pioneering CZSL method CSP [239].

Table 7.9 More ablation study results.
Model variants | Hcw | AUCcw | How | AUCow
recompose only | 30.02 | 13.88 | 15.46 | 4.35
w/o soft prompt | 38.57 | 21.67 | 20.00 | 7.17
3-layer FE (TFE only) | 36.89 | 19.93 | 18.77 | 6.42
3-layer FE (VFE only) | 36.55 | 19.80 | 19.06 | 6.51
3-layer FE (TFE+VFE) | 37.46 | 20.65 | 19.15 | 6.70
full model | 38.97 | 22.12 | 20.41 | 7.34
Besides, in Table 7.9 line 2, we investigate whether the soft prompt is still useful or not based on our model, though it has been validated in prior CZSL literature [214]. It shows that without the soft prompt, the performance decreases but not too much. However, it is still necessary as it drives the LLM text distributions to align with visual features in training. Lastly, in Table 7.9 lines 3-5, we further analyze the impact of TFE and VFE modules if they are implemented with the three-layer cross-attention Transformers. The two modules still show contributions to the performance gain. Moreover, compared to the default one-layer setting, using more Transformer layers does not improve the performance, even performing worse. 146 CHAPTER 8 OPEN-VOCABULARY ACTION DETECTION 8.1 Introduction Action Detection (AD) aims to recognize actions and spatially and temporally localize the actors in videos. It plays a vital roles in various applications like video surveillance [365, 383, 51], autonomous driving [293], and sport event analysis [184], and it thus draws increasing attentions in recent literature [134, 299, 150, 250, 38, 345, 398, 347, 37]. Existing AD methods are mostly developed in a closed-set setting where the models are trained and tested on videos from the same fixed set of action categories.While significant progress has been made over the past few years [38, 398, 347, 37], the assumption that the training and test videos are from the same action categories limits their application to the real world, where test videos could contain actions beyond the pre-defined training categories. For example, a video surveillance system may be able to detect fighting, but other dangerous or suspicious actions like shooting and chasing will not be detected if the system has not been trained with annotated videos from these action categories. In addition, being able to detect actions in an open world facilitates a comprehensive understanding of videos and opens doors to high-level video understanding tasks, like reasoning [386], forecasting [300], etc., that usually require detecting various actions in videos. This motivates us to investigate Open-Vocabulary Action Detection (OVAD), a task aiming to detect any actions in videos, including both seen categories contained in the training set and unseen categories absent in the training set. However, OVAD is challenging as it requires understanding the human motion dynamics across frames. While motion dynamics modeling has been well studied by the conventional closed-set action detection [74, 38, 398, 347] that takes advantages of full supervision in training, it is challenging in the open-vocabulary setting since there is no This chapter is adapted from a paper in submission, which was mostly done during the internship of Wentao Bao and Yuxiao Chen at NEC Labs America and partly supported by Dr. Yu Kong at MSU. "Wentao Bao, Kai Li, Yuxiao Chen, Deep Patel, Martin Renqiang Min, and Yu Kong. Exploiting VLM localizability and semantics for open vocabulary action detection. in submission, 2024." 147 supervision for the unseen action categories. Recently, harvesting the strong generalization capability of pre-trained large visual-language foundation models (VLMs) [261], various open-vocabulary approaches have been proposed for image recognition [392], object detection [9, 140, 348, 191, 160], and image segmentation [411, 188]. However, these methods are designed for images and do not consider temporal dynamics among video frames. 
In addition, image VLMs such as the CLIP [261] are struggling to capture the action verbs in text and human motion in videos [229]. This inevitably requires learning the temporal dynamics [132] or fully fine-tuning [264] for recognizing the actions on downstream tasks, which take the risk of poor generalizability to the unseen. There are a few seminal works that leverage VLMs for open-vocabulary video understanding, including action recognition [243, 340, 204, 264] and temporal action localization [265, 236, 359]. However, for the region-level action detection by VLM, there exists a representation gap between video-level pre-training and the region-level adaptation, which is analogous to the representation gap issue discussed in image-based open-vocabulary object detection literature [9, 140, 348]. Specific to the OVAD task, the representation gap stems from the holistic video-action alignment in pre-training and the downstream region-level sub-tasks, i.e., , region-action alignment and action-relevant person localization. The cause of the representation gap can be attributed to their intrinsically different adaptation goals from pre-trained video VLMs, i.e., , transferring the semantics and localizability of VLMs from video to regions for the two sub-tasks, respectively. Re-thinking the Transformer-like design of VLMs, we found that the way of using VLM semantic features and the undervalued localizability of VLMs are both critical to the OVAD task. First, to transfer the video-level semantics to each region, we propose to learn a set of region- wise queries to decode the temporal dynamics from videos, by using the pre-trained video-level features as adaptive semantics conditions. The updated queries and video-level features are further dynamically fused and aligned with the textual semantics for recognition. Second, to exploit the video VLM localizability for region-wise localization, we learn a set of queries to decode the person boxes starting from the prior locations revealed by the VLM visual attention. 148 Specifically, we develop a query-based open-vocabulary action detector, OpenMixer, to detect any video actions in an open vocabulary. It fits in the family of the detection transformers (DETR) [24, 271, 82, 398, 347]. The basic idea is to decouple the action recognition and localization by learning two sets of queries and corresponding decoding modules. Our OpenMixer consists of a spatial OpenMixer Block (S-OMB) for person localization, a temporal OpenMixer Block (T-OMB) for capturing the region-level temporal motion, and a dynamically fused alignment (DFA) for open-vocabulary action recognition. The S-OMB inherits the localizability of VLMs by the text- patch cross-attention, the T-OMB exploits the visual semantic features of VLMs to better capture the temporal dynamics, and the DFA dynamically fuses the pre-trained semantics into learnable region-level queries for generalizable recognition. Eventually, our model enjoys the merits of semantics and localizability from VLMs as well as the end-to-end detection capability from the DETR pipeline. Besides, we set up the first OVAD benchmark based on popular action detection datasets, and evaluate technically viable baselines and our proposed model. Empirical results show that our OpenMixer is superior to these baselines. In summary, the main contributions are three-fold: • We formulate the task of open-vocabulary action detection (OVAD) for the first time, which is valuable while challenging even by foundation models. 
• We develop the OpenMixer model that exploits the semantics and localizability of pre-trained video-language models toward the OVAD task. • We empirically reveal the effectiveness of the proposed modules that show strong generaliz- ability on multiple video action detection datasets. 8.2 Related Work Spatio-temporal Action Detection. This task aims to localize human actions spatially and temporally in videos and recognize their actions, which has been a fundamental video understand- ing topic [343, 74, 73, 275]. A line of recent works [299, 150, 306, 250, 36, 403] adopts the two-backbone design to separately extract features of the keyframes and the entire video for actor 149 localization and actor-context relation modeling, respectively. Though they are flexible by taking advantage of both image and video backbones for achieving promising performance, the model parameters are redundant in design and heavy to optimize [37]. With the recent advances in detec- tion transformer (DETR) [24], end-to-end action detection by a single backbone shows impressive performance and thus becomes a more popular design choice [89, 38, 398, 347, 37]. The basic idea is to use a single video transformer to get features of all video frames, and then introduce learnable queries to mix with video features for actor localization and action recognition. Specif- ically, WOO [38] follows the Sparse RCNN [301] for localization, TubeR [398] learns the action tubes following the classical DETR [24], and STMixer [347] follows the AdaMixer [82] design that achieves the sate-of-the-art performance. The query-based design is advantageous in modeling the interaction between actors and actor context while simplifying the architecture as a single-stage design. Therefore, to further overcome the limitation in an open world, we introduce large-scale pre-trained vision-language models to achieve query-based open-vocabulary action detection. Open-vocabulary Visual Understanding. The basic idea is to replace the traditional fixed classifier weights with the pre-trained textual representation of class categories. Thanks to the strong alignment capability of pre-trained visual language models (VLM), visual data from unseen classes in an open world can be recognized by the alignment between the visual feature and the text feature of the class names [346, 261]. This motivates a series of works in object detection [385, 206, 384, 405, 140, 191, 160, 348], action recognition [332, 243, 198, 132, 264, 338], and temporal action localization [132, 265, 236, 204]. For video action detection, the iCLIP [117] that tackles the zero-shot action detection is the most relevant work to ours. However, it skipped learning to localize actions by off-the-shelf person detectors [267] and only learns to recognize the unseen actions, which lacks the adaptability to localize action-relevant persons. Besides, the recent image- based open-vocabulary object detectors OV-DETR [384] and CORA [348] share a common spirit with ours by injecting VLM semantics into the learnable queries. However, the query conditions in OV-DETR are class-specific such that they are not adaptive to test-time samples, and the two-stage training in CORA limits its flexibility in video domains. To the best of our knowledge, we are the 150 Figure 8.1 Framework (left) and the OpenMixer Block (right). Given a video and an open vocabulary of actions, we use prompted classes and a pre-trained video VLM to obtain all kinds of VLM features. 
With a stack of cascaded OpenMixer blocks and spatial-temporal queries, the model predicts the action scores, person boxes, and their associated person scores. first to systematically investigate the open vocabulary action detection (OVAD) task and we develop the first query-based OVAD model that can be learned in an end-to-end way. 8.3 Method In contrast to the closed-set video action detection [299, 134], open-vocabulary action detection (OVAD) aims to recognize and spatiotemporally localize any human actions in videos, including action categories seen and unseen in training. Concretely, an OVAD model is learned from a training set of 𝑁𝑡𝑟𝑎𝑖𝑛 samples {(X, Y)𝑖 |𝑖 = 1, . . . , 𝑁𝑡𝑟𝑎𝑖𝑛} where X denotes the training video and Y denotes the bounding box annotations on the keyframe that consists of box coordinates b and action category 𝑦. In training, an action 𝑦 is drawn from a fixed set of base action categories C𝐵. In testing, the learned action detector could detect “any” actions in a given video from the open vocabulary C𝐵 ∪ C𝑁 , where C𝑁 contains any novel action categories. 8.3.1 OpenMixer In this paper, we propose the OpenMixer to solve the OVAD task. The OpenMixer model is developed within the family of query-based action DEtection TRansformers (DETR) [347, 38, 398]. Basically, DETR-style models treat the action detection task as a set-to-set prediction problem, i.e., , learning a sparse set of query features from videos to match with the ground truth boxes and action classes. Specific to the OVAD task, the action classes are predicted from an open vocabulary that contains both the base and novel actions. The OpenMixer is shown in Fig. 8.1 (left), given a video X and a list of text prompted action 151 action videoaction textA video of person {playing golfing}A video of person {playing golfing}Golfing: A person is swinging a golf club to hit a golf ball.𝑓!𝑓"𝑆text & visual semantics𝑉 visual context & locationaction scoresperson scorestemporal queriesT-OMB(")S-OMB(")spatial queriesOpenMixer Block⋮boxes("$%)boxes(")OpenMixer BlockOpenMixer BlockOpenMixer Blocktemporal queriesspatial queriesDFA(")VLM featuresvideo VLMaction scoresperson scoresboxes class as input, we leverage the visual and text encoders ΨVE and ΨTE of a pre-trained video VLM to obtain all features of the video and action text, i.e., , V, f𝑣, S = ΨVE(X) and f𝑡 = ΨTE(𝑦). Here, V, f𝑣 and S are the 4D patch-level video feature, video-level feature, and video attention, respectively, and f𝑡 is the text feature of class 𝑦. Then, we build 𝑀 cascaded OpenMixer Blocks (OMB) to learn a set of 𝑁 spatial queries Q𝑠 and 𝑁 temporal queries Q𝑡 from (V, S, f𝑣, f𝑡) for person detection and action classification, respectively. The OMB takes as input all the features from VLM and the Q𝑠 and Q𝑡 to predict person boxes, person scores, and action scores. For the 𝑚-th OMB, as shown in Fig. 3.2 (right), it consists of a Temporal OpenMixer Block (T-OMB) Ψ𝛼, a Spatial OpenMixer Block (S-OMB) Ψ𝜃, and a dynamically fused alignment (DFA). The S-OMB consists of prior location sampling, query-query (Q-Q) mixing by self-attention [328] and query-video (Q-V) mixing by AdaMixer [82], while the T-OMB sequentially consists of Q-Q mixing, query conditioning, and Q-V mixing (see Fig. 8.2a and 8.2b for reference). The DFA module recursively updates the Q𝑠, Q𝑡, and person boxes from the (𝑚 −1)-th OMB, and predict person scores and action scores. 
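Before the decoding blocks are applied, the action vocabulary is encoded once by the frozen VLM text encoder. The sketch below uses the public image-based CLIP from Hugging Face as a stand-in for the video VLM text encoder ΨTE; the checkpoint name, the handcrafted prompt template, and the normalization step are illustrative assumptions, not the exact CLIP-ViP pipeline.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

# frozen text encoder as a stand-in for the video VLM text branch
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16")

actions = ["golf", "kick ball", "pull up"]                 # open vocabulary
prompts = [f"A video of person {a}" for a in actions]      # handcrafted (HC) prompt

with torch.no_grad():
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    f_t = model.get_text_features(**tokens)                # (C, D) class text features
    f_t = f_t / f_t.norm(dim=-1, keepdim=True)             # L2-normalize for cosine alignment
```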
These three modules are developed for the OVAD task with the consideration of VLM semantics and localizability, which will be introduced in the following sections.

8.3.2 Localizability Prior for Spatial OMB

A major challenge for one-stage query-based detectors is the slow convergence of localization. One of the causes is the lack of prior knowledge about object locations. Specific to action detection, recent two-stage action detectors [377, 117, 72, 37] address the location prior with an off-the-shelf person detector and RoIAlign [105] cropping, but the feature cropping lacks the spatiotemporal context and suffers from a representation gap when a pre-trained VLM is introduced. For recent query-based action detectors [38, 398, 347], prior knowledge of person locations is missing in their design. Therefore, when it comes to the OVAD task by VLMs, a natural question is: can we obtain the prior locations of actors from pre-trained VLMs in a cheap way? Motivated by these considerations, we resort to the visual attention from a pre-trained VLM.

Figure 8.2 Spatial and Temporal OMB, and DFA: (a) S-OMB, (b) T-OMB, (c) DFA. In Fig. 8.2a and Fig. 8.2b, the Q-Q and Q-V mixing modules aim to mix information among queries and across query-visual features, respectively. S-OMB is in Sec. 8.3.2, where the dashed arrow is only used at the 1st stage. T-OMB is in Sec. 8.3.3 and DFA is in Sec. 8.3.4.

Prior Locations from VLM Attention. Visual attention maps are traditionally represented by the class activation map (CAM) to visually explain recognition models [406, 281]. In the era of ViT [65] and VLM [261], recent works [25, 32, 178] propose to use the multi-head self-attention (MHSA) of the last ViT layer, or the gradient-weighted accumulative product over multi-layer self-attention. However, MHSA is not visually faithful due to the high redundancy of video tokens, and the gradient-based methods suffer from a huge computational cost on video VLMs and ad-hoc implementations for different VLMs. Moreover, due to the lack of token-level video-text correlation, their attention maps are not closely relevant to the action specified by the vocabulary. Therefore, an efficient and structure-agnostic CAM is preferable for large video VLMs, which motivates us to use the patch-text correlation as the VLM attention to encode the location priors.

Specifically, we have the 𝐷-dimensional 4D video feature V ∈ R𝑇×ℎ𝑤×𝐷, where 𝑇 is the number of frames and ℎ𝑤 is the number of visual tokens in each frame, the holistic video feature f𝑣 ∈ R𝐷, and the text features of the 𝐶 classes F𝑡 = [f𝑡^(1), . . . , f𝑡^(𝐶)]⊤. All features are 𝐿2-normalized. We first get the pre-matched text feature f𝑡 by maximum similarity, f𝑡 = arg max_{f𝑡 ∈ F𝑡} f𝑣⊤ ⊗ f𝑡, since we do not have access to the class label in testing. Thus, the inner product between f𝑡 and V determines the patch-text correlation: S = V ⊗ f𝑡. Furthermore, as discussed in [179, 178], the q-v attention in self-attention layers shows an opposite heatmap, where the foreground regions are associated with low attention values. In practice, we also observed this issue, so that, similar to [179], our CAM is determined by the reversed patch-text similarity: Ŝ = 1 − S. By reshaping and spatial interpolation over Ŝ, the attention map is obtained for prior location sampling. We treat Ŝ as the prior distribution of person locations indicated by the VLM; thus, the top-𝑁 positions are sampled as the initial box centers: {(𝑢, 𝑣)𝑖 | 𝑖 = 1, . . . , 𝑁} ∼ Ŝ(𝑢, 𝑣, 𝑘), where (𝑢, 𝑣) are 2D coordinates on the keyframe 𝑘 and 𝑁 is the number of queries.
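A minimal sketch of this attention-based location prior is given below. Tensor shapes follow the text (V: T × hw × D patch features, f_v: D video feature, F_t: C × D text features, all L2-normalized); the square patch grid, the bilinear upsampling, and the use of torch.multinomial to draw the top-N positions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def prior_locations(V, F_t, f_v, keyframe, H, W, num_queries=100):
    """Sample N prior box centers from the reversed patch-text similarity map."""
    f_t = F_t[(f_v @ F_t.T).argmax()]            # pre-matched text feature (no label at test time)
    S = V @ f_t                                  # patch-text correlation, (T, h*w)
    S_hat = 1.0 - S                              # reversed similarity used as the CAM
    h = w = int(S_hat.shape[1] ** 0.5)           # assumes a square patch grid
    cam = S_hat[keyframe].reshape(1, 1, h, w)
    cam = F.interpolate(cam, size=(H, W), mode="bilinear", align_corners=False)
    probs = cam.flatten().clamp(min=0)
    idx = torch.multinomial(probs / probs.sum(), num_queries, replacement=False)
    centers = torch.stack([idx % W, idx // W], dim=-1).float()   # (N, 2) as (u, v) pixels
    return centers
```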
Spatial OMB. With the sampled prior locations, the S-OMB (see Fig. 8.2a), which consists of Q-Q and Q-V mixing modules, takes as input the video patch features V and the box predictions ˆb𝑚−1 of the previous (𝑚−1)-th stage to update the spatial queries by ˆQ𝑠 = Ψ𝜃𝑚(V, Q𝑠, ˆb𝑚−1). The updated spatial queries ˆQ𝑠 are used to predict the person scores ô𝑚 and the person box offsets Δˆb𝑚 by MLPs. Then, the predicted boxes at stage 𝑚 are updated by ˆb𝑚 = ˆb𝑚−1 + Δˆb𝑚, where the initial box queries ˆb0 consist of the sampled prior locations and the video spatial range. The technical intuition behind this design is to encourage the proposed Spatial OMB to learn the box offset Δb starting from the prior locations inherited from the pre-trained VLM. Besides, compared to [347], which uses fixed, non-informative frame centers as prior locations, our VLM attention-based prior locations are adaptive to the test-time video content and vocabulary, which improves not only the seen action localization but also the generalization to the unseen (see the ZSR+TL section of Table 8.1).

8.3.3 Adaptive Semantics for Temporal OMB

For query-based OVAD models, the temporal queries are expected to be discriminative for both base and novel actions. This requires a strong content-decoding capability of the query-video (Q-V) mixing module. The pioneering work DETR [24] uses cross-attention, while [82, 347] adopt the MLP-Mixer [322]. However, without VLM semantics, these approaches inevitably overfit the seen class data and are unable to detect the unseen. Recent works [384, 348] rightly address the importance of VLM semantics for the query features, but they lack adaptability to the test-time visual content due to the class-wise semantic condition in [384] and the region prompting in [348]. These observations motivate us to propose the Temporal OMB that exploits adaptive semantics from pre-trained VLMs.

Temporal OMB. As depicted in Fig. 8.2b, with the temporal queries Q𝑡 and the predicted boxes ˆb𝑚 at the current stage 𝑚, the queries are updated by interacting with the video features V and f𝑣 through the function ˆQ𝑡 = Ψ𝛼𝑚(V, Q𝑡, f𝑣, ˆb𝑚). To achieve our motivation of using adaptive semantics, we propose the query update

\hat{\mathbf{Q}}_t = \Psi_{qv}\big(\Psi_{qq}(\mathbf{Q}_t, \mathbf{b}) \oplus \mathbf{f}_v,\; \mathbf{V},\; \mathbf{b}\big), \qquad (8.1)

where Ψ𝑞𝑞 and Ψ𝑞𝑣 are the Q-Q mixing and Q-V mixing modules implemented by self-attention [328] and AdaMixer [82], respectively. Here, f𝑣 is the adaptive semantic condition from the pre-trained VLM video feature, which is broadcast-added (denoted as ⊕) to the output of the Q-Q mixing.

Remark. Note that the adaptiveness of the semantic condition stems from the test-time video feature f𝑣. Alternatively, when the semantic condition f𝑣 is changed to f𝑡 over 𝐶 classes, it is equivalent to the way in [384].
However, we empirically show that this leads to inferior performance (see Table 8.4), especially for seen action detection. The inferiority can be attributed to the lack of adaptability to the test-time video content. Besides, as another alternative, the post-condition places the condition f𝑣 after the Q-V mixing, i.e., ˆQ𝑡 = Ψ𝑞𝑣(Ψ𝑞𝑞(Q𝑡, b), V, b) ⊕ f𝑣, so that the module Ψ𝑞𝑣 learns the residual of f𝑣. We empirically found that our pre-condition in Eq. 8.1 is superior to the post-condition, potentially because of the better query features used to learn the important Q-V mixing module.

8.3.4 Dynamically Fused Alignment

To recognize both seen and unseen actions, the model needs to learn discriminative region-wise visual features to align with the seen actions, while keeping the generalizable knowledge of the pre-trained VLMs to align with the unseen actions. Dealing with the two goals is challenging. A line of recent approaches uses model adaptation by prompt tuning [132, 67, 236, 338, 348, 117], adapters [251, 81], and gradient preserving [340, 412]. However, these methods either struggle to generalize to novel categories or need to back-propagate through the large VLM, which incurs huge computational costs, especially for long videos. Therefore, we resort to a dynamically fused alignment (DFA) for open-vocabulary action recognition, which is lightweight in design and works well for both seen and unseen actions.

Specifically, as shown in Fig. 8.1 and Fig. 8.2c, the DFA is formulated to learn the action classification at each stage 𝑚, i.e., ŷ𝑚 = Ψ𝝀𝑚(ˆQ𝑡, f𝑣, f𝑡), where ŷ𝑚 are the predicted actions for all queries ˆQ𝑡 and 𝝀𝑚 are the learnable parameters. This module includes dynamic feature fusion and query-text alignment.

Dynamic Feature Fusion. This step aims to fuse the video-level feature f𝑣 into each of the queries ˆQ𝑡 dynamically. Specifically, we first repeat f𝑣 𝑁 times to obtain F𝑣 ∈ R𝑁×𝐷. Then, the fusion between F𝑣 and ˆQ𝑡 is achieved by ˜F𝑣 = 𝝀 ⊙ F𝑣 + (1−𝝀) ⊙ ˆQ𝑡, where 𝝀 ∈ R𝑁×1 is learnable in training. The intuition behind the query-specific learnable 𝝀 is that it allows dynamic contributions of the video-level knowledge from f𝑣 to the different learnable queries in the set-matching training.

Query-Text Alignment. To make the classification decision from ˜F𝑣 and an open vocabulary of actions, we leverage GPT-4 [248] to generate multiple visually descriptive action prompts for each action category. With the VLM text encoder, the aggregated text features of C classes are represented as F𝑡 ∈ R^{C×𝐷}, where C is the number of classes. Eventually, we use the softmax of the visual-text cosine similarity to represent the multi-class classification probability: 𝑃(ŷ | ˆQ) = softmax(˜F𝑣 ⊗ F𝑡⊤ / 𝜏), where 𝜏 is the VLM temperature. In testing, the open-vocabulary action recognition for all queries is achieved by finding the maximum visual-text cosine similarity: ŷ = arg max_{𝑦∈C} (˜F𝑣 ⊗ F𝑡⊤). Note that we do not include the spatial queries Q𝑠 as input to our DFA module. This decouples the T-OMB Ψ𝛼 and the S-OMB Ψ𝜃 in training such that the person localization is class-agnostic, which is essential for open-vocabulary tasks according to [236, 348].
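A minimal sketch of the DFA computation described above is given below. The class name, the sigmoid constraint on the fusion weights 𝝀, and the temperature handling are illustrative assumptions rather than the exact released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicallyFusedAlignment(nn.Module):
    """Sketch of DFA: per-query learnable weights blend the video-level VLM
    feature with the learned temporal queries, and actions are scored by
    temperature-scaled cosine similarity against class text features."""
    def __init__(self, num_queries: int, tau: float = 0.01):
        super().__init__()
        self.lam = nn.Parameter(torch.full((num_queries, 1), 0.0))  # λ ∈ R^{N×1}
        self.tau = tau

    def forward(self, Q_t, f_v, F_t):
        # Q_t: (N, D) temporal queries, f_v: (D,) video feature, F_t: (C, D) text features
        lam = self.lam.sigmoid()                       # assumed constraint to keep λ in (0, 1)
        F_v = f_v.unsqueeze(0).expand_as(Q_t)          # repeat the video feature N times
        fused = lam * F_v + (1.0 - lam) * Q_t          # dynamic feature fusion
        fused = F.normalize(fused, dim=-1)
        F_t = F.normalize(F_t, dim=-1)
        logits = fused @ F_t.T / self.tau              # (N, C) cosine similarities
        return logits.softmax(dim=-1)                  # per-query action probabilities

# toy usage with hypothetical sizes (N=100 queries, D=512, C=24 classes)
probs = DynamicallyFusedAlignment(100)(torch.randn(100, 512),
                                       torch.randn(512), torch.randn(24, 512))
```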
8.3.5 Training and Inference In training, for action localization, we adopt the regular set matching loss following the DETR literature [24, 271, 82]: L𝑠𝑒𝑡 = L𝑏𝑐𝑒 + L𝐿1 + L𝑔𝑖𝑜𝑢, where L𝑏𝑐𝑒 is a binary cross-entropy loss for person score prediction, L𝐿1 and L𝑔𝑖𝑜𝑢 are the coordinate distance and GIoU distance [269] between predicted and ground truth boxes, respectively. Then, we use the Hungarian matching [24] to find the optimal bipartite matching between the predicted and ground truth boxes for each video. For action recognition, we use a multi-class cross-entropy loss L𝑎𝑐𝑡 so that the total loss for training 156 is L𝑡𝑜𝑡𝑎𝑙 = 𝑤1L𝑠𝑒𝑡 + 𝑤2L𝑎𝑐𝑡 where the hyperparameters 𝑤1 and 𝑤2 are used to balance between the two subtasks. During inference, the thresholded person scores determine the kept person boxes, while the action scores assign the action categories to boxes from input class categories. 8.4 Experiments 8.4.1 Experimental Setup Datasets. Our method is implemented on two commonly-used action detection datasets, i.e., , J-HMDB [124] and UCF101-24 [296]. J-HMDB dataset contains per-frame annotated bounding boxes of persons along with 21 action classes. Similar to [132, 117], with 50% of actions as the novel classes, we randomly split it into 10 base classes for training and 11 novel classes for open- world testing, which results in 10,570 training samples and 9,139 testing samples. UCF101-24 dataset is a subset of UCF101 [296]. It is also per-frame annotated for action detection and contains 24 action classes. With the same 50% splitting strategy, we split it into 12 base classes for training and 12 novel classes for open-world testing. Evaluation criteria. Following the standard paradigm in action detection literature [150, 72, 38, 398, 347], the model performance is evaluated by video mAP. It evaluates the spatiotemporal action tubes of the detected bounding boxes over the classification and 3D intersection-over-union (IoU). Following [72], the 3DIoU threshold is set to 0.5 for J-HMDB and 0.2 for UCF101-24, respectively. Implementation details. We experiment with two VLMs including the image pre-trained OpenAI CLIP [261] and video pre-trained CLIP-ViP [357]. We use the same ViT-B/16 architecture for the two VLMs. The VLMs are kept frozen in training. For the image CLIP, we get video-level semantic features by temporal mean pooling. We obtain the patch token features of the last ViT layer and use them to construct the 4D pyramid feature V by multi-scale residual convolutions. By default, we set the number of queries and OMB stages to 100 and 3, respectively. In training, we set the mini-batch size to 16 and frame sampling by 16 × 1. The weight of the set loss L𝑠𝑒𝑡 and action loss L𝑎𝑐𝑡 are set to 2.0 and 48.0, respectively. Following [24, 38, 347], each intermediate stage is individually supervised by the loss L𝑠𝑒𝑡 and L𝑎𝑐𝑡. We set the base learning rate to 1e-5 157 Table 8.1 OVAD results. For all methods use the same pre-trained CLIP-ViP [357] as the frozen VLM and evaluated by video mAP. For the ZSR+ZSL setting, we use Mask RCNN [105] as the ZSL person localizer and use either handcrafted (HC) or GPT-generated (GPT) prompts for either video- or region-level zero-shot recognition. Settings Models ZSR+ZSL Region + GPT Video + HC Video + GPT ZSR+TL STMixer [347] Spatial OpenMixer E2E STMixer [347] STMixer [347] (w. CoOp [409]) OpenMixer (w. 
CoOp [409]) OpenMixer J-HMDB UCF101-24 Base Novel Mean 19.92 33.51 30.06 31.04 58.51 49.89 35.01 68.66 64.61 Base Novel 18.29 21.54 30.64 31.43 35.43 34.59 58.27 68.04 73.06 75.66 94.18 90.75 68.31 79.53 27.44 11.91 80.20 82.33 36.66 40.32 33.72 36.12 45.11 47.71 45.26 48.80 60.91 60.42 62.48 61.18 28.07 31.85 6.54 11.81 27.75 34.23 Mean 31.86 54.40 66.73 63.53 74.06 49.16 42.27 86.86 86.34 and use the AdamW [211] optimizer to train models for 12 epochs on 4 RTX 6000Ada or 2 A100 (80G) GPUs. In testing, the person detection threshold is set to 0.6. We individually test the base and novel classes and report their video-mAP results and the mean on all categories. Our model inference speed is 0.23 s/video per A6000 GPU, with 587M parameters based on CLIP-ViP/B16 VLM. OVAD task settings. To benchmark methods on OVAD task, three settings are presented consid- ering if the localization and classification are trained or not. • ZSR+ZSL (zero-shot action recognition and actor localization): without any training, we only use pre-trained person detectors such as Mask RCNN [105] to detect persons, and use pre-trained video VLMs such as CLIP-ViP [357] to perform region- or video-level open- vocabulary recognition. • ZSR+TL (zero-shot action recognition and trainable actor localization): we use pre-trained CLIP-ViP [357] to perform video-level action recognition while training the localization modules to detect persons on the training set. • E2E (end-to-end learning): we train and test models in an end-to-end way by using raw 158 videos and vocabulary as input. In this setting, with the same CLIP-ViP [357] backbone, we compare with STMixer [347] using different prompting methods, i.e., , handcraft prompts (HC), soft prompt by CoOp [409], and our recommended GPT-generated prompts (GPT). 8.4.2 Comparative Results The main results are reported in Table 8.1. To analyze the baseline performance, we summarize the discussion below. Zero-shot recognition and localization. In the ZSR+ZSL setting, the findings are as follows. First, region-level features (the 1st row) by RoIAlign [105] perform significantly worse than the video-level features (the 3rd row). This indicates that the RoI-cropped features from VLM suffers from a large representation gap between the video-level pre-training and downstream region- level recognition. Second, the descriptive GPT-generated prompts (the 3rd row) achieve a better performance than the handcrafted (HC) prompt such as the “a video of person [CLS]" (the 2nd row). This can be explained by the more transferable knowledge in the GPT prompts than the handcraft ones. Table 8.2 Zero-shot action detection. Following the same 75%-25% split as [117], we report both the frame- and video-level mAP (f-mAP and v-mAP) on novel classes of J-HMDB. Models iCLIP [117] OpenMixer f-mAP 65.41 77.06 v-mAP – 81.20 Zero-shot recognition with learnable localization. Under the ZSR+TL setting, we observe a significant superiority of Spatial OpenMixer to the STMixer baseline, with more than 10% performance gain on the J-HMDB dataset. Since the training is only encouraged to localize actors in videos, the outperformance suggests a good exploitation of the localizability in pre-trained VLMs. End-to-end learnable OVAD. For the E2E setting, the OpenMixer (the last row) outperforms the simple STMixer baseline (STMixer+VLM) by large margins, with 7.69% and 54.89% of video mAP gains on base and novel categories of the J-HMDB dataset, respectively. 
Besides, we explored the widely-used VLM adaptation method CoOp [409], which optimizes the context of class names, i.e., prompt tuning. From Table 8.1, we observe that CoOp improves the base class performance at the sacrifice of the novel classes, while the GPT-prompted OpenMixer achieves much better performance on novel classes. Lastly, we notice the relatively smaller numbers on UCF101-24 than those on J-HMDB. This reflects the challenging aspects of the UCF101-24 dataset, such as the long duration (∼10× longer), heavy background bias, and multi-person scenarios.

Zero-shot action detection. We note that iCLIP [117] defines the zero-shot action detection (ZSAD) task, which is different from our OVAD task. ZSAD only cares about the samples from novel classes, while OVAD values both the base and novel classes. Therefore, ZSAD uses all samples from base classes in training and only tests on novel classes. Following the same settings as iCLIP, the results in Table 8.2 show that our method achieves much better performance than iCLIP, even though iCLIP relies on pre-detected person boxes from YOWO [150].

8.4.3 Ablation Study

In this section, we analyze the properties of the OpenMixer model on the J-HMDB dataset. Results of the component-wise ablation are reported in Table 8.3. It shows that all three components work well. Specifically, without the S-OMB, which means the attentional location prior is removed, the performance drops significantly, especially for the novel classes. If the DFA is removed, we only use the pre-trained VLM feature for zero-shot recognition, and the base class performance is the worst. Without the T-OMB, which means the semantic condition is removed and both spatial and temporal queries are used for recognition, it shows a decrease of 4.74% and 1.15% on base and novel actions, respectively.

Table 8.3 Ablation study. We show the contribution of each proposed component.
S-OMB | DFA | T-OMB | Mean | Base | Novel
✗ | ✓ | ✓ | 81.77 | 86.32 | 77.64
✓ | ✗ | ✓ | 74.06 | 68.04 | 79.53
✓ | ✓ | ✗ | 83.47 | 86.01 | 81.18
✓ | ✓ | ✓ | 86.34 | 90.75 | 82.33

Table 8.4 Results of query conditions. The post/pre and TQ/SQ mean that the conditional feature (from the video f𝑣 or from the text f𝑡) is placed after/before the Q-V mixing on the temporal queries (TQ) or spatial queries (SQ).
Methods | Queries | Modalities | Mean | Base | Novel
w/o condition | – | – | 83.99 | 85.86 | 82.28
post | TQ | video f𝑣 | 85.48 | 88.74 | 82.52
pre | TQ, SQ | video f𝑣 | 85.66 | 90.29 | 81.45
pre | TQ | text f𝑡 | 76.36 | 70.25 | 81.92
pre (ours) | TQ | video f𝑣 | 86.34 | 90.75 | 82.33

Query condition strategies. Specific to the T-OMB, we further investigate different alternatives of query condition strategies in Table 8.4, which shows the following observations: (1) Without any condition, the model performs worse on the base classes, with a 4.89% mAP drop. (2) The pre-condition performs much better than the post-condition on base classes (+2%), with a negligible performance drop on the novel classes (−0.19%). (3) The additional condition on the spatial queries (SQ) hurts the performance on both the base and novel classes, because this essentially makes the recognition and localization entangled in training.
(4) When using the text feature f𝑡 as the condition, the base class performance significantly decreases (−20.50%) and the novel class performance also decreases a bit. This is due to the large semantic gap between the text feature f𝑡 and the patch-wise video token features V, suggesting that the test-time adaptive f𝑣 is preferable even though f𝑣 and f𝑡 are semantically aligned.

Table 8.5 Results of fusion strategies. We explored different strategies to fuse the pre-trained F𝑣 and the learnable queries (ˆQ𝑡 or ˆQ𝑠) within our DFA module.
Methods | Mean | Base | Novel
w/o F𝑣 (𝝀 = 0) | 68.84 | 88.94 | 50.58
w/o ˆQ𝑡 (𝝀 = 1) | 74.06 | 68.04 | 79.53
w/o dynamics (𝝀 = 0.5) | 51.48 | 63.08 | 40.93
use ˆQ𝑠 by concat. & MLP | 85.51 | 89.19 | 82.17
DFA (ours) | 86.34 | 90.75 | 82.33

Feature fusion strategies. To validate the design choice of our DFA module, we explored different feature fusion strategies, as shown in Table 8.5. The results show that only using the learned query feature ˆQ𝑡 (the 1st row) performs much worse on the novel classes, indicating a loss of generalization. If only using the pre-trained feature F𝑣 (the 2nd row), the model cannot work well on the base classes, which indicates under-fitting to the task. If fusing the features by simple averaging, the performance still lags behind ours, as it is not adaptive to the variety in queries. Moreover, we notice that [347] uses both spatial and temporal queries for recognition by MLP layers. Thus, we additionally include the spatial queries ˆQ𝑠 by concatenation with the temporal queries ˆQ𝑡 and use MLP layers for dimension reduction. We observe a performance drop, which can be explained by the MLP layers eliminating the benefits of the semantic conditions and making localization and recognition entangled in training.

Figure 8.3 Hyperparameters. We show the video mAP with respect to different numbers of learnable queries (a) and OMB stages (b).

Number of queries and OMB stages. In Fig. 8.3a and 8.3b, we show that using 100 queries and 3 OMB stages achieves the best average mAP. The figures also indicate that the number of OMB stages is more important than the number of queries, as the bipartite matching could handle the redundant queries in training. The decreasing trend with more than three OMB stages can be attributed to the risk of overfitting to the training data.

Impact of VLMs. We note that there is a line of literature [332, 265, 132, 236, 204] built on image CLIP for open-vocabulary video understanding. Therefore, it is interesting to see whether image CLIP also works for the OVAD task. In Table 8.6, we compare OpenMixer with its variants using the video-based CLIP-ViP [357] and the image-based CLIP [261] under the same ViT-B/16 architecture. The results show that OpenMixer with CLIP performs far worse than the model with CLIP-ViP, because of the limited capacity of image CLIP in capturing video actions.

Table 8.6 Effect of VLMs. We implement the OpenMixer by CLIP-ViP and CLIP with the same ViT-B/16 transformer architecture.
VLMs | Modality | Mean | Base | Novel
CLIP [261] | image | 71.60 | 79.46 | 64.44
CLIP-ViP [357] | video | 86.34 | 90.75 | 82.33

Table 8.7 GPT helps temporal localization. We compute mAP using only the temporal IoU on the J-HMDB dataset.
Methods | Mean(t) | Base(t) | Novel(t)
w/o GPT | 83.57 | 90.74 | 77.06
w. GPT | 91.62 | 93.63 | 89.79

Figure 8.4 Unseen Action Detection. We visualize our OpenMixer detections (in blue) and ground truth (in yellow) on two representative videos from novel classes (kick ball and pull up), sampled at timestamps from 𝑡/𝑇 = 1/6 to 𝑡/𝑇 = 1. The numbers after class names are confidence scores.

Impact of person detectors.
In Table 8.8, we compare the impact of using external person boxes from off-the-shelf person detectors, i.e., , G-DINO [206] and Mask RCNN [105], in test time on the two best-performed models under the ZSR+ZSL and E2E settings, respectively. It shows that the high-quality boxes from G-DINO could consistently outperform those from Mask RCNN. With the same external test-time boxes, the results of OpenMixer model are consistently better than those of the strongest ZSR+ZSL baseline (Video+GPT). The relatively smaller gains on UCF101-24 than the gains on J-HMDB can be explained by the background bias in UCF videos that restricts VLMs in action recognition. Can GPT help temporal localization? This question is interesting as how textual prompts from language models like GPT could help temporal localization has not been explored in literature. In Table 8.7, we show that by evaluating the temporal action localization performance, GPT prompts 163 Table 8.8 Impact of person detectors. For E2E setting, predicted boxes from OpenMixer are replaced with boxes from Mask RCNN [105] or G-DINO [206], and their classification scores are assigned by maximum IoU with OpenMixer boxes that have scores. Models Test-time person boxes Video + GPT (ZSR+ZSL) MaskRCNN [105] G-DINO [206] OpenMixer (E2E) Mask RCNN [105] G-DINO [206] J-HMDB UCF101-24 Base Novel Mean 35.01 68.66 64.61 45.43 72.12 67.09 Base Novel 35.43 34.59 46.04 44.82 87.45 87.76 79.92 82.60 42.31 46.56 48.48 47.00 36.13 46.11 Mean 66.73 69.72 83.51 85.06 could significantly help. Qualitative results. We visualize results on the J-HMDB novel categories in Fig. 8.4. They show that OpenMixer could precisely localize and confidently recognize those unseen actions, even though there are multiple persons. 8.5 Conclusion We present the first work that addresses the open-vocabulary action detection (OVAD) problem. The key challenges are identified as two aspects. The first is how to transfer the semantic knowledge of human motion dynamics from video-level pre-trained VLMs to the spatiotemporal region-level actors in videos for recognizing both seen and unseen actions. The second is how to exploit the prior knowledge of actor location from pre-trained VLMs for action localization. In this paper, we tackle the challenges by developing a query-based detection transformer, OpenMixer, that fully exploits the semantics and localizability of VLMs for action recognition and localization, respectively. Furthermore, we build the first OVAD benchmark that extensively evaluates baselines and our model under various settings, showing the superiority of the OpenMixer while revealing open research questions to explore in the future. 8.6 Supplementary Material 8.6.1 Prompts for Query-Text Alignment To generate text prompts for each action category, we send a request to GPT [248] by us- ing the template: “For the action type {CLS}, what are the visual descriptions? Please respond with a list of 16 short sentences." where the placeholder “{CLS}" 164 is replaced by the action class name from the vocabulary. Thus, we obtained multiple caption-like sentence descriptions of the action. Eventually, the text feature for each class is computed by mean pooling of features from the VLM text encoder given the text prompts. 8.6.2 Explanation of the Reversed Attention As discussed in the main paper, the seemly counterintuitive phenomenon of reversed visual-text attention has been studied in [179, 178] and we also observed this in our video-based experiments. 
For CLIP-based models, the [CLS] token in ViT is aligned to the text semantics so that its attention weight corresponds to the foreground, while the remaining 𝐿 visual token weights are complementary after the softmax over 𝐿 + 1 tokens before attention pooling. Therefore, due to the attention pooling, a high similarity between the text feature (or the visual [CLS] token feature) and the 𝐿 visual tokens could indicate the background.

8.6.3 Implementation Details

Positional Embedding Interpolation. When using the pre-trained VLM without fine-tuning, an immediate challenge is that the input videos have different spatiotemporal resolutions from the data in VLM pre-training. For example, CLIP-ViP is pre-trained on input videos of size 12 × 224 × 224, while videos from J-HMDB can be in any resolution after random augmentations in training. A simple solution is to resize the input video to match the pre-trained size, but for the action detection subtask, person localization is sensitive to the input resolution. To handle this challenge, we instead keep the raw resolution as input but interpolate the pre-trained spatial and temporal positional embeddings. For example, given the CLIP-ViP B16 VLM and an input video of size 𝑇 × 𝐻 × 𝑊, we interpolate the 12 temporal positional embeddings PE𝑡 ∈ R^{12×𝐷} to P̂E𝑡 ∈ R^{𝑇×𝐷}, and interpolate the 196 (= 224/16 × 224/16) spatial positional embeddings PE𝑠 ∈ R^{196×𝐷} to P̂E𝑠 ∈ R^{𝐿×𝐷}, where 𝐿 = 𝐻/16 × 𝑊/16. This technique is found useful for the action detection problem.

4D Feature Pyramid. Following the line of detection literature [176, 347], the pre-trained patch token features are transformed into a 4D feature pyramid before the detection head. Let H ∈ R^{ℎ×𝑤×𝑇×𝐷} be the pre-trained patch token features from the VLM video encoder, where ℎ × 𝑤 is the number of patches for each frame, 𝑇 is the number of video frames, and 𝐷 is the Transformer dimension. We use deconvolution or convolution to produce hierarchical feature maps ˜H^(𝑙) by spatial strides 𝑠^(𝑙) ∈ {1/4, 1/2, 1, 2}, where the fractional strides are deconvolutional strides and 𝑙 indexes the pyramid level. Different from [176, 347], which fully fine-tune the visual encoder, our VLM visual encoder has to be frozen. Therefore, to allow the pre-trained features to be better utilized by the OpenMixer head, we propose to add a residual connection at each level of the 4D feature pyramid by spatial interpolation: ˆH^(𝑙) = 𝜙(H, 𝑠^(𝑙)) + ˜H^(𝑙). The function 𝜙 spatially interpolates the feature map from size ℎ × 𝑤 to the same resolution as ˜H^(𝑙).

8.6.4 Results on Different Splits

We experiment with five random 50%-50% seen-unseen class splits on both the J-HMDB and UCF101-24 datasets. Full results of video mAP are summarized in Table 8.9 and 8.10.

Table 8.9 Results on 50%-50% J-HMDB (five random splits; the last column is the average).
Mean | 86.34 86.29 85.50 86.73 83.40 | 85.65
Base | 90.75 89.89 89.20 87.70 85.36 | 88.58
Novel | 82.33 83.02 82.13 85.85 81.61 | 82.99

Table 8.10 Results on 50%-50% UCF101-24 (five random splits; the last column is the average).
Mean | 46.42 46.28 45.45 47.32 48.30 | 46.75
Base | 59.10 61.11 55.85 62.33 61.25 | 59.93
Novel | 33.73 31.45 35.05 32.31 35.34 | 33.58

Table 8.11 Results on 75%-25% J-HMDB (five random splits; the last column is the average).
Mean | 75.96 79.43 79.77 81.88 86.56 | 80.72
Base | 74.73 75.21 78.34 82.14 85.46 | 79.17
Novel | 79.03 89.98 83.34 81.23 89.30 | 84.57

Table 8.12 Results on 75%-25% UCF101-24 (five random splits; the last column is the average).
Mean | 55.78 55.83 57.04 57.19 61.85 | 57.54
Base | 64.85 61.83 60.16 58.74 61.82 | 61.48
Novel | 28.55 37.80 47.69 52.55 61.96 | 45.71
The split (0) is used in all experiments of the main paper. We also experiment with five random 75%-25% seen-unseen class splits on the two datasets, and report results in Table 8.11 and 8.12. As some of human actions are much harder to detect than others and they could be included in either base or novel categories, it is normal that the overall performances on different splits vary significantly. Following the existing literature, we will release all splits. 8.6.5 Limitations and Future Work As indicated in existing literature, the recent large-scale action detection dataset AVA [98] is not included in this paper, as the AVA human actions are in fine granularity and the person boxes 166 are annotated with multi-label actions. This raises new challenges when adapting video-language foundation models for multi-label action detection problems, which are out of the scope in this paper. We note the recent success of multi-modal LLMs [200] that uses LLM to re-formulate down- stream tasks as a unified generative token prediction problem, which points out a promising direction toward the OVAD task in the future. 167 CHAPTER 9 CONCLUSIONS AND DISCUSSIONS 9.1 Contribution Summary In this dissertation, we have made several attempts to endow AI systems to learn from an open-ended visual world. These attempts could handle the major challenges of open-world visual understanding problems, including the open-world visual forecasting that has only limited temporal observations to predict the unseen future or in an unseen environment, open-world visual recog- nition that data from unknown categories could exist in testing, and open-world vision-language understanding to recognize the unknown from language queries. Exploring these problems is valu- able in real-world practice, as human-level intelligence cannot be achieved without the capability of forecasting and detecting the unseen. For this goal, our contributions are summarized below. First, we empirically found the practicality of Bayesian uncertainty in real-world visual fore- casting applications, i.e., the epistemic (model) uncertainty for traffic accident anticipation (Sec. 2) and the aleatoric (data) uncertainty for egocentric 3D hand trajectory prediction (Sec. 4). The deep learning uncertainties, on the one hand, lead to theoretically principled ways to regularize model learning on real-world data, on the other hand, provide trustworthy confidence in downstream decision-making in robotic systems. Moreover, beyond the supervised learning paradigm, we ex- plored deep reinforcement learning (DRL) in traffic accident anticipation (Sec. 3), which naturally mimics the dynamic Markov decision process of human observation and forecasting, resulting in visually explainable and best-performed accident anticipation since unseen driving distractors in an open-world can be suppressed. This indicates that DRL is not only applicable to the extensively studied topics in robotics such as planning and control, but is also potential for video understanding. Second, we are excited to have found that evidential deep learning (EDL) is powerful in detecting unknown human actions in videos. We technically contributed to several first works such as the open-set action recognition (Sec. 5) and the open-set temporal action localization (Sec. 6). 
The EDL method is successful because it provides a general Dirichlet assumption on classification problems such that multi-dimensions of classification uncertainties can be applied to regularize the 168 model learning accordingly in diverse downstream tasks. A broader insight of this line of research is that the open-set learning on videos cannot be simply treated the same as the learning on images, due to the implicit challenges and vital importance of modeling the temporal dynamics for complex action understanding. Lastly, we explore more complex open-world visual understanding problems by vision-language modeling. For image understanding, we contribute to the well-defined compositional zero-shot learning (CZSL) field that aims to recognize unseen compositional concepts from images in an open world. We found that the pre-trained CLIP model can be significantly enhanced for the CZSL task by leveraging LLM-generated compositional class descriptions. For video understanding, we go beyond the traditional closed-set action detection but formulate the first open-vocabulary action detection (OVAD) work that could detect any human actions in videos from an open-ended action vocabulary. On the technical side, we empirically found effective ways to fully utilize the semantics and localizability priors of a video-based CLIP model and LLM descriptions for the OVAD task. In addition to these two topics, there are many other research problems for open-world vision-language understanding to be explored in the future. 9.2 Limitation Discussion In this part, some unsolved challenges of this dissertation are discussed. We hope these discussions are useful to inspire future following works. (a) 3D motions and simulation for traffic accident anticipation. In Sec. 2, though it is interesting to formulate the spatio-temporal dynamics of a traffic scene by a sequence of object graphs, the underlying relation between traffic object nodes is from the 2D visual plane. However, in a real-world traffic scenario, the risk of a future accident could be better predicted from the relational motion between cars/pedestrians in a 3D physical world. Moreover, for the ego-vehicle involved accidents in Sec. 2 and Sec. 3, the ego-motion of the camera in 3D space plays an important role in accident forecasting, because an abnormal ego-motion intuitively indicates an out-of-control driving behavior. However, introducing 3D motions of traffic objects or ego-camera is practically infeasible, because the pose information of camera and cars is missing in existing accident video 169 datasets and it can hardly be collected in a real accident, i.e, we cannot take the risk of human lives to collect 3D annotation data. Fortunately, the recent advances in traffic simulation provide the potential. We could use 3D reconstruction from videos/images to represent the traffic objects in 3D [314, 313], import them into traffic simulators such as the CARLA [66] and MetaDrive [173], and simulate any kinds and any amount of traffic accident video data in anywhere. Moreover, through the traffic accident simulation, more data modalities can be created in addition to the visual data, such as high-definition (HD) maps, birds-eye-view (BEV) maps, and IMU sensor data. These modalities are commonly used in autonomous driving, while to the best of our knowledge, they have never been explored for accident anticipation. (b) Unbiased causal representation learning for open-world vision. 
(b) Unbiased causal representation learning for open-world vision. As evidenced by much of the existing literature [87, 122, 42, 183, 139], visual understanding models guided by empirical risk minimization (ERM) suffer from spurious correlations, such that the learned visual features are biased toward confounding factors, e.g., the background in images, appearance in videos, co-occurrence between visual objects, etc. This may not be harmful in a closed-world environment, where the goal is good fitting performance. But when the model is deployed in an open world where confounding factors exist in unknown data, the model performance drops dramatically. We have noticed this issue in our work, as evidenced in Sec. 5. Such an issue also exists in the accident anticipation task. The anticipation model could be biased by visual backgrounds, since the backgrounds are typically different for accident and normal videos due to different data collection conditions, e.g., road, weather, and city. To achieve debiased open-set video recognition, recent work [389] introduces an adversarial scene reconstruction objective. However, there could be many more tools, e.g., the structural causal model [364], that can be explored for open-world visual understanding.

(c) Grounded language descriptions for vision-language understanding. In our previous works in Sec. 7 and Sec. 8, the LLM-generated class descriptions are class-level free-form sentences, without grounding to the instance-level images or videos. A potential limitation is that the model could heavily rely on the quality of the LLM descriptions. Moreover, performing prompt engineering to generate the desired class descriptions could be heuristic in practice. Inspired by recent image captioning models, a better approach could be to first caption the image or video data to obtain instance-level descriptive texts that are grounded in the visual data. Then, an LLM [247, 324] or multi-modal LLM [201] can be used to summarize the generated texts into class-level descriptions. The caption-and-summarization scheme takes advantage of grounded language descriptions so that downstream vision-language adaptation will be more robust and generalizable for open-world visual understanding.
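To make the caption-and-summarization scheme concrete, here is a minimal sketch under stated assumptions: it uses an off-the-shelf BLIP captioner from HuggingFace transformers for the grounding step (the model choice is only illustrative), and the final summarization call (`call_your_llm`) is left as a hypothetical placeholder, since any instruction-following LLM [247, 324] could fill that role. This illustrates the proposed pipeline rather than the implementations used in Chapters 7 and 8.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Off-the-shelf image captioner; any image or video captioning model could be substituted.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(path: str) -> str:
    """Produce an instance-level description grounded in the actual image content."""
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

def class_summary_prompt(class_name: str, captions: list[str]) -> str:
    """Build the prompt that asks an LLM to distill a class-level description."""
    bullet_list = "\n".join(f"- {c}" for c in captions)
    return (
        f"The following captions describe images of the class '{class_name}':\n"
        f"{bullet_list}\n"
        "Summarize the shared visual characteristics of this class in one sentence."
    )

# Usage sketch (call_your_llm is a hypothetical LLM wrapper, not a real API):
#   captions = [caption_image(p) for p in image_paths_of_class]
#   description = call_your_llm(class_summary_prompt("sliced apple", captions))
```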
9.3 Future Work

To summarize, my existing research serves as a few early steps toward the long-term goal of visual intelligence in the 3D open world. There are many topics, such as open-world 3D perception, prediction, and reasoning, which are indispensable for open visual intelligence but remain under-explored so far. In this section, I outline two major directions that are interesting, valuable, and promising.

4D open visual understanding. Existing research efforts in 3D vision and open-world learning have been studied individually for a long period. However, it would be more practical to apply AI systems in a 4D open environment where both the 3D space and 1D time matter for forecasting, perception, and understanding, as shown in Fig. 9.1. For example, without knowing the 4D human poses, e.g., the time series of 3D human poses, existing open-set video models can hardly tell whether a risky action of falling down is unknown or not if the model has been learned on a known action of sitting down. This is because the two types of actions show very similar visual appearances in videos from a single-view camera. Instead, with additional 3D depth or multi-view sensors introduced, the captured 4D human poses could distinguish between the known and unknown actions. As shown in Fig. 9.1, the research in the dissertation chapters has explored many of the 2D and 3D tasks in an open world, which sheds light on the 4D open visual world in the future, especially for perception and advanced visual understanding tasks.

Figure 9.1 4D Visual Understanding.

Generative vision-language reasoning. Though existing foundation models [261, 336] have shown superior performance, they are still limited in open-world applications that require complex visual reasoning [323]. Revisiting our dissertation works [317, 11, 311, 312], the model capabilities are still far from the human-level reasoning that is commonly required in video applications. Therefore, it will be important to explore how to learn the temporal causality of human activities in long-form videos in the future. My recent vision-language understanding work [39], which explores procedure step localization in long-form videos, sets a starting point in this direction. Moreover, existing advances in LLMs [247, 324], multi-modal LLMs [104, 201, 172], and diffusion models [112, 294, 273] have shown that generative modeling by next-token prediction or diffusion denoising has the potential to unify visual understanding and visual generation [129, 307]. Also, the latent features of a visual generative model are semantically editable and controllable, which benefits visual recognition [167, 94] and visual reasoning [100, 40, 175]. Motivated by these advances, as summarized in Fig. 9.2, as well as my ongoing generative modeling work [319] and the collaboration on MLLMs [180], building an open-world vision-language reasoning system with generative models will be a promising future direction.

Figure 9.2 Generative Vision-Language Reasoning.

9.4 List of Doctoral Works

In this section, the peer-reviewed publications and ongoing works or pre-prints during my Ph.D. period (2019.08 - 2024.07) are listed below for reference.

Peer-reviewed Publications:

1. Wentao Bao, Lichang Chen, Heng Huang, Yu Kong, "Prompting Language-Informed Distribution for Compositional Zero-Shot Learning," in European Conference on Computer Vision (ECCV), 2024.

2. Yifan Li, Anh Dao, Wentao Bao, Zhen Tan, Tianlong Chen, Huan Liu, Yu Kong, "Facial Behavior Analysis with Instruction Tuning," in European Conference on Computer Vision (ECCV), 2024.

3. Yuxiao Chen, Kai Li, Wentao Bao, Deep Patel, Yu Kong, Martin Renqiang Min, Dimitris N. Metaxas, "Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment," in European Conference on Computer Vision (ECCV), 2024.

4. Wentao Bao, Lele Chen, Libing Zeng, Zhong Li, Yi Xu, Junsong Yuan, Yu Kong, "Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting," in International Conference on Computer Vision (ICCV), 2023.

5. Libing Zeng, Lele Chen, Wentao Bao, Zhong Li, Yi Xu, Junsong Yuan, Nima Khademi Kalantari, "3D-aware Facial Landmark Detection via Multiview Consistent Training on Synthetic Data," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

6. Yuansheng Zhu, Wentao Bao, Qi Yu, "Towards Open Set Video Anomaly Detection," in European Conference on Computer Vision (ECCV), 2022.

7. Wentao Bao, Qi Yu, Yu Kong, "OpenTAL: Towards Open Set Temporal Action Localization," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022 (Oral).
8. Xinmiao Lin, Wentao Bao, Matthew Wright, Yu Kong, "Gradient Frequency Modulation for Visually Explaining Video Understanding Models," in British Machine Vision Conference (BMVC), 2021.

9. Wentao Bao, Qi Yu, Yu Kong, "Evidential Deep Learning for Open Set Action Recognition," in International Conference on Computer Vision (ICCV), 2021 (Oral).

10. Wentao Bao, Qi Yu, Yu Kong, "DRIVE: Deep Reinforced Accident Anticipation with Visual Explanation," in International Conference on Computer Vision (ICCV), 2021.

11. Xiwen Dengxiong, Wentao Bao, Yu Kong, "Multiple Instance Relational Learning for Video Anomaly Detection," in International Joint Conference on Neural Networks (IJCNN), 2021.

12. Wentao Bao, Qi Yu, Yu Kong, "Uncertainty-based Traffic Accident Anticipation with Spatio-Temporal Relational Learning," in 28th ACM International Conference on Multimedia (MM), 2020.

13. Junwen Chen, Wentao Bao, Yu Kong, "Activity-driven Weakly-Supervised Spatio-Temporal Grounding from Untrimmed Videos," in 28th ACM International Conference on Multimedia (MM), 2020.

14. Hanbin Hong, Wentao Bao, Yuan Hong, Yu Kong, "Privacy Attributes-aware Message Passing Neural Network for Visual Privacy Attributes Classification," in International Conference on Pattern Recognition (ICPR), 2020.

15. Junwen Chen, Wentao Bao, Yu Kong, "Group Activity Prediction with Sequential Relational Anticipation Model," in European Conference on Computer Vision (ECCV), 2020.

16. Wentao Bao, Qi Yu, Yu Kong, "Object-Aware Centroid Voting for Monocular 3D Object Detection," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.

Ongoing submissions and pre-prints:

1. Wentao Bao, Kai Li, Yuxiao Chen, Deep Patel, Martin Renqiang Min, Yu Kong, "Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection," (in submission), 2024.

2. Wentao Bao, Qi Yu, Yu Kong, "Latent Space Energy-based Model for Fine-grained Open Set Recognition," arXiv preprint, arXiv:2309.10711 (in submission), 2024.

3. Suhan Park, Wentao Bao, Saniat Sohrawardi, Matthew Wright, Yu Kong, "Open-Set Deepfake Detection by Evidential Deep Learning," (in submission), 2024.

4. Yuansheng Zhu, Md Abdullah Al Forhad, Wentao Bao, Weishi Shi, Yu Kong, Qi Yu, "Taking No Shortcuts in Lifelong Learning by Following Mixture of Local Experts," (in submission), 2024.

5. Xinmiao Lin, Wentao Bao, Qi Yu, Yu Kong, "On Model Explanations with Transferable Neural Pathways," arXiv preprint, arXiv:2309.09887, 2023.

BIBLIOGRAPHY

[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.

[2] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social lstm: Human trajectory prediction in crowded spaces. In CVPR, 2016.

[3] Stefano Alletto, Andrea Palazzi, Francesco Solera, Simone Calderara, and Rita Cucchiara. DR(eye)VE: a dataset for attention-based tasks with applications to autonomous and assisted driving. In CVPR Workshop, 2016.

[4] Alexander Amini, Wilko Schwarting, Ava Soleimany, and Daniela Rus. Deep evidential regression. In NeurIPS, 2020.

[5] Yuval Atzmon, Felix Kreuk, Uri Shalit, and Gal Chechik. A causal view of compositional zero-shot recognition. In Adv. Neural Inform. Process. Syst., 2020.

[6] Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh.
Learning de-biased representations with biased representations. In ICML, 2020. Yueran Bai, Yingying Wang, Yunhai Tong, Yang Yang, Qiyue Liu, and Junhui Liu. Boundary content graph neural network for temporal action proposal generation. In ECCV, 2020. Duhyeon Bang, Kyungjune Baek, Jiwoo Kim, Yunho Jeon, Jin-Hwa Kim, Jiwon Kim, Jongwuk Lee, and Hyunjung Shim. Logit mixing training for more reliable and accurate prediction. In IJCAI, 2022. Hanoona Bangalath, Muhammad Maaz, Muhammad Uzair Khattak, Salman H Khan, and Fahad Shahbaz Khan. Bridging the gap between object and image-level representations for open-vocabulary detection. In NIPS, 2022. [10] Wentao Bao, Qi Yu, and Yu Kong. Uncertainty-based traffic accident anticipation with spatio-temporal relational learning. In ACM MM, 2020. [11] Wentao Bao, Qi Yu, and Yu Kong. Opental: Towards open set temporal action localization. In CVPR, pages 2979–2989, 2022. [12] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006. Proceedings, Part I 9, pages 404–417. Springer, 2006. [13] Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. In NeurIPS 176 Workshop, 2014. [14] Jessa Bekker and Jesse Davis. Learning from positive and unlabeled data: A survey. Machine Learning, 109(4):719–760, 2020. [15] Abhijit Bendale and Terrance Boult. Towards open world recognition. In CVPR, 2015. [16] Abhijit Bendale and Terrance E. Boult. Towards open set deep networks. In CVPR, 2016. [17] Huikun Bi, Ruisi Zhang, Tianlu Mao, Zhigang Deng, and Zhaoqi Wang. How can I see my future? FvTraj: Using first-person view for pedestrian trajectory prediction. In ECCV, 2020. [18] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncer- tainty in neural networks. In International Conference on Machine Learning, 2015. [19] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-recurrent neural networks. In International Conference on Learning Representations, 2017. [20] Shyamal Buch, Victor Escorcia, Bernard Ghanem, Li Fei-Fei, and Juan Carlos Niebles. End-to-end, single-stream temporal action detection in untrimmed videos. In BMVC, 2017. [21] Pau Panareda Busto, Ahsan Iqbal, and Juergen Gall. Open set domain adaptation for image and action recognition. IEEE transactions on pattern analysis and machine intelligence, 42(2):413–429, 2018. [22] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activi- tynet: A large-scale video benchmark for human activity understanding. In CVPR, 2015. [23] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. [24] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020. [25] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, pages 9650–9660, 2021. [26] Luigi Carratino, Moustapha Cissa, Rodolphe Jenatton, and Jean-Philippe Vert. On mixup regularization. JMLR, 23(325), 2022. [27] J. Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, 2017. 177 [28] Jun Cen, Peng Yun, Junhao Cai, Michael Yu Wang, and Ming Liu. 
Deep metric learning for open world semantic segmentation. In ICCV, 2021. [29] Fu-Hsiang Chan, Yu-Ting Chen, Yu Xiang, and Min Sun. Anticipating accidents in dashcam videos. In Asian Conference on Computer Vision, 2016. [30] Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A Ross, Jia Deng, and Rahul Sukthankar. Rethinking the faster r-cnn architecture for temporal action localization. In CVPR, 2018. [31] Bertrand Charpentier, Daniel Zügner, and Stephan Günnemann. Posterior network: Un- In NeurIPS, certainty estimation without ood samples via density-based pseudo-counts. 2020. [32] Hila Chefer, Shir Gur, and Lior Wolf. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In ICCV, pages 397–406, 2021. [33] Guangyao Chen, Peixi Peng, Xiangqian Wang, and Yonghong Tian. Adversarial reciprocal points learning for open set recognition. IEEE TPAMI, 2021. [34] Guangyao Chen, Limeng Qiao, Yemin Shi, Peixi Peng, Jia Li, Tiejun Huang, Shiliang Pu, and Yonghong Tian. Learning open set network with discriminative reciprocal points. In ECCV, 2020. [35] Junwen Chen, Wentao Bao, and Yu Kong. Group activity prediction with sequential relational anticipation model. In ECCV, 2020. [36] Lei Chen, Zhan Tong, Yibing Song, Gangshan Wu, and Limin Wang. Cycleacr: Cycle mod- eling of actor-context relations for video action detection. arXiv preprint arXiv:2303.16118, 2023. [37] Lei Chen, Zhan Tong, Yibing Song, Gangshan Wu, and Limin Wang. Efficient video action detection with token dropout and context refinement. In ICCV, 2023. [38] Shoufa Chen, Peize Sun, Enze Xie, Chongjian Ge, Jiannan Wu, Lan Ma, Jiajun Shen, and Ping Luo. Watch only once: An end-to-end video action detection framework. In ICCV, pages 8178–8187, 2021. [39] Yuxiao Chen, Kai Li, Wentao Bao, Deep Patel, Yu Kong, Martin Renqiang Min, and Dim- itris N. Metaxas. Learning to localize actions in instructional videos with llm-based multi- pathway text-video alignment. In Proceedings of the European Conference on Computer Vision (ECCV), 2024. [40] Jaemin Cho, Abhay Zala, and Mohit Bansal. Dall-eval: Probing the reasoning skills and social biases of text-to-image generation models. In ICCV, 2023. 178 [41] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014. [42] Jinwoo Choi, Chen Gao, Joseph CE Messou, and Jia-Bin Huang. Why can’t i dance in the mall? learning to mitigate scene bias in action recognition. In NeurIPS, 2019. [43] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In NeurIPS, 2015. [44] Charles E Connor, Howard E Egeth, and Steven Yantis. Visual attention: bottom-up versus top-down. Current Biology, 14(19):R850–R852, 2004. [45] MMAction2 Contributors. Openmmlab’s next generation video understanding toolbox and benchmark. https://github.com/open-mmlab/mmaction2, 2020. [46] G. Corcoran and J. Clark. Traffic risk assessment: A two-stream approach using dynamic attention. In Conference on Computer and Robot Vision, 2019. [47] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara. A deep multi-level network for saliency prediction. In ICPR, 2016. [48] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. 
In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), volume 1, pages 886–893. Ieee, 2005. [49] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In ECCV, 2018. [50] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. The epic-kitchens dataset: Collection, challenges and baselines. IEEE TPAMI, 43(11):4125– 4141, 2020. [51] Ishan Dave, Zacchaeus Scheffer, Akash Kumar, Sarah Shiraz, Yogesh Singh Rawat, and Mubarak Shah. Gabriellav2: Towards better generalization in surveillance videos for action detection. In WACV, pages 122–132, 2022. [52] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural net- works on graphs with fast localized spectral filtering. In Proceedings of Neural Information Processing Systems, 2016. [53] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A 179 large-scale hierarchical image database. In CVPR, 2009. [54] Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE signal processing magazine, 29(6):141–142, 2012. [55] T. Deng, H. Yan, L. Qin, T. Ngo, and B. S. Manjunath. How do drivers allocate their potential attention? driving fixation prediction via convolutional neural networks. IEEE TITS, 21(5):2146–2154, 2020. [56] Tao Deng, Andong Chen, Min Gao, and Hongmei Yan. Top-down based saliency model in traffic driving environment. In ITSC, 2014. [57] Tao Deng, Hongmei Yan, and Yong-Jie Li. Learning to boost bottom-up fixation prediction in driving environments via random forest. IEEE TITS, 19(9):3059–3067, 2017. [58] John S. Denker and Yann LeCun. Transforming neural-net output levels to probability distributions. In Proceedings of Neural Information Processing Systems, 1990. [59] Nachiket Deo and Mohan M Trivedi. Convolutional social pooling for vehicle trajectory prediction. In CVPR, 2018. [60] Mohammad Mahdi Derakhshani, Enrique Sanchez, Adrian Bulat, Victor Guilherme Turrisi da Costa, Cees G. M. Snoek, Georgios Tzimiropoulos, and Brais Martinez. Bayesian prompt learning for image-language model generalization. In ICCV, 2023. [61] Akshay Dhamija, Manuel Gunther, Jonathan Ventura, and Terrance Boult. The overlooked elephant of object detection: Open set. In WACV, 2020. [62] Akshay Raj Dhamija, Manuel Günther, and Terrance E Boult. Reducing network agnosto- phobia. In NeurIPS, 2018. [63] Christian Diller, Thomas Funkhouser, and Angela Dai. Forecasting characteristic 3d poses of human actions. In CVPR, 2022. [64] Luke Ditria, Benjamin J Meyer, and Tom Drummond. OpenGAN: Open set generative adversarial networks. In ACCV, 2020. [65] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. [66] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In ACRL, 2017. 180 [67] Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary object detection with vision-language model. 
In CVPR, pages 14084–14093, 2022. [68] J. Fang, D. Yan, J. Qiao, J. Xue, H. Wang, and S. Li. DADA-2000: Can driving accident be predicted by driver attention? analyzed by a benchmark. In IEEE Intelligent Transportation Systems Conference, 2019. [69] Zhen Fang, Jie Lu, Anjin Liu, Feng Liu, and Guangquan Zhang. Learning bounds for open-set learning. In ICML, 2021. [70] Alireza Fathi, Ali Farhadi, and James M Rehg. Understanding egocentric activities. In ICCV, 2011. [71] Mishal Fatima, Muhammad Umar Karim Khan, and Chong Min Kyung. Global feature aggregation for accident anticipation. arXiv preprint arXiv:2006.08942, 2020. [72] Gueter Josmy Faure, Min-Hung Chen, and Shang-Hong Lai. Holistic interaction transformer network for action detection. In WACV, pages 3340–3350, 2023. [73] Christoph Feichtenhofer. X3d: Expanding architectures for efficient video recognition. In CVPR, pages 203–213, 2020. [74] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In ICCV, 2019. [75] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In NeurIPS, 2016. [76] Hao Fu, Chunyuan Li, Xiaodong Liu, Jianfeng Gao, Asli Celikyilmaz, and Lawrence Carin. Cyclical annealing schedule: A simple approach to mitigating KL vanishing. In NACCL, 2019. [77] Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In ICML, 2018. [78] Yarin Gal. Uncertainty in Deep Learning. PhD thesis, Department of Engineering, University of Cambridge, 2016. [79] Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with bernoulli approximate variational inference. In International Conference on Learning Representations (Workshop), 2016. [80] Nisal Menuka Gamage, Deepana Ishtaweera, Martin Weigel, and Anusha Withana. So predictable! continuous 3d hand trajectory prediction in virtual reality. In UIST, 2021. 181 [81] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hong- sheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, pages 1–15, 2023. [82] Ziteng Gao, Limin Wang, Bing Han, and Sheng Guo. Adamixer: A fast-converging query- based object detector. In CVPR, pages 5364–5373, 2022. [83] Zongyuan Ge, Sergey Demyanov, Zetao Chen, and Rahil Garnavi. Generative OpenMax for multi-class open set classification. In BMVC, 2017. [84] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? In Proceedings of IEEE Conference on Computer The KITTI Vision Benchmark Suite. Vision and Pattern Recognition, 2012. [85] Wilson S Geisler and Jeffrey S Perry. Real-time foveated multiresolution system for low- In Human Vision and Electronic Imaging III, volume bandwidth video communication. 3299, pages 294–305, 1998. [86] Chuanxing Geng, Sheng-jun Huang, and Songcan Chen. Recent advances in open set IEEE transactions on pattern analysis and machine intelligence, recognition: A survey. 2020. [87] Soumya Suvra Ghosal and Yixuan Li. Are vision transformers robust to spurious correla- tions? International Journal of Computer Vision, pages 1–21, 2023. [88] Partha Ghosh, Mehdi SM Sajjadi, Antonio Vergari, Michael Black, and Bernhard Schölkopf. From variational to deterministic autoencoders. In ICLR, 2020. [89] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. Video action trans- former network. In CVPR, June 2019. 
[90] Rohit Girdhar and Kristen Grauman. Anticipative video transformer. In ICCV, 2021. [91] Laurent Girin, Simon Leglaive, Xiaoyu Bie, Julien Diard, Thomas Hueber, and Xavier Alameda-Pineda. Dynamical variational autoencoders: A comprehensive review. Founda- tions and Trends in Machine Learning, 15(1-2):1–175, 2021. [92] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014. [93] Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. Counterfactual visual explanations. In ICML, 2019. [94] Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, and Dimitris Samaras. Diffusion models as plug-and-play priors. In NeurIPS, 2022. 182 [95] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In CVPR, 2022. [96] Alex Graves. Practical variational inference for neural networks. In Proceedings of Neural Information Processing Systems, 2011. [97] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statis- tical dependence with hilbert-schmidt norms. In ICALT, 2005. [98] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In CVPR, 2018. [99] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In ICML, 2017. [100] Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In CVPR, 2023. [101] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft Actor-Critic: Off- In ICML, policy maximum entropy deep reinforcement learning with a stochastic actor. 2018. [102] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018. [103] Ehsan Hajiramezanali, Arman Hasanzadeh, Krishna Narayanan, Nick Duffield, Mingyuan Zhou, and Xiaoning Qian. Variational graph recurrent neural networks. In Proceedings of Neural Information Processing Systems, 2019. [104] Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shum- ing Ma, and Furu Wei. Language models are general-purpose interfaces. arXiv preprint arXiv:2206.06336, 2022. [105] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, pages 2961–2969, 2017. [106] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. [107] Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing Zhang, Philip Torr, Song Bai, and Xiaojuan Qi. Is synthetic data from generative models ready for image recognition? In ICLR, 2023. 183 [108] Yihui He, Chenchen Zhu, Jianren Wang, Marios Savvides, and Xiangyu Zhang. Bounding box regression with uncertainty for accurate object detection. In CVPR, 2019. [109] Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In CVPR, 2016. 
[110] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017. [111] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. [112] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. [113] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. [114] Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011. [115] H. Hu, Y. Lin, M. Liu, H. Cheng, Y. Chang, and M. Sun. Deep 360 Pilot: Learning a deep agent for piloting through 360° sports videos. In CVPR, 2017. [116] Siteng Huang, Biao Gong, Yutong Feng, Yiliang Lv, and Donglin Wang. Troika: Multi-path cross-modal traction for compositional zero-shot learning. In CVPR, 2024. [117] Wei-Jhe Huang, Jheng-Hsien Yeh, Min-Hung Chen, Gueter Josmy Faure, and Shang-Hong Lai. Interaction-aware prompting for zero-shot spatio-temporal action detection. In ICCV Workshop, pages 284–293, 2023. [118] Xiaohui Huang, Pan He, Anand Rangarajan, and Sanjay Ranka. Intelligent intersection: Two-stream convolutional networks for real-time near-accident detection in traffic video. ACM TSAS, 6(2), 2020. [119] Dat Huynh and Ehsan Elhamifar. Compositional zero-shot learning via fine-grained dense feature composition. In Adv. Neural Inform. Process. Syst., 2020. [120] Jaedong Hwang, Seoung Wug Oh, Joon-Young Lee, and Bohyung Han. Exemplar-based open-set panoptic segmentation network. In CVPR, 2021. [121] Phillip Isola, Joseph J Lim, and Edward H Adelson. Discovering states and transformations in image collections. In CVPR, 2015. 184 [122] Pavel Izmailov, Polina Kirichenko, Nate Gruver, and Andrew G Wilson. On feature learning in the presence of spurious correlations. In NeurIPS, 2022. [123] Lalit P Jain, Walter J Scheirer, and Terrance E Boult. Multi-class open set recognition using probability of inclusion. In ECCV, 2014. [124] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J Black. Towards understanding action recognition. In ICCV, pages 3192–3199, 2013. [125] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hari- haran, and Ser-Nam Lim. Visual prompt tuning. In ECCV, 2022. [126] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023. [127] M. Jiang, X. Boix, G. Roig, J. Xu, L. Van Gool, and Q. Zhao. Learning to predict sequences of human visual fixations. IEEE TNNLS, 27(6):1241–1252, 2016. [128] Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/ THUMOS14/, 2014. [129] Yang Jin, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Bin Chen, Chenyi Lei, An Liu, Chengru Song, Xiaoqiang Lei, et al. Unified language-vision pretraining with dynamic discrete visual tokenization. arXiv preprint arXiv:2309.04669, 2023. [130] Audun Jøsang. Subjective logic. Springer, 2016. 
[131] KJ Joseph, Salman Khan, Fahad Shahbaz Khan, and Vineeth N Balasubramanian. Towards open world object detection. In CVPR, 2021. [132] Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. In ECCV, 2022. [133] Pedro Ribeiro Mendes Júnior, Terrance E Boult, Jacques Wainer, and Anderson Rocha. Spe- cialized support vector machines for open-set recognition. arXiv preprint arXiv:1606.03802, 2016. [134] Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Action tubelet detector for spatio-temporal action localization. In ICCV, 2017. [135] Shyamgopal Karthik, Massimiliano Mancini, and Zeynep Akata. Kg-sp: Knowledge guided simple primitives for open world compositional zero-shot learning. In CVPR, 2022. 185 [136] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Proceedings of Neural Information Processing Systems, 2017. [137] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, 2018. [138] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fa- had Shahbaz Khan. Maple: Multi-modal prompt learning. In CVPR, 2023. [139] Byungju Kim, Hyunwoo Kim, Kyungsu Kim, Sungjin Kim, and Junmo Kim. Learning not to learn: Training deep neural networks with biased data. In CVPR, 2019. [140] Dahun Kim, Anelia Angelova, and Weicheng Kuo. Region-aware pretraining for open- vocabulary object detection with vision transformers. In CVPR, 2023. [141] Jinkyu Kim and John Canny. Interpretable learning for self-driving cars by visualizing causal attention. In ICCV, 2017. [142] Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, and Zeynep Akata. Textual explanations for self-driving vehicles. In ECCV, 2018. [143] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2013. [144] Thomas N Kipf and Max Welling. Variational graph auto-encoders. In Proceedings of Neural Information Processing Systems (Workshop), 2016. [145] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017. [146] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In ICML, 2017. [147] Shu Kong and Deva Ramanan. OpenGAN: Open-set recognition via open data generation. In ICCV, 2021. [148] Yu Kong and Yun Fu. Human action recognition and prediction: A survey. arXiv preprint arXiv:1806.11230, 2018. [149] Yu Kong, Zhiqiang Tao, and Yun Fu. Adversarial action prediction networks. IEEE TPAMI, 2018. [150] Okan Köpüklü, Xiangyu Wei, and Gerhard Rigoll. You only watch once: A unified cnn ar- chitecture for real-time spatiotemporal action localization. arXiv preprint arXiv:1911.06644, 2019. 186 [151] Florian Kraus and Klaus Dietmayer. Uncertainty estimation in one-stage object detection. In ITSC, 2019. [152] Rahul G Krishnan, Uri Shalit, and David Sontag. Deep kalman filters. arXiv preprint arXiv:1511.05121, 2015. [153] Ranganath Krishnan, Mahesh Subedar, and Omesh Tickoo. BAR: Bayesian activity recog- nition using variational inference. In NeurIPS, 2018. [154] Ranganath Krishnan, Mahesh Subedar, and Omesh Tickoo. Specifying weight priors in bayesian deep neural networks with empirical bayes. In AAAI, 2020. [155] Ranganath Krishnan and Omesh Tickoo. 
Improving model calibration with accuracy versus uncertainty optimization. In NeurIPS, 2020. [156] Alex Krizhevsky and Geoff Hinton. Convolutional deep belief networks on cifar-10. Un- published manuscript, 40(7):1–9, 2010. [157] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. [158] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012. [159] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011. [160] Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, and Anelia Angelova. F-vlm: Open- vocabulary object detection upon frozen vision and language models. In ICLR, 2022. [161] Hyeongjun Kwon, Taeyong Song, Somi Jeong, Jin Kim, Jinhyun Jang, and Kwanghoon Sohn. Probabilistic prompt learning for dense prediction. In CVPR, 2023. [162] Taein Kwon, Bugra Tekin, Jan Stühmer, Federica Bogo, and Marc Pollefeys. H2O: Two hands manipulating objects for first person interaction recognition. In ICCV, 2021. [163] Yongchan Kwon, Joong-Ho Won, Beom Joon Kim, and Myunghee Cho Paik. Uncertainty quantification using bayesian neural networks in classification: Application to ischemic stroke lesion segmentation. In Medical Imaging with Deep Learning, 2018. [164] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and brain sciences, 40, 2017. [165] Olivier Le Meur and Zhi Liu. Saccadic model of eye movements for free-viewing condition. 187 Vision Research, 116:152 – 164, 2015. [166] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The mnist database of hand- written digits. http://yann.lecun.com/exdb/mnist. [167] Sharon Lee, Yunzhi Zhang, Shangzhe Wu, and Jiajun Wu. Language-informed visual concept learning. arXiv preprint arXiv:2312.03587, 2023. [168] Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin, and Radu Horaud. A recurrent variational autoencoder for speech enhancement. In ICASSP, 2020. [169] Martha Lewis, Qinan Yu, Jack Merullo, and Ellie Pavlick. Does clip bind concepts? probing compositionality in large image models. arXiv preprint arXiv:2212.10537, 2022. [170] Buyu Li, Yu Liu, and Xiaogang Wang. Gradient harmonized single-stage detector. In AAAI, 2019. [171] Fayin Li and Harry Wechsler. Open set face recognition using transduction. IEEE transac- tions on pattern analysis and machine intelligence, 27(11):1686–1697, 2005. [172] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. In ICML, 2023. [173] Quanyi Li, Zhenghao Peng, Lan Feng, Qihang Zhang, Zhenghai Xue, and Bolei Zhou. Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. [174] Xiangyu Li, Xu Yang, Kun Wei, Cheng Deng, and Muli Yang. Siamese contrastive embed- ding network for compositional zero-shot learning. In CVPR, 2022. [175] Xiaochuan Li, Baoyu Fan, Runze Zhang, Liang Jin, Di Wang, Zhenhua Guo, Yaqian Zhao, and Rengang Li. Image content generation with causal reasoning. In AAAI, 2024. [176] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. 
In ECCV, pages 280–296, 2022. [177] Yanghao Li, Tushar Nagarajan, Bo Xiong, and Kristen Grauman. Ego-exo: Transferring visual representations from third-person to first-person videos. In CVPR, 2021. [178] Yi Li, Hualiang Wang, Yiqun Duan, and Xiaomeng Li. Clip surgery for better explainability with enhancement in open-vocabulary tasks. arXiv preprint arXiv:2304.05653, 2023. [179] Yi Li, Hualiang Wang, Yiqun Duan, Hang Xu, and Xiaomeng Li. Exploring visual inter- pretability for contrastive language-image pre-training. arXiv preprint arXiv:2209.07046, 2022. 188 [180] Yifan Li, Anh Dao, Wentao Bao, Zhen Tan, Tianlong Chen, Huan Liu, and Yu Kong. Facial behavior analysis with instruction tuning. In ECCV, 2024. [181] Yiming Li, Ziang Cao, Andrew Liang, Benjamin Liang, Luoyao Chen, Hang Zhao, and Chen Feng. Egocentric prediction of action target in 3d. In CVPR, 2022. [182] Yin Li, Miao Liu, and James M. Rehg. In the eye of beholder: Joint learning of gaze and actions in first person video. In ECCV, 2018. [183] Yingwei Li, Yi Li, and Nuno Vasconcelos. RESOUND: Towards action recognition without representation bias. In ECCV, 2018. [184] Yixuan Li, Lei Chen, Runyu He, Zhenzhi Wang, Gangshan Wu, and Limin Wang. Multi- sports: A multi-person video dataset of spatio-temporally localized sports actions. In ICCV, pages 13536–13545, 2021. [185] Yong-Lu Li, Yue Xu, Xiaohan Mao, and Cewu Lu. Symmetry and group in attribute-object compositions. In CVPR, 2020. [186] Yun Li, Zhe Liu, Hang Chen, and Lina Yao. Context-based and diversity-driven specificity in compositional zero-shot learning. In CVPR, 2024. [187] Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Noah Snavely, Ce Liu, and William T Freeman. Learning the depths of moving people by watching frozen people. In CVPR, 2019. [188] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In CVPR, 2023. [189] John Liechty, Rik Pieters, and Michel Wedel. Global and local covert visual attention: Evidence from a bayesian hidden markov model. Psychometrika, 68(4):519–541, 2003. [190] Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In EMNLP, 2020. [191] Chuang Lin, Peize Sun, Yi Jiang, Ping Luo, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan, and Jianfei Cai. Learning object-language alignments for open-vocabulary object detection. In ICLR, 2022. [192] Chuming Lin, Chengming Xu, Donghao Luo, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Yanwei Fu. Learning salient boundary feature for anchor-free temporal action localization. In CVPR, 2021. [193] Ji Lin, Chuang Gan, and Song Han. TSM: Temporal shift module for efficient video 189 understanding. In ICCV, 2019. [194] Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, and Shilei Wen. Bmn: Boundary-matching network for temporal action proposal generation. In ICCV, 2019. [195] Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang. Bsn: Boundary sensitive network for temporal action proposal generation. In ECCV, 2018. [196] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 
[197] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017. [198] Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. Frozen clip models are efficient video learners. In ECCV, pages 388–404, 2022. [199] Zachary C Lipton, John Berkowitz, and Charles Elkan. A critical review of recurrent neural networks for sequence learning. arXiv:1506.00019, 2015. [200] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023. [201] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. [202] Miao Liu, Lingni Ma, Kiran Somasundaram, Yin Li, Kristen Grauman, James M Rehg, and Chao Li. Egocentric activity recognition and localization on a 3d map. In ECCV, 2022. [203] Miao Liu, Siyu Tang, Yin Li, and James M Rehg. Forecasting human-object interaction: joint prediction of motor attention and actions in first person video. In ECCV, 2020. [204] Ruyang Liu, Jingjia Huang, Ge Li, Jiashi Feng, Xinglong Wu, and Thomas H Li. Revisiting temporal modeling for clip-based image-to-video knowledge transferring. In CVPR, 2023. [205] Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, and Xiaolong Wang. Joint hand motion and interaction hotspots prediction from egocentric videos. In CVPR, 2022. [206] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023. [207] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: 190 Deep hypersphere embedding for face recognition. In CVPR, 2017. [208] Xinyang Liu, Dongsheng Wang, Miaoge Li, Zhibin Duan, Yishi Xu, Bo Chen, and Mingyuan Zhou. Patch-token aligned bayesian prompt learning for vision-language models. arXiv preprint arXiv:2303.09100, 2023. [209] Zhe Liu, Yun Li, Lina Yao, Xiaojun Chang, Wei Fang, Xiaojun Wu, and Yi Yang. Simple primitives with feasibility-and contextuality-dependence for open-world compositional zero- shot learning. arXiv preprint arXiv:2211.02895, 2022. [210] Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Jiebo Luo, and Tao Mei. Gaussian temporal awareness networks for action localization. In CVPR, 2019. [211] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2018. [212] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60:91–110, 2004. [213] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016. [214] Xiaocheng Lu, Ziming Liu, Song Guo, and Jingcai Guo. Decomposed soft prompt guided fusion enhancing for compositional zero-shot learning. In CVPR, 2023. [215] Yuning Lu, Jianzhuang Liu, Yonggang Zhang, Yajing Liu, and Xinmei Tian. Prompt distribution learning. In CVPR, 2022. [216] Shugao Ma, Leonid Sigal, and Stan Sclaroff. Learning activity progression in LSTMs for activity detection and early detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. [217] Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, and Ranjay Krishna. Crepe: Can vision-language foundation models reason compositionally? In CVPR, 2023. [218] Andrey Malinin and Mark Gales. 
Predictive uncertainty estimation via prior networks. In NeurIPS, 2018. [219] Srikanth Malla, Isht Dwivedi, Behzad Dariush, and Chiho Choi. Nemo: Future object localization using noisy ego priors. In ITSC, 2022. [220] Massimiliano Mancini, Muhammad Ferjad Naeem, Yongqin Xian, and Zeynep Akata. Open world compositional zero-shot learning. In CVPR, 2021. [221] Devraj Mandal, Sanath Narayan, Sai Kumar Dwivedi, Vikram Gupta, Shuaib Ahmed, Fa- had Shahbaz Khan, and Ling Shao. Out-of-distribution detection for generalized zero-shot 191 action recognition. In CVPR, 2019. [222] Mayug Maniparambil, Chris Vorster, Derek Molloy, Noel Murphy, Kevin McGuinness, and Noel E O’Connor. Enhancing clip with gpt-4: Harnessing visual descriptions as prompts. arXiv preprint arXiv:2307.11661, 2023. [223] Sachit Menon and Carl Vondrick. Visual classification via description from large language models. In ICLR, 2023. [224] Dimity Miller, Lachlan Nicholson, Feras Dayoub, and Niko Sünderhauf. Dropout sampling for robust object detection in open-set conditions. In ICRA, 2018. [225] Kyle Min and Jason J. Corso. TASED-Net: Temporally-aggregating spatial encoder-decoder network for video saliency detection. In ICCV, 2019. [226] Ishan Misra, Abhinav Gupta, and Martial Hebert. From red wine to red tomato: Composition with context. In CVPR, 2017. [227] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lilli- crap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016. [228] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In NeurIPS, 2014. [229] Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, and Cordelia Schmid. Verbs in action: Improving verb understanding in video-language models. In ICCV, pages 15579–15591, 2023. [230] Mathew Monfort, Kandan Ramakrishnan, Alex Andonian, Barry A. McNamara, Alex Las- celles, Bowen Pan, Quanfu Fan, Dan Gutfreund, Rogério Schmidt Feris, and Aude Oliva. Multi-Moments in Time: Learning and interpreting models for multi-action video under- standing. CoRR, abs/1911.00232, 2019. [231] Jishnu Mukhoti and Yarin Gal. Evaluating bayesian deep learning methods for semantic segmentation. arXiv preprint arXiv:1811.12709, 2018. [232] Jishnu Mukhoti, Viveka Kulharia, Amartya Sanyal, Stuart Golodetz, Philip HS Torr, and Puneet K Dokania. Calibrating deep neural networks using focal loss. In NeurIPS, 2020. [233] Martin Mundt, Iuliia Pliushch, Sagnik Majumder, and Visvanathan Ramesh. Open set recognition through deep neural network uncertainty: Does out-of-distribution detection require generative classifiers? In ICCVW, 2019. [234] Muhammad Ferjad Naeem, Yongqin Xian, Federico Tombari, and Zeynep Akata. Learning 192 graph embeddings for compositional zero-shot learning. In CVPR, 2021. [235] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In AAAI, 2015. [236] Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, and Tao Xiang. Zero-shot temporal action detec- tion via vision-language prompting. In ECCV, 2022. [237] Tushar Nagarajan and Kristen Grauman. Attributes as operators: factorizing unseen attribute- object compositions. In ECCV, 2018. [238] Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, and Kristen Grauman. Ego-topo: Environment affordances from egocentric video. In CVPR, 2020. [239] Nihal V Nayak, Peilin Yu, and Stephen H Bach. 
Learning to compose soft prompts for compositional zero-shot learning. In ICLR, 2023. [240] Lawrence Neal, Matthew Olson, Xiaoli Fern, Weng-Keen Wong, and Fuxin Li. Open set learning with counterfactual images. In ECCV, 2018. [241] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012. [242] Lukas Neumann, Andrew Zisserman, and Andrea Vedaldi. Future event prediction: If and when. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (Workshop), 2019. [243] Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. Expanding language-image pretrained models for general video recognition. In ECCV, pages 1–18, 2022. [244] Adrián Núñez-Marcos, Gorka Azkune, and Ignacio Arganda-Carreras. Egocentric vision- based action recognition: a survey. Neurocomputing, 472:175–197, 2022. [245] Dimitri Ognibene, Christian Balkenius, and Gianluca Baldassarre. A reinforcement-learning model of top-down attention based on a potential-action map. In The Challenge of Antici- pation: A Unifying Framework for the Analysis and Design of Artificial Cognitive Systems, pages 161–184. Springer, 2008. [246] Hugo Oliveira, Caio Silva, Gabriel LS Machado, Keiller Nogueira, and Jefersson A dos Santos. Fully convolutional open set segmentation. Machine Learning, pages 1–52, 2021. [247] OpenAI. OpenAI GPT-3.5 API [gpt-3.5-turbo-0125]. https://openai.com/blog/chatgpt. Ac- cessed: 2023. 193 [248] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. [249] Poojan Oza and Vishal M Patel. C2AE: Class conditioned auto-encoder for open-set recog- nition. In CVPR, 2019. [250] Junting Pan, Siyu Chen, Mike Zheng Shou, Yu Liu, Jing Shao, and Hongsheng Li. Actor- In CVPR, pages context-actor relation network for spatio-temporal action localization. 464–474, 2021. [251] Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, and Hongsheng Li. St-adapter: Parameter- efficient image-to-video transfer learning. In NeurIPS, pages 26462–26477, 2022. [252] Guansong Pang, Chunhua Shen, Longbing Cao, and Anton van den Hengel. Deep learning for anomaly detection: A review. arXiv preprint arXiv:2007.02500, 2020. [253] Paschalis Panteleris, Iason Oikonomidis, and Antonis Argyros. Using a single rgb frame for real time 3d hand pose estimation in the wild. In WACV, 2018. [254] Seulki Park, Jongin Lim, Younghan Jeon, and Jin Young Choi. Influence-balanced loss for imbalanced visual classification. In ICCV, 2021. [255] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of Neural Information Processing Systems, 2019. [256] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, 2014. [257] Pramuditha Perera, Vlad I Morariu, Rajiv Jain, Varun Manjunatha, Curtis Wigington, Vi- cente Ordonez, and Vishal M Patel. Generative-discriminative feature representations for open-set recognition. In CVPR, 2020. [258] Trung Pham, Thanh-Toan Do, Gustavo Carneiro, Ian Reid, et al. Bayesian semantic instance segmentation in open set world. In ECCV, 2018. 