ACTION MODELING IN LONG-FORM VIDEOS By Junwen Chen A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy 2024 ABSTRACT Video is the dominant modality that people use to consume content and share experiences in time series. The significant expansion of video data available both on the internet and in everyday life has spurred the creation of intelligent systems that can automatically analyze video content and comprehend human actions. Compared to static images, videos describe how the world changes as time elapses. The uniqueness of videos, beyond what can be understood from a single image, is the context of action understanding. Over the last ten years, we have seen huge success in recognizing human actions in a video, by deep neural networks. However, this action recognition has several limitations for real applications. It primarily focuses on recognizing action patterns within only a few seconds. This is still far from progressing to a human-level intelligence of video understanding. People can directly perceive uncurated long videos in the real world. We want the model directly applied to long-form videos, which are untrimmed and contain multiple actions/events. In approaching this challenge, in this thesis, we first study the representation learning of actions/events in long-form videos. We develop models to learn the fine-grained motion repre- sentations across multiple actions/events in a video. My research seeks to enable machine visions to represent motions over a long-horizon range, by exploiting the potential of multi-modal video- language contexts. We also address learning the actions jointly performed by a group of people, by modeling their interactions. After that, we investigate leveraging the long-range dependencies of the events in boosting temporal reasoning downstream tasks, including online action detection and spatiotemporal object grounding. Finally, considering the wide applications of video models, we focus on cultivating trustworthiness in the models for long-form videos from static bias mitigation and interpretable reasoning perspectives. Copyright by JUNWEN CHEN 2024 ACKNOWLEDGEMENTS I would like to express my sincere thanks to all the people who helped and supported me in this invaluable academic discovery journey. First of all, I am full of appreciation for my advisor Professor Yu Kong who has taken me under his wing since the beginning of his faculty career and supported me all these years. Thank you for the days and nights we worked together, discussing research ideas, sharing insightful work, writing papers and responses, and sharing a cake when our effort paid off. He offered me the space to grow and the great freedom to explore the topics that I am interested in. He also gave me a lot of courage and recognition to pose new research questions and tackle challenging and exciting research problems. When I encountered difficulties, he provided me with constructive suggestions and cheered me up. Besides, I am sincerely thankful to Dr. Kong and his family who overcame many obstacles moving to MSU, so that I could have the opportunity to experience two different cities and departments and meet a diversity of researchers. I am deeply grateful to my committee members, Prof. Xiaoming Liu, Prof. Parisa Kordjamshidi, and Prof. Daniel Morris, all of whom have given me invaluable support and provided precious suggestions and feedback on my research and thesis. 
Their students have also been intellectually stimulating and socially fun to be around. I would also like to extend my gratitude to my internship mentors. I would like to thank our Action lab: when I first joined the lab, I had accumulated less research experience in computer vision than Wentao. He spent countless hours helping me present results, polish papers, and write responses, even when he had his own deadlines to meet. I am happy and grateful to have the younger generation, Yujiang, Yifan, and Zhanbo, as labmates in the final year of my Ph.D.; with them I shared common research interests and drew a lot of motivation. Many thanks also to the former members, Haiting, Hanbin, Chuanqi, Anna, Xinmiao, and Xiwen. During my Ph.D., I was fortunate to spend four unforgettable years in Rochester, the birthplace of Kodak, where I met researchers working on a variety of computer vision and imaging related projects, including Prof. Linwei Wang (a role model for a female Ph.D. student), Prof. Pengcheng Shi, Prof. Rui Li, Prof. Matthew Wright, Prof. Dongfang Liu, and Prof. Chenliang Xu and his group. Finally, I would like to thank my parents and family, whose endless love serves as my cornerstone.

TABLE OF CONTENTS

CHAPTER 1  INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
           BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
CHAPTER 2  LEARNING FINE-GRAINED MOTION REPRESENTATIONS . . . . . . . . . . . . . . . . . 9
           BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
CHAPTER 3  GROUP ACTIVITY PREDICTION WITH SEQUENTIALLY RELATIONAL ANTICIPATION MODEL . . . 32
           BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
CHAPTER 4  GATED HISTORY UNIT WITH BACKGROUND SUPPRESSION FOR ONLINE ACTION DETECTION . . 54
           BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
CHAPTER 5  ACTIVITY-DRIVEN WEAKLY-SUPERVISED SPATIAL-TEMPORAL VIDEO OBJECT GROUNDING  . . 76
           BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
CHAPTER 6  EXPLAINABLE VIDEO ENTAILMENT WITH VISUALLY GROUNDED EVIDENCE . . . . . . . . . 100
           BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
CHAPTER 7  CONCLUSIONS AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . . . 120

CHAPTER 1
INTRODUCTION

1.1 Motivating Problems
Video is the primary signal through which we perceive the world every day, as we observe our surrounding environment in the form of continuous visual input. We are in the era of video. In the real world, there are now more than 45 billion cameras on Earth. On the Internet, hundreds of hours of video are uploaded to sharing platforms such as YouTube and TikTok every single minute. One of the fundamental goals of AI research is to equip intelligent systems to analyze video content automatically and perceive their environment. For example, a video-sharing website needs to understand the characters and events in a video to make a good recommendation; an autonomous vehicle needs to predict whether a pedestrian will cross the street at a stop sign and anticipate what the pedestrian may do next. A task uniquely suited to videos, beyond what can be understood from a still image (e.g., scenes, people, and objects), is the context of action understanding in videos.
Human action encodes how an actor’s relationship with surroundings evolves over time. Over the last ten years, we have seen big success in recognizing human actions in a video, by deep neural networks Bertasius et al. (2021); Feichtenhofer et al. (2019); Carreira and Zisserman (2017). We now have visual systems that can accurately recognize human actions from various perspectives: in different camera angles from third-person videos Carreira and Zisserman (2017); Shao et al. (2020) to first-person videos Damen et al. (2018); Grauman et al. (2022) by wearable cameras, in different granularity from coarse-level Carreira and Zisserman (2017) to fine-level Shao et al. (2020), in different modalities from pixel Carreira and Zisserman (2017); Shao et al. (2020) to human pose Shahroudy et al. (2016), and in a different number of actors from simple limb movement Shahroudy et al. (2016) to group activities of multiple actors Ibrahim et al. (2016). Powered by the ubiquitous large-scale vision(-language) models Xu et al. (2021); Radford et al. (2021); Dosovitskiy et al. (2021), these models have demonstrated near human-level performance on the recognition tasks Bertasius et al. (2021); Wang et al. (2016); Feichtenhofer et al. (2019) and 1 Figure 1.1 Action Modeling in Long-form Videos: Representations, Reasoning, and Trustworthi- ness. localization tasks Dai et al. (2017); Zhao et al. (2017). However, this action recognition has several limitations for real applications, since it primarily focuses on recognizing action patterns within only a few seconds. In the real world, we desire the model directly applied to raw untrimmed videos which may be prolonged and contain multiple events across time. To realize the goal, we need to study action modeling in long-form video understanding. In this dissertation, we study three problems that play important roles in long-form videos (See Fig. 1.1): (1) How to perceive the actions/events from long-form videos? Action modeling is a fundamental task in video understanding. Challenges of action modeling from long-form videos stem from learning the fine-grained motion representations across multiple events performed on the individuals and learning the actions performed by multiple people. (2) How to capture the long-range dependencies across time and conduct temporal reasoning between the events? An untrimmed video usually contains rich temporal dependencies in time series. Because a baby fell down seconds ago, the baby is crying now. In addition to progressing in the perception of actions, it is necessary to design methods that learn the temporal relations in a video, such as predicting what is about to happen and why something is happening. (3) How to cultivate trustworthiness in video models? The diverse video understanding tasks and applications also call for designing trustworthy models. Specifically, we study the bias mitigation in video understanding and present explainable decision-making for humans. 2 1.2 Our Approaches 1.2.1 Learning action representations from long-form videos In the first part of the thesis, we address the research question on How to perceive the ac- tions/events from long-form videos. Existing action recognition tasks mainly benchmark the action recognition in videos with only a few seconds. We focus on learning the representations of dy- namics in long-form videos, where multiple actions are performed. In the first work, we leverage video-language alignment to learn the fine-grained motions across multiple events in the same scenario. 
In the second work, we present a novel framework to learn the dynamics performed by multiple actors. Specifically, in Chapter 2, we take video question answering (VideoQA) datasets that feature temporal reasoning to evaluate the fine-grained motion representations. For example, imagine that answer the question “What did the boy do before he raised his hand to take the camera?”. The model needs to recognize the actions of “raising hand” and “taking camera” in the video, which are performed by the same individual “the boy” sequentially and share the same appearance information. Thus, we approach a fine-grained motion representation learning for this challenge. We introduce Action Temporality Modeling (ATM) via three-fold uniqueness: (1) an empirical study of realizing that optical flow is effective in capturing the long horizon temporality reasoning; (2) training the visual-text embedding by contrastive learning in an action-centric manner, leading to better action representations in both vision and text modalities; and (3) preventing the model from answering the question given the shuffled video in the fine-tuning stage, to avoid spurious correlation between appearance and motion and hence ensure faithful temporality reasoning. In the experiments, we show that ATM outperforms previous approaches in terms of accuracy on multiple VideoQAs and exhibits better true temporality reasoning ability. In Chapter 3, we further investigate the representation learning of actions that are jointly performed by multiple actors, which has applications in many surveillance scenarios. We propose a novel approach to predict group activities given the beginning frames with incomplete activity executions. For group activity prediction, the relation evolution of people’s activity and their 3 positions over time is an important cue for predicting group activity. To this end, we propose a sequential relational anticipation model (SRAM) that summarizes the relational dynamics in the partial observation and progressively anticipates the group representations with rich discriminative information. Our model explicitly anticipates both activity features and positions by two graph auto-encoders, aiming to learn a discriminative group representation for group activity prediction. Experimental results on two popularly used datasets demonstrate that our approach significantly outperforms state-of-the-art action prediction methods. 1.2.2 Leveraging long-range dependencies in long-form video understanding After extracting the representation of the individual actions/events in a long-form video, in the second part of the thesis, we further study i.e., the long-range dependencies of the actions and leverage them in downstream tasks. First, we develop a vision algorithm that is capable of having a past-to-future reasoning in a fluid video stream. The algorithm can efficiently detect the relevant information from the long and redundant history frames. Second, we improve the spatiotemporal grounding of the objects by modeling the effect of human actions. In Chapter 4, the research strives to address how to relate the long and redundant history to understanding the present. Online action detection is the task of predicting the action as soon as it happens in a streaming video. A major challenge is that the model does not have access to the future and has to rely solely on history, i.e., , the frames observed so far, to make predictions. 
It is therefore important to accentuate parts of the history that are more informative to the prediction of the current frame. We present Gated History Unit with Background Suppression i.e., GateHUB, that comprises a novel gated cross-attention mechanism to enhance or suppress parts of the history as per how informative they are for current frame prediction. In a single unified framework, GateHUB integrates the transformer’s ability of long-range temporal modeling and the recurrent model’s capacity to selectively encode relevant information. Extensive validation demonstrates that GateHUB significantly outperforms all existing methods and is also more efficient than the existing best work. When a human performs action, the carrier of action i.e., , objects experiences the state change, 4 which is considered the effect of human action. For example, when doing “mash potato”, we can see the potato evolve from “cube-shape” to “paste-shape”. Given our dynamic world, I contend that traditional object-centric tasks such as grounding can be facilitated with videos, where the cause-effect is an intrinsic cue for learning to see. Chapter 5 pointed out that activity cues in both text and visual modalities are informative for grounding objects in an untrimmed video. This work is one of the first to combine the cause-effect from instructional videos, which is a setting that is more likely to become common as environmental and wearable cameras become even more ubiquitous. 1.2.3 Cultivating trustworthiness in video models While video models can be widely adopted in many tasks and daily applications, it is necessary to guarantee that the developed models are trustworthy, especially the human-centered visual tasks. In the third part of the thesis, we work on making model decision-making trustworthy, by (1) revealing and mitigating the bias in video understanding and (2) making the decision-making interpretable, both leveraging the multimodal context. Recent studies have pointed out that the models sometimes capture the unwanted bias exhibited in training data. In addition to the widely considered bias e.g., , gender, race, and watermark, videos contain a specific type of bias, i.e., static bias. That is exploiting the static representations (objects, scenes, and people) to learn the underlying temporal reasoning task. For example, while the “basketball dunk” and “soccer juggling” have distinct temporal patterns, they can be discriminated by classifying the background into a basketball court or a soccer field. In Chapter 2, the proposed ATM also contributes to overcoming static bias in temporal reasoning by manipulating text supervision. Particularly, we design a simple yet effective solution that masks the appearance word in the text and guides the video representation to be aligned with the motion word. We also design a new metric to reveal if the temporal reasoning question answering relies on the actual motion information. In Chapter 6, we propose to improve the video language reasoning task performance with visually grounded evidence. In addition to the downstream task reasoning part, we add the 5 explanation part that grounds the people and objects mentioned in the text to the videos. We leverage video entailment as evaluation, which aims at determining if a hypothesis textual statement is entailed or contradicted by a premise video. The main challenge of video entailment is that it requires fine-grained reasoning to understand complex and long story-based videos. 
We proposes to incorporate visual grounding to the entailment by explicitly linking the entities described in the statement to the evidence in the video. If the entities are grounded in the video, we enhance the entailment judgment by focusing on the frames where the entities occur. Besides, in the entailment dataset, the entailed/contradictory (also named as real/fake) statements are formed in pairs with the subtle discrepancy, which allows an add-on explanation module to predict which words or phrases make the statement contradictory to the video and regularize the training of the entailment judgment. Experimental results demonstrate that our approach outperforms the state-of-the-art methods. 1.3 Relevant Publications • Chapter 2- Junwen Chen, Jie Zhu, and Yu Kong. ATM: Action Temporality Modeling for Video Question Answering. In ACM Multimedia, 2023 • Chapter 3- Junwen Chen, Wentao Bao, and Yu Kong. Group Activity Prediction with Sequential Relational Anticipation Model. In European Conference on Computer Vision (ECCV), 2020 • Chapter 4- Junwen Chen, Gaurav Mittal, Ye Yu, Yu Kong, and Mei Chen. GateHUB: Gated History Unit with Background Suppression for Online Action Detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022 • Chapter 5- Junwen Chen, Wentao Bao, and Yu Kong. Activity-driven Weakly-Supervised Spatio-Temporal Grounding from Untrimmed Videos. In ACM Multimedia, 2020 • Chapter 6- Junwen Chen and Yu Kong. Explainable Video Entailment with Visually Grounded Evidence. In IEEE International Conference on Computer Vision (ICCV), 2021. 6 BIBLIOGRAPHY Bertasius, G., Wang, H., and Torresani, L. (2021). Is space-time attention all you need for video understanding? Carreira, J. and Zisserman, A. (2017). Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR. Dai, X., Singh, B., Zhang, G., Davis, L. S., and Chen, Y. Q. (2017). Temporal context network for activity localization in videos. In ICCV. Damen, D., Doughty, H., Farinella, G. M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., and Wray, M. (2018). Scaling egocentric vision: The epic- kitchens dataset. In European Conference on Computer Vision (ECCV). Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2019). Slowfast networks for video recognition. In ICCV. Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., et al. (2022). Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012. Ibrahim, M. S., Muralidharan, S., Deng, Z., Vahdat, A., and Mori, G. (2016). A hierarchical deep temporal model for group activity recognition. In CVPR, pages 1971–1980. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR. Shahroudy, A., Liu, J., Ng, T.-T., and Wang, G. (2016). Ntu rgb+d: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1010–1019. Shao, D., Zhao, Y., Dai, B., and Lin, D. 
(2020). Finegym: A hierarchical video dataset for fine-grained action understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2616–2625.

Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., and Van Gool, L. (2016). Temporal segment networks: Towards good practices for deep action recognition. In ECCV.

Xu, H., Ghosh, G., Huang, P.-Y., Okhonko, D., Aghajanyan, A., Metze, F., Zettlemoyer, L., and Feichtenhofer, C. (2021). Videoclip: Contrastive pre-training for zero-shot video-text understanding. In EMNLP, pages 6787–6800.

Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., and Lin, D. (2017). Temporal action detection with structured segment networks. In ICCV.

CHAPTER 2
LEARNING FINE-GRAINED MOTION REPRESENTATIONS

2.1 Introduction
Video question answering (VideoQA) is an interactive AI task that enables many downstream applications such as vision-language navigation and communication systems. It aims to answer a natural language question given the video content. The recent VideoQA benchmark Xiao et al. (2021) has gone beyond descriptions of video content such as "A baby is crying" and started to provide effective diagnostics of a model's temporal reasoning and causal reflection, e.g., "The train stops after moving for a while". To correctly answer such a question, a VideoQA model needs to detect the object "train", recognize the "railway" scene, and, more importantly, ground the actions "move" and "stop" and understand their temporal relation. The questions are unconstrained and complex, so a visual-text alignment model with reasoning capability over all of the aforementioned content is necessary.

Recent advanced VideoQA models have shown the capability of learning descriptive content Lei et al. (2018, 2019), thanks to the success of cross-modal transformers Li et al. (2020); Lei et al. (2021). However, temporality reasoning in videos remains a great challenge, since these VideoQA models are only capable of holistic recognition of static content in a video. Recent work attempts to solve this issue by (1) enhancing the video representation with fine-grained dynamics Xiao et al. (2022b,a) and (2) answering by grounding to question-critical visual evidence Li et al. (2022b,c). However, precise grounding is hard to achieve without ground-truth temporal boundaries for training. The state-of-the-art method VGT Xiao et al. (2022b) proposes to model atomic actions across frames from the spatio-temporal dynamics of objects, so that fine-grained dynamics can be captured. Yet their model may rely on static bias, i.e., object appearance, as a shortcut, while the causal factors, i.e., the dynamics, are overlooked in training. In this paper, we address the importance of precise and faithful modeling of actions for the VideoQA task.

We propose Action Temporality Modeling (ATM) to address the challenging temporality VideoQA (as shown in Fig. 2.1). A promise of VideoQA compared to ImageQA is to examine the temporal relation reasoning regarding motion information.

Figure 2.1 ATM addresses VideoQA featuring multi-frame temporality reasoning by (1) an appearance-free stream, i.e., optical flow, to extract precise motion cues, (2) action-centric contrastive learning (AcCL) for an action-plentiful cross-modal motion representation, and (3) a temporal sensitivity-aware confusion (TSC) training to avoid learning a shortcut between temporality-critical motion and appearance.
As the targeted video is continuous, actions across a long video usually share the same scene within short moments. We realize that (1) leveraging an appearance-free stream, e.g., optical flow, as input is still important in VideoQA, even though the flow stream has become less considered in recent action recognition methods Bertasius et al. (2021); Feichtenhofer et al. (2019), because flow can capture subtle transitions over a long horizon and aid temporality reasoning. ATM trains the visual-text encoding in a contrastive manner. Questions are usually unconstrained in the real world, and the action may be only a small portion of the question that is easily overwhelmed by other information such as objects. (2) To learn an action-plentiful cross-modal embedding, we develop a novel action-centric contrastive learning (AcCL) before fine-tuning VideoQA. Specifically, it parses an action phrase from a question and encourages a feature alignment between the video and the parsed action phrase alone, discarding other textual information. The merit of AcCL is that both video and text encoders are trained to focus on actions, mitigating the backbone's representation bias towards static visual appearance in videos. (3) Based on the learned representations, we further introduce a novel temporal sensitivity-aware confusion loss (TSC) in VideoQA finetuning. It prevents a model from answering a temporality question if the corresponding video is shuffled in the temporal domain, thus avoiding simply learning a shortcut correlation to static content. Note that VideoQA contains many descriptive questions that can be answered invariant to temporal change. Thus, we only apply the confusion loss to temporal-sensitive questions that contain temporal keywords.

Thanks to these components, the proposed ATM outperforms all existing methods on three commonly used VideoQA datasets. It is worth noting that our method, without external vision-language pretraining, surpasses the existing method that relies on large-scale pre-training by a clear margin. Moreover, we devise a new metric that quantifies the accuracy difference between conditioning on a full video and conditioning on a single frame, which reveals a VideoQA model's true temporality reasoning ability. Results show that our model experiences a larger performance escalation from a single frame to a full video, which demonstrates that it relies less on appearance bias and handles temporal reasoning in a faithful manner. To summarize, our main contributions are as follows:
• We propose ATM to address event temporality reasoning in VideoQA by faithful action modeling. Our action-centric contrastive learning learns action-aware representations from both vision and text modalities. We realize an appearance-free stream is effective for multi-event temporality understanding across frames.
• We fine-tune the model with a newly developed temporal sensitivity-aware confusion loss that mitigates static bias in temporality reasoning.
• Our method is more accurate than all existing methods on three widely used VideoQA datasets. By a new metric, we also indicate that our method addresses temporality reasoning more faithfully.

2.2 Related Work
Video Question Answering. Escalating ImageQA Antol et al. (2015), VideoQA Xu et al. (2016); Yu et al. (2019); Li et al. (2016); Xiao et al. (2021); Li et al.
(2022a); Lei et al. (2018) 11 Figure 2.2 Framework Overview. Following the recent VQAs Xiao et al. (2022b); Yang et al. (2021a), we solve VideoQA by a similarity comparison between video and text (a). To achieve this, we formulate the training procedure into two stages. Before finetuning, we present a novel action-centric contrastive learning (AcCL) to guide the visual and text representation expressive for action information (b). After that, we fine-tune the VideoQA (c) by a newly developed temporal sensitivity-aware confusion loss (TSC) to prevent leveraging static bias in temporality reasoning. is enriched with reasoning about temporal nature. Prior arts Le et al. (2020); Park et al. (2021); Xiao et al. (2022a) on VideoQA focus on learning an informative video content representation and a cross-modal fusion model to answer the question. An informative video representation is usually hierarchical, fusing object-, frame- and clip-level representations, which are extracted by graph neural network Jiang and Han (2020); Li et al. (2022c); Park et al. (2021), relation learning or transformers. While those VideoQA methods achieve compelling results on VideoQA benchmarks, they mainly answer descriptive questions for the video content, such as questions that holistic recognize the main actions/objects across frames. Recent benchmark Xiao et al. (2021) begins to challenge the temporal relationship reasoning ability, as actions in videos are diverse and causally dependent. Those methods that are only capable of descriptive content recognition cannot perform well, because they hardly capture the subtle transitions in the same scene. To this end, recent work Xiao et al. (2022a,b) proposes to encode video as a local-to-global dynamic graph of spatiotemporal objects, so that the interaction relations can be encoded. However, the VideoQA model built upon the dynamic graph may easily be distracted by the object’s appearance and capture limited motion information. We alleviate the distraction by a novel two-stage training to ensure a faithful representation of motions that are critical for temporality reasoning. Static Bias in Video. The promise of video lies in the potential to go beyond image-level understanding i.e., scenes, objects, and people to capture the temporality of events. However, for 12 many video(+language) tasks and datasets, given just a single frame of video, an existing image- centric model can achieve surprisingly high performance, comparable to the model using multiple frames. The strong single-frame performance suggests that the video representation is biased towards the still appearance information, namely “static appearance bias”. Existing work Buch et al. (2022); Lei et al. (2022); Li et al. (2018); Choi et al. (2019) reveals this kind of bias in action recognition dataset Carreira and Zisserman (2017); Soomro et al. (2012) and retrieval dataset Liu et al. (2019); Luo et al. (2021). Circling around the fundamental video task action recognition, Li et al. (2018); Choi et al. (2019) analyze the role of temporality in action recognition and inspires the subsequent development of profound faithful evaluations Shao et al. (2020); Li et al. (2018) and model structures Feichtenhofer et al. (2019); Lin et al. (2019); Feichtenhofer (2020); Duan et al. (2022). To address the challenging temporality reasoning in multi-modal scenarios i.e., VideoQA, motion representations, unbiased toward appearance, are necessary. 
As VideoQA requires a deep understanding of open-vocabulary action semantics, existing VideoQAs Le et al. (2020); Xiao et al. (2022a) extract the motion features based on backbones pre-trained on a large-scale action recognition dataset Carreira and Zisserman (2017). As mentioned, static bias exists in action recognition, which makes the motion representations not the causal factors of actions, thus useless to temporality reasoning. Existing methodsChoi et al. (2019); Li et al. (2018) mitigate the static bias in action recognition by evaluating it on fine-grained action recognitionShao et al. (2020); Li et al. (2018), where the scene context is the same across the different actions. However, in fine-grained action recognition, motion is more critical information, which is different from VideoQA where object/entity appearance is inevitable. To mitigate static bias in VideoQA, IGV Li et al. (2022c) and EIGV Li et al. (2022b) are proposed to ground the question-critical scenes across frames as the evidence of yielding the answers. However, the dominant content of a question is appearance information e.g., people, objects, and locations. The grounding may pay less attention to the actions that are critical for temporality understanding and be not precise as no ground truth boundaries are provided. Our 13 method designs two simple yet effective schemes that learn faithful visual and text representations informative for action and temporality. We also revisit the early action recognition work Wang et al. (2016); Carreira and Zisserman (2017) and enhance the motion representation with an appearance- free stream. 2.3 Methodology Figure 2.2 gives an overview of ATM framework. Our framework addresses the VideoQA task that challenges the temporal reasoning of dynamics in a video. Following the recent VQAs Yang et al. (2021a); Xiao et al. (2022b), we solve VideoQA by a similarity comparison between video and text (Figure 2.2-a). To achieve this, we formulate the training procedure into two stages. In the first stage (Figure 2.2-b), we present a novel action-centric contrastive learning (AcCL, Sec. 2.3.3), which makes the visual and text representation lexpressive for action information. After that, we finetune the VideoQA (Figure 2.2-c) by a newly developed temporal sensitivity-aware confusion loss (TSC, Sec. 2.3.4) to prevent leveraging static bias in temporality reasoning. We detailed the video and text encoding in Sec. 2.3.2 2.3.1 Preliminaries Given a video h and a question 𝑞, VideoQA aims to combine the two modalities h and 𝑞 to predict the answer 𝑎. Following existing VideoQA work Li et al. (2022c); Xiao et al. (2022a,b), we predict the answer by selecting the best matched 𝑎∗ from many candidates A of a question 𝑞, given the corresponding video h: 𝑎∗ = arg max𝑎∈A F𝑊 (𝑎|𝑞, h, A), (2.1) where F𝑊 denotes the mapping function with learnable parameters 𝑊. The candidates A are multi-choices in multi-choiceQA or a global answer list in open-ended QA. Prior arts on VideoQA usually build F𝑊 as a cross-attention transformer Zhu and Yang (2020); Lei et al. (2021), which takes a holistic token sequence containing video, question and each candidate answer as input and classifies the answers as output. Recent work VGT Xiao et al. (2022b) and VQA-T Yang et al. (2021b) propose to design F𝑊 as two unimodal transformers that 14 Figure 2.3 Motivation of using an appearance-free stream for motion representation in VideoQA task. The example in (a) shows the state transition on a train, from moving to stop- ping. 
We can see flow provides better cues for the actions than RGB. (b) summarizes the relative performance gain/loss of different video backbones pivot on TSN, for both action recognition (Ki- netics Carreira and Zisserman (2017)) and VideoQA (NextQA Xiao et al. (2021)), which shows appearance-free stream i.e., flow is necessary for VideoQA. The numbers for action recognition (green curves) are reported in their paper for Kinetics-400. The numbers for VideoQA are derived based on our implementation on NextQA. encode video and question-answer pair respectively and compare the visual-text similarity for each answer as output: 𝑠𝑎 = F𝑣 (h) F𝑞 ( [𝑞; 𝑎])⊤ , (2.2) in which F𝑣 denotes the video encoder and F𝑣 (h) ∈ R𝑑 is the video’ global feature obtained by mean- pooling the features across 𝑇 frames. Likewise, F𝑞 denotes the text encoder and F𝑞 ( [𝑞; 𝑎]) ∈ R𝑑 is the feature vector of a question-answer pair, where [; ] indicates the concatenation of question and answer text. The visual-text similarity 𝑠𝑎 is obtained via a dot-product of video and text features w.r.t. the answer 𝑎. The optimal answer is selected by maximizing the similarity score from the candidate in the pool A: 𝑎∗ = arg max 𝑎∈A (𝑠𝑎). (2.3) Following existing work Xiao et al. (2022b); Li et al. (2022c), we implement F𝑞 by the BERT De- vlin et al. (2019) to extract text features. For video modality, many existing methods Xiao et al. (2022b,a) extract features in multiple streams including object-level and frame-level. Following them, we also formulate F𝑣 as a multi-stream video encoder (MSVE), by which object features are encoded as 𝑓𝑜 ∈ R𝑇×𝑑 and frame features are encoded as 𝑓𝑖 ∈ R𝑇×𝑑. The object/frame feature ex- 15 traction and transformer-based encoding are exactly the same as state-of-the-art method VGT Xiao et al. (2022b) for a fair comparison. 2.3.2 Rethinking motion representations in VideoQA In video feature extraction of both the existing methods Xiao et al. (2022b,a); Le et al. (2020) and ours, frame-level features 𝑓𝑖 ∈ R𝑇×𝑑 and object features 𝑓𝑜 ∈ R𝑇×𝑑 both represent appearance. Optionally, they Xiao et al. (2022a); Le et al. (2020) apply a pre-trained 3D Conv network Carreira and Zisserman (2017) on the neighboring frames to capture motions. However, VideoQA studies the temporality of the actions in a video where multiple actions are performed across frames. As a video captures continuous information, these actions usually share the same scene context and are performed by the same people and on the entity. In this case, although 3D Conv can capture motions, neighboring RGB frames may be too redundant to precisely model the actions. For example, in Figure 2.3-a, it is hard to recognize “the train is stopping” in the last clip from RGB. This inspires us to enhance the video representation by a stream, where the appearance information is least and hence the motions are highlighted. To this end, we resort to optical flow that describes the apparent motion of individual pixels on the image. As shown in Figure 2.3-a’s example, flow maps provide better cues to understand the state transition of objects e.g., “train” was moving (in the first and second clip) and stopped (in the third clip). As VideoQA requires the open-vocabulary semantic understanding of motions, we use the backbone pretrained on a large-scale action recognition dataset Kinetics-400 Kay et al. (2017) to extract flow features. Flow features are extracted as per appearance frame timestamps as 𝑓𝑚 ∈ R𝑇×𝑑. 
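For concreteness, one way to obtain clip-level flow features aligned with the sampled appearance clips is sketched below: per-frame flow features are grouped into the same T = 16 clips by uniform temporal chunking and averaged within each clip. The chunking strategy, tensor shapes, and function name are assumptions for illustration, not the exact released pipeline.

```python
import torch

def align_flow_to_clips(flow_feats, num_clips=16):
    """Group per-frame optical-flow features into clip-level features f_m.

    flow_feats: (N, d) features for the N flow frames of one video.
    Returns:    (num_clips, d) one feature per clip, aligned with the T = 16
                sampled appearance clips by uniform temporal chunking.
    """
    n, d = flow_feats.shape
    # Uniformly assign each flow frame to one of the clips, then average per clip.
    clip_ids = torch.linspace(0, num_clips - 1e-6, steps=n).long()
    f_m = torch.zeros(num_clips, d)
    for t in range(num_clips):
        members = flow_feats[clip_ids == t]
        if len(members) > 0:
            f_m[t] = members.mean(dim=0)
    return f_m

# Toy usage: 80 flow frames with 2048-d features -> 16 clip-level features.
print(align_flow_to_clips(torch.randn(80, 2048)).shape)  # torch.Size([16, 2048])
```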
To fuse the object, appearance, and flow streams, our MSVE applies MLPs and a learnable multi-head self-attention layer MSA with position embedding to model the temporal interactions upon the multi-stream features, and mean-pools the frames to obtain the global video representation 𝑓𝑣:

𝑓𝑣 = Mean-Pool(MSA(MLP([𝑓𝑜; 𝑓𝑖; 𝑓𝑚]))).  (2.4)

Note that we should not ignore appearance information in the VideoQA task, as the questions are unconstrained and may contain characters, objects, and locations that need to be grounded to videos. This is different from action segmentation Ding and Yao (2021) or skeleton-based activity recognition Zhou et al. (2021); Duan et al. (2022), where motion is the only critical information.

We revisit the fundamental video understanding task, i.e., action recognition, in which the early methods, e.g., TSN Wang et al. (2016); Carreira and Zisserman (2017), also utilized optical flow to capture motions. As shown in Figure 2.3-b, we observe that although the existing powerful backbones, e.g., SlowFast Feichtenhofer et al. (2019), X3D Feichtenhofer (2020), TimeSformer Bertasius et al. (2021), and XCLIP Ni et al. (2022), achieve good performance without an appearance-free stream, i.e., optical flow, in action recognition, they are less helpful in VideoQA compared to the early methods with an appearance-free stream. This demonstrates that for longer-horizon temporality understanding, a stream free of appearance is necessary. A detailed comparison is discussed in Sec. 2.4.5.3.

2.3.3 Action-centric Contrastive Learning (AcCL)
As aforementioned, a question-answer pair contains much information, including characters, objects, and locations. Actions, the important reasoning objective in videos, may only occupy a small portion of the QA text and be neglected in the cross-modal alignment. Since VideoQA takes the alignment between global video features and full QA sequence features as the optimization objective, the precise motion information obtained from Sec. 2.3.2 may not be well exploited. A VideoQA model capable of answering temporal questions should make good use of motion.

To this end, we propose a novel training scheme that conducts contrastive learning for visual-language matching before finetuning on the VideoQA objective. Different from conventional VL contrastive learning, the contrastive learning in our method is action-centric. It encourages the video representation to be aligned with the representation of the action phrase that is parsed from the question. That is to say, other information such as entities, locations, and objects is not present in the text for matching. For example, in the question "what happens to the train after moving for a while?", the action phrase to be aligned with the whole video clip is "moving for a while". Under this matching objective, the video representation has to focus on precise motions, leading to a deep understanding of temporality. Specifically, we propose a contrastive loss L𝑝𝑡 to update the encoders F𝑣, F𝑞:

L𝑝𝑡 = Σ𝑖 log [ exp(𝑠𝑐) / ( exp(𝑠𝑐) + Σ𝑐′∈N𝑖 exp(𝑠𝑐′) ) ],  (2.5)

where N𝑖 denotes the negative pool of action phrases for the 𝑖-th sample, i.e., action phrases from questions that are unpaired with the video h, and 𝑠𝑐 = F𝑣(h)F𝑞(𝑐)⊤ is the similarity between the action phrase 𝑐 and the video h of the 𝑖-th sample. It encourages the video representation to be closer to its paired action phrase 𝑐 and far away from the unpaired 𝑐′ that are randomly sampled into the mini-batch; a minimal sketch of this objective is given below.
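The following sketch implements the AcCL objective of Eq. (2.5) under the assumption that the negative pool N𝑖 consists of the other action phrases in the mini-batch and that no temperature scaling is used; tensor names are illustrative, not the released implementation.

```python
import torch

def accl_loss(video_feats, action_feats):
    """Action-centric contrastive loss over a mini-batch (cf. Eq. 2.5).

    video_feats:  (B, d) pooled video features F_v(h).
    action_feats: (B, d) text features of the parsed action phrases F_q(c).
    The i-th action phrase is the positive for the i-th video; the remaining
    B - 1 phrases in the batch act as the negative pool N_i.
    """
    sim = video_feats @ action_feats.t()                  # (B, B) pairwise similarities s_c
    log_prob = sim.diag() - torch.logsumexp(sim, dim=1)   # log softmax of the paired phrase per row
    # Negated so that minimizing the loss maximizes the log term in Eq. (2.5),
    # pulling each video towards its own action phrase and away from the others.
    return -log_prob.mean()

# Toy usage: batch of 8 video/action-phrase pairs with 512-d features.
print(accl_loss(torch.randn(8, 512), torch.randn(8, 512)))
```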
Thus, by contrasting with many other action phrases 𝑐′ ∈ N𝑖 in the dataset, the motion in vision and the textual action are better mined and aligned. The motion-plentiful features and model provide a good starting point for VideoQA finetuning. Many VideoQA models benefit from contrastive-learning-based video-language pretraining Lei et al. (2021); Zellers et al. (2021) on large-scale video-language data Bain et al. (2021), which is also reflected in the SoTA model for our task Xiao et al. (2022b). However, our AcCL is conducted on the task datasets themselves, without resorting to any external training data, and is already more effective than VGT Xiao et al. (2022b) with external data pretraining, while requiring far fewer training resources.

2.3.4 Temporal Sensitivity-aware Confusion Loss
At the end of Sec. 2.3.2, we mention that although we have an appearance-free stream to extract precise motions, the appearance stream is indispensable. Unfortunately, the appearance stream, even when fused with an appearance-free stream, opens the possibility of modeling actions biased towards scene/object context Choi et al. (2019). To mitigate this issue, we propose to prevent the model from answering a question if the corresponding video is randomly reordered in the temporal domain. Our motivation is that temporality reasoning requires the model to infer inter-action relations across time, such as "stop (action 1) after moving for a while (action 2)". Thus, if we randomly shuffle the video, the "after" relation no longer exists. A reliable network should be unable to answer "stop" to a question like "What is the train doing after moving for a while?". Motivated by this, we design a confusion loss that takes as input the shuffled video ˜h and the question-answer pair [𝑞; 𝑎]:

L𝑐𝑓^(𝑛)(p̂, p̂) = − Σ𝑎=1..|A| p̂(𝑎) log p̂(𝑎),  where p̂(𝑎) = exp(ŝ(𝑎)) / Σ𝑘=1..|A| exp(ŝ(𝑘)),  (2.6)

where ŝ(𝑎) = ˜f𝑣 f𝑞⊤ (we denote ŝ(𝑎) ∈ [ŝ(1), . . . , ŝ(|A|)]⊤) is the inner-product similarity score between the 𝑎-th answer features f𝑞 = F𝑞([𝑞; 𝑎]) and the shuffled video feature vector ˜f𝑣 = F𝑣(˜h). The confusion loss is applied to the shuffled video and encourages the largest entropy over all answers, so that scene context that is invariant to temporal order change is ignored in action relation modeling. Many questions in VideoQA, e.g., "Where is the video taken?", rely only on descriptive content and can be answered even with shuffled videos. Thus, the confusion loss only applies to temporal-sensitive questions, e.g., the "after" question "what does A do after raising her hand?". Temporal-sensitive questions contain specific English syntax, e.g., "before", "after", "when"; we filter out the temporal-insensitive questions based on the presence of these keywords. The overall optimization objective is as follows:

min E𝑞^(𝑛)∼𝑄𝜏 [ L𝑐𝑒^(𝑛)(y, p) − L𝑐𝑓^(𝑛)(p̂, p̂) ],  (2.7)

where 𝑄𝜏 denotes the set of questions that are temporally sensitive, and L𝑐𝑒^(𝑛) is the cross-entropy loss measuring whether the probability over the candidate answers p = [𝑝(1), ..., 𝑝(|A|)] follows the ground-truth answer 𝑦. L𝑐𝑒^(𝑛) is applied to all samples, including the temporal-insensitive ones, which is to optimize:

min E𝑞^(𝑛)∼𝑄\𝜏 [ L𝑐𝑒^(𝑛)(y, p) ],  (2.8)

where 𝑄\𝜏 denotes the set of remaining temporally insensitive samples. The two losses are used to fine-tune the VideoQA model after AcCL (see Sec. 2.3.3); a sketch of this combined objective is given below.
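The sketch below combines Eqs. (2.6)-(2.8): cross-entropy on all samples, plus a confusion (maximum-entropy) term computed on temporally shuffled videos for temporal-sensitive questions only, detected here by simple keyword matching. The helper names, shapes, and keyword list are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

TEMPORAL_KEYWORDS = ("before", "after", "when")  # syntax cues for temporal-sensitive questions

def is_temporal_sensitive(question: str) -> bool:
    return any(k in question.lower().split() for k in TEMPORAL_KEYWORDS)

def tsc_objective(scores, shuffled_scores, labels, temporal_mask):
    """Combined fine-tuning objective (cf. Eqs. 2.6-2.8).

    scores:          (B, |A|) answer similarities on the original videos.
    shuffled_scores: (B, |A|) answer similarities on temporally shuffled videos.
    labels:          (B,) ground-truth answer indices.
    temporal_mask:   (B,) bool, True for temporal-sensitive questions Q_tau,
                     e.g. torch.tensor([is_temporal_sensitive(q) for q in questions]).
    """
    # Cross-entropy L_ce on every sample (Eqs. 2.7 and 2.8).
    ce = F.cross_entropy(scores, labels)

    # Confusion term L_cf: entropy of the answer distribution on shuffled videos (Eq. 2.6).
    p_hat = F.softmax(shuffled_scores, dim=1)
    entropy = -(p_hat * torch.log(p_hat + 1e-8)).sum(dim=1)
    cf = entropy[temporal_mask].mean() if temporal_mask.any() else scores.new_zeros(())

    # Subtracting the entropy means minimizing this objective maximizes the
    # confusion on shuffled videos for temporal-sensitive questions.
    return ce - cf
```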
Methods                        |        NExT-QA Val                 |        NExT-QA Test
                               | Acc@C  Acc@T  Acc@D  Acc@All       | Acc@C  Acc@T  Acc@D  Acc@All
EVQA Antol et al. (2015)       | 42.46  46.34  45.82  44.24         | 43.27  46.93  45.62  44.92
STVQA Jang et al. (2017)       | 44.76  49.26  55.86  47.94         | 45.51  47.57  54.59  47.64
CoMem Gao et al. (2018)        | 45.22  49.07  55.34  48.04         | 45.85  50.02  54.38  48.54
HCRN* Le et al. (2020)         | 45.91  49.26  53.67  48.20         | 47.07  49.27  54.02  48.89
HME Fan et al. (2019)          | 46.18  48.20  58.30  48.72         | 46.76  48.89  57.37  49.16
HGA Jiang and Han (2020)       | 46.26  50.74  59.33  49.74         | 48.13  49.08  57.79  50.01
HQGA Xiao et al. (2022a)       | 48.48  51.24  61.65  51.42         | 49.04  52.28  59.43  51.75
P3D-G Cherian et al. (2022)    | 51.33  52.30  62.58  53.40         | -      -      -      -
IGV Li et al. (2022c)          | -      -      -      -             | 48.56  51.67  59.64  51.34
EIGV Li et al. (2022b)         | -      -      -      -             | -      -      -      53.70
ATP Buch et al. (2022)         | 53.1   50.2   66.8   54.30         | -      -      -      -
VGT Xiao et al. (2022b)        | 52.28  55.09  64.09  55.02         | 51.62  51.94  63.65  53.68
VGT* Xiao et al. (2022b)       | 53.43  56.39  69.50  56.89         | 52.78  54.54  67.26  55.70
Ours                           | 56.04  58.44  65.38  58.27         | 55.31  55.55  65.34  57.03

Table 2.1 Results of multi-choice QA on the validation and test sets of the NExT-QA dataset. The best results are bolded. Note that the greyed-out VGT* uses 0.18 million videos from the WebVid dataset Bain et al. (2021) for pretraining, while the remaining methods, including ATM, do not pretrain on external large-scale data. All numbers for existing work are taken from their papers. "-" indicates missing results. Acc@C, Acc@T, and Acc@D denote the accuracy for causal, temporal, and descriptive questions.

2.4 Experiments
2.4.1 Datasets
NExT-QA Xiao et al. (2021) consists of 47.7K questions with answers in the form of multiple choices, annotated from 5.4K videos. It pinpoints causal and temporal reasoning over object interactions. NExT-QA focuses on question answering with visual evidence. Thus, in addition to the temporal reasoning questions, the causal questions, e.g., "How" and "Why", require the corresponding answers to be visible in the video and also assess multi-frame temporal event understanding. TGIF-QA Jang et al. (2017) contains 134.7K questions about repeated actions, state transitions, and a certain frame, annotated from 91.8K GIFs. MSRVTT-QA Xu et al. (2017) challenges holistic visual recognition or description, and includes 10K annotated videos and 244K open-ended question-answer pairs.

2.4.2 Implementation Details
Appearance Features. Following Xiao et al. (2022b,a), we decode the video into frames and sparsely sample 16 clips, where each clip is 4 frames long. To make a fair comparison with the state-of-the-art VGT Xiao et al. (2022b), we also use the RoI-aligned features as the object appearance features 𝑓𝑜 ∈ R16×2048, extracted following Anderson et al. (2018). Frame features 𝑓𝑖 ∈ R16×2048 are extracted by ResNet-50 He et al. (2016) pretrained on ImageNet.
Motion Features. We use denseflow Wang et al. (2020) to extract the optical flow maps at the videos' original FPS. Then, we use the mmaction2 Contributors (2020)-based ResNet from TSN Wang et al. (2016), pre-trained on Kinetics-400 Carreira and Zisserman (2017), to extract optical flow features for the three datasets. To temporally align with the object and frame features, we uniformly distribute the flow maps into 𝐾 = 16 clips per video. We uniformly sample 5 flow frames per clip and obtain a 2048-d feature vector for each clip. Thus, the motion features 𝑓𝑚 for a video are in R16×2048.
Action-centric Contrastive Learning. We parse the action phrases from questions using the spaCy parser Honnibal and Montani (2017); a rough sketch of this parsing step is shown below.
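The heuristic below extracts a content verb together with its right-hand dependents from a question using spaCy. The exact phrase-selection rules used by ATM may differ, so treat the function, the light-verb filter, and the model name as assumptions.

```python
import spacy

# Assumes the small English model has been downloaded (python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

def action_phrases(question: str):
    """Return candidate action phrases: each content verb plus its right-hand dependents,
    e.g. 'raised his hand to take the camera' for the example question below."""
    doc = nlp(question)
    phrases = []
    for token in doc:
        # Skip light/auxiliary verbs such as 'do', 'be', 'have'.
        if token.pos_ == "VERB" and token.lemma_ not in {"do", "be", "have"}:
            span = doc[token.i: token.right_edge.i + 1]  # verb and its right subtree
            phrases.append(span.text)
    return phrases

print(action_phrases("What did the boy do before he raised his hand to take the camera?"))
```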
We use Adam optimizer Kingma and Ba (2015) with cosine annealing learning schedule of PyTorch initialized at 1𝑒 − 5 on NVIDIA RTX A6000 at the maximum epoch of 10 among all of the datasets. Each batch contains 64 aligned video-action pairs and forms 64 pairs in total in the contrastive learning. Fine-tuning We finetune the VideoQA using Adam optimizer Kingma and Ba (2015), batch size of 64, cosine annealing learning schedule of PyTorch initialized at 1𝑒 − 5 on NVIDIA RTX A6000. The maximum epochs are set as 15 on NextQA, 30 on MSRVTT-QA and 50 on TGIFs. 2.4.3 Comparison with State-of-the-Art Table 2.1 compares our method with existing state-of-the-art (SoTA) VideoQA methods on the widely used Next-QA dataset that feature the temporality reasoning. To ensure a fair comparison, ATM follows SoTA VGT Xiao et al. (2022b) and uses the exact same appearance feature extraction and applies DGT Xiao et al. (2022b) to model the object features. From the table, we can observe that ATM outperforms all existing methods without external data pretraining, by at least 3.85% and 3.35% on val. and test splits respectively. The outperformance is across causal, temporal, and descriptive splits of the Next-QA dataset, which demonstrate that ATM is effective in various question types that span from short segment to full video, from causal to temporal, and from single to repeated action execution. 21 Models LGCN Huang et al. (2020) HGA Jiang and Han (2020) HCRN Le et al. (2020) B2A Park et al. (2021) HOSTR Dang et al. (2021) HAIR Liu et al. (2021) MASN Seo et al. (2021) PGAT Peng et al. (2021) MHN Peng et al. (2022) ClipBERT* Lei et al. (2021) SiaSRea* Yu et al. (2021) TGIF-QA Action Transition Action† Transition† 74.3 75.4 75.0 75.9 75.0 77.8 84.4 80.6 83.5 82.8 79.7 MERLOT* Zellers et al. (2021) 94.0 95.0 96.0 - - 63.9 - - - - 65.9 - - - - 70.5 71.0 - - 55.7 - - - - 58.7 - - - - 59.9 65.7 81.1 81.0 81.4 82.6 83.0 82.3 87.4 85.7 90.8 87.8 85.3 96.2 97.6 97.3 VGT Xiao et al. (2022b) Ours Zellers et al. (2021) MSRVTT-QA - 35.5 35.6 36.9 35.9 36.9 35.2 38.1 38.6 37.4 41.6 43.1 39.7 40.3 Table 2.2 Results on TGIF-QA and MSVTT-QA. † denotes TGIF-QA-R Peng et al. (2021) whose multiple choices for repeated action and state transition are more challenging. * denotes the models pretrained with large-scale external data. Moreover, ATM which comes without external large-scale pre-training, even surpasses the existing method that used large-scale pretraining on more than 0.18 million videos Bain et al. (2021), by a clear margin of 1.38% and 1.33% on validation and test splits respectively. This demonstrates that ATM comprises of appearance-free motion features Sec 2.3.2, action-centric contrastive learning 2.3.3 and temporal sensitive-aware confusion objective 2.3.4, which holistically models action temporality, is more effective than the global video-text matching while uses less training computation resources. In ATP Buch et al. (2022), the temporal modeling is performed on frames that are representative for single events and are encoded with CLIP model Radford et al. (2021). Our method also exceeds ATP Buch et al. (2022) by a large margin of 3.97%. This shows in temporality-heavy tasks, precise and faithful motion modeling is more effective than selecting the informative single frame for an event. This validates that ATM to precisely model and reason about motion, sets the new SoTA on Next-QA Xiao et al. (2021) benchmark. We further compare ATM with SoTA on TGIF-QA in Table 2.2. 
Following the protocol, we use the same appearance features extracted by VGT Xiao et al. (2022b) and extract the motion stream features. We observe that ATM set new SoTA for repeated actions, and transition in TGIF-QA, 22 which shows ATM as a whole is also effective in the repeated action and object transition scenarios. For MSRVTT-QA in Table 2.2, our performance (free-of pertaining) is better than pretraining- free SoTA VGT but is inferior to the large-scale pre-trained methods MERLOT Zellers et al. (2021) and SiaSRea* Yu et al. (2021). This is because pre-training help model the descriptive content, while our work focuses on action temporality modeling. 2.4.4 True Temporality Metric ATP Buch et al. (2022) evaluated the upper bound performance of a single-frame model on a video dataset and pointed out that even though NextQA dataset focuses on temporality reasoning, the dataset still contains static appearance bias. A small portion of questions can be correctly answered exclusively from a single frame without temporal information. To this end, we propose to measure the temporality faithfulness of VideoQA methods, i.e., revealing if a VideoQA method learns true temporality to answering questions, instead of learning the correlation between the static appearance and the answer. In specific, the proposed true temporality metric measures the difference of QA accuracy between given the full video and given the middle frame respectively, as 𝛿. Table 2.3 shows that ATM better learns the true temporality compared to SoTA VGT, w/ w/o pretraining, on both Next-QA and TGif-QA. We observe that the external large-scale data for pretraining VGT guides the model to leverage more static information in temporality reasoning (only +0.84% on Next-QA test) since the pre-training helps more on the descriptive content that is static. Each of our component i.e., AcCL, TSC, and appearance-free motion stream, helps to learn the true temporality. TSC mitigates the static bias by preventing answering temporality question if the temporal relations are destroyed. AcCL encourages learning motion representation agonistic to the entity or other appearance information. Appearance-free motion streams extract motion-plentiful representations that are necessary to understand the true temporality. 2.4.5 Ablation Studies In addition to the study of each component, we conduct further ablation studies on NextQA Xiao et al. (2021) dataset. 23 Next-QA (%) TGIF-QA (%) val-Acc val-𝛿 test-Acc test-𝛿 act-Acc act-𝛿 trans-Acc trans-𝛿 58.27 +5.51 57.03 +5.13 Ours +1.3 +0.7 56.87 +2.71 55.02 +2.30 w/o AcCL +0.2 w/o TSC 57.99 +2.98 56.24 +3.25 +0.8 w/o motion stream 56.57 +3.02 55.78 +2.80 +0.3 VGT w/o pretrained 55.02 +2.91 53.68 +2.15 - 56.89 +1.02 55.70 +0.84 VGT w/ pretrained 96.0 +1.2 93.5 +0.5 95.6 +0.8 95.2 +0.9 95.0 +0.6 97.3 97.1 96.9 96.7 97.6 - - - Table 2.3 True temporality evaluation: Study of model components and comparison with SoTA. 2.4.5.1 Impact of Action-centric Contrastive Learning We conduct an experiment where we test different variants of the text in Action-centric Con- trastive Learning (AcCL). Table 2.4-a summarizes the results of the ablations. AcCL aims at learning action features by aligning the video with the action phrase from the question. The variants replace the action phrase by (1) the correct answer text w.r.t. 
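For concreteness, the true-temporality gap 𝛿 reported in Table 2.3 (Sec. 2.4.4) is the accuracy with the full video minus the accuracy when only the middle frame is visible; a small sketch follows, where the prediction tensors and helper names are placeholders rather than the released evaluation code.

```python
import torch

def accuracy(predictions, labels):
    """Fraction of questions answered correctly."""
    return (predictions == labels).float().mean().item()

def true_temporality_gap(full_video_preds, middle_frame_preds, labels):
    """delta = Acc(full video) - Acc(middle frame only), cf. Sec. 2.4.4.

    A larger gap suggests the model relies on temporal information rather than
    static appearance recoverable from a single frame.
    """
    return accuracy(full_video_preds, labels) - accuracy(middle_frame_preds, labels)

# Toy example: 4 questions with candidate answers indexed 0..4.
labels = torch.tensor([1, 3, 0, 2])
full_video_preds = torch.tensor([1, 3, 0, 4])    # 3/4 correct
middle_frame_preds = torch.tensor([1, 0, 0, 4])  # 2/4 correct
print(true_temporality_gap(full_video_preds, middle_frame_preds, labels))  # 0.25
```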
to video-question, denoted as “Answer”, (2) the concatenation of the entire question and the correct answer text, denoted as “Question+Answer”, (3) the entire question text, denoted as “Question”, (4) the verb in question. Variants Action phrase (ours) Answer Question+Answer Question Verb in Question w/o AcCL val (%) 58.27 55.92 56.83 57.51 57.07 56.87 test (%) 57.03 53.60 55.16 56.38 56.57 55.02 Variants TS-aware (Ours) TS-unaware w/o TSC (a) (b) test (%) 57.03 56.89 56.24 test-𝛿 (%) +5.13 +3.67 +3.25 val (%) Variants TSN (ours) 58.27 I3D 57.71 57.01 3D ResNext101 56.97 SlowFast 56.27 X3D 56.99 Timesformer 56.08 XCLIP 57.35 I3D-RGB only TSN-RGB only 56.94 w/o motion stream 56.57 test (%) 57.03 56.40 55.30 55.83 55.78 56.00 55.90 55.63 55.42 55.78 (c) Table 2.4 Ablation study comparing different variants of (a) AcCL (b) TSC and (c) motion repre- sentations. Table 2.4-a shows that our implementation of AcCL outperforms all of the other variants. We observe that the “Question” variant performs 0.65% worse than our “action in question" on test split since the full question text contains entity, scene, and other appearance information in addition to the action phrase. Contrasting with full questions will distract the representation from the motion information to the dominant and easily learned appearance features, which is less effective than action-centric version. Using “Answer”, “Question+Answer” also performs worse than ours. This 24 demonstrates that the action phrases in questions are the information that the randomly initialized model parameters easily overlook but are important for temporality. Using “verb from question” is also less effective, as the action cannot be described by a single word in many cases, e.g. verb “get” is not informative enough for the action “get up”. 2.4.5.2 Impact of TSC Loss We compare our Temporal Sensitivity-aware Confusion loss (TSC) in Table 2.4-b, with variants (1) removing the TSC and only training with cross-entropy, as “w/o TSC”. (2) applying the confusion loss to all samples regardless of time-sensitivity, as “TS-unaware”. Our method is slightly better than these two variants in VideoQA accuracy and much higher on the proposed true temporality reasoning metric. This validates that alleviating static bias by TSC helps a faithful temporal reasoning model, which in turn improves the event temporality understanding. 2.4.5.3 Impact of Appearance-free stream Table 2.4-c shows the ablations on motion features 𝑓𝑚 and analyzes the effectiveness of in- corporating an appearance-free stream. In the table, TSN and I3D extract motion features with an appearance-free stream i.e., flow maps, while the remaining extract motions only from the appearance-included input i.e., RGB. These RGB-only methods SlowFast Feichtenhofer et al. (2019), X3D Feichtenhofer (2020), TimeSformer Bertasius et al. (2021) and XCLIP Ni et al. (2022) show superb performance on action recognition, as shown in Fig. 2.3-b. But they fall behind of the methods with the optical flow on motion feature extraction for VideoQA, though TSN and I3D are relatively early work without fancy network structures. RGB frames may only be enough for characterizing limited sets of atomic actions that are dominant for action recognition, it is less effective in modeling events with long horizon temporality. 3D ResNext101 Hara et al. (2017) has been used for motion feature extraction in existing VideoQA Le et al. (2020); Xiao et al. (2022a), but it is also RGB-only and 1.73% worse than TSN where flow is used. 
2.4.6 Qualitative Analysis

In Fig. 2.4, we qualitatively assess the effect of ATM by visualizing the results of representative samples from the val split. We observe that the AcCL scheme helps to learn discriminative representations for actions, e.g., "turn" in (1), while the variant w/o AcCL may learn superficial correlations between appearance, e.g., "his fingers", and the answers. Moreover, the appearance-free stream also helps in extracting precise and useful motions. Since the scene and actor do not change in (3), the optical flow stream is informative for recognizing the "drag towards" action.

Figure 2.4 Visualization. The ground-truth answers are marked in green. We display the results of ATM, ATM w/o AcCL (as "Ours w/o AcCL"), ATM w/o motion stream (as "Ours w/o flow"), and the existing SoTA method VGT. The samples span causality (1) and temporality (2, 3) reasoning.

2.5 Summary

In this chapter, we propose a novel framework for VideoQA. Our method addresses the importance of temporality reasoning. To this end, we argue that it is worth revisiting optical flow: flow has become less common in atomic action recognition but is still effective for long-horizon temporality. We then propose an action-centric contrastive learning scheme that makes both video and text representations informative about actions, and we fine-tune the VideoQA model via a novel temporal sensitivity-aware confusion loss to mitigate potential static bias. Our ATM method is demonstrated to be superior to existing VideoQA methods on multiple benchmarks and shows faithful temporality reasoning under a new metric.

BIBLIOGRAPHY

Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6077–6086.

Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., and Parikh, D. (2015). Vqa: Visual question answering. In CVPR, pages 2425–2433.

Bain, M., Nagrani, A., Varol, G., and Zisserman, A. (2021). Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1728–1738.

Bertasius, G., Wang, H., and Torresani, L. (2021). Is space-time attention all you need for video understanding?

Buch, S., Eyzaguirre, C., Gaidon, A., Wu, J., Fei-Fei, L., and Niebles, J. C. (2022). Revisiting the "video" in video-language understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2917–2927.

Carreira, J. and Zisserman, A. (2017). Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR.

Cherian, A., Hori, C., Marks, T. K., and Le Roux, J. (2022).
(2.5+ 1) d spatio-temporal scene graphs for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 444–453. Choi, J., Gao, C., Messou, J. C., and Huang, J.-B. (2019). Why can’t i dance in the mall? learning to mitigate scene bias in action recognition. 32. Contributors, M. (2020). Openmmlab’s next generation video understanding toolbox and bench- mark. https://github.com/open-mmlab/mmaction2. Dang, L. H., Le, T. M., Le, V., and Tran, T. (2021). Hierarchical object-oriented spatio-temporal reasoning for video question answering. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirec- tional transformers for language understanding. Ding, G. and Yao, A. (2021). Temporal action segmentation with high-level complex activity labels. arXiv preprint arXiv:2108.06706. Duan, H., Zhao, Y., Chen, K., Lin, D., and Dai, B. (2022). Revisiting skeleton-based action In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern recognition. 27 Recognition, pages 2969–2978. Fan, C., Zhang, X., Zhang, S., Wang, W., Zhang, C., and Huang, H. (2019). Heterogeneous memory enhanced multimodal attention model for video question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1999–2007. Feichtenhofer, C. (2020). X3d: Expanding architectures for efficient video recognition. In Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 203–213. Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2019). Slowfast networks for video recognition. In ICCV. Gao, J., Ge, R., Chen, K., and Nevatia, R. (2018). Motion-appearance co-memory networks for video question answering. In CVPR, pages 6576–6585. Hara, K., Kataoka, H., and Satoh, Y. (2017). Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? arXiv:1711.09577. He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. Honnibal, M. and Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear. Huang, D., Chen, P., Zeng, R., Du, Q., Tan, M., and Gan, C. (2020). Location-aware graph convolutional networks for video question answering. In AAAI, pages 11021–11028. Jang, Y., Song, Y., Yu, Y., Kim, Y., and Kim, G. (2017). Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In CVPR. Jiang, P. and Han, Y. (2020). Reasoning with heterogeneous graph alignment for video question answering. In AAAI Conference on Artificial Intelligence (AAAI). Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al. (2017). The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In ICLR. Le, T. M., Le, V., Venkatesh, S., and Tran, T. (2020). Hierarchical conditional relation networks for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9972–9981. Lei, J., Berg, T. L., and Bansal, M. (2022). Revealing single frame bias for video-and-language 28 learning. arXiv preprint arXiv:2206.03428. Lei, J., Li, L., Zhou, L., Gan, Z., Berg, T. L., Bansal, M., and Liu, J. (2021). 
Less is more: Clipbert for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7331–7341. Lei, J., Yu, L., Bansal, M., and Berg, T. L. (2018). Tvqa: Localized, compositional video question answering. Lei, J., Yu, L., Berg, T. L., and Bansal, M. (2019). Tvqa+: Spatio-temporal grounding for video question answering. Li, J., Niu, L., and Zhang, L. (2022a). From representation to reasoning: Towards both evidence and commonsense reasoning for video question-answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21273–21282. Li, L., Chen, Y.-C., Cheng, Y., Gan, Z., Yu, L., and Liu, J. (2020). Hero: Hierarchical encoder for video+ language omni-representation pre-training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2046–2065. Li, Y., Li, Y., and Vasconcelos, N. (2018). Resound: Towards action recognition without repre- sentation bias. In Proceedings of the European Conference on Computer Vision (ECCV), pages 513–528. Li, Y., Song, Y., Cao, L., Tetreault, J., Goldberg, L., Jaimes, A., and Luo, J. (2016). Tgif: A new dataset and benchmark on animated gif description. In CVPR, pages 4641–4650. Li, Y., Wang, X., Xiao, J., and Chua, T.-S. (2022b). Equivariant and invariant grounding for video question answering. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4714–4722. Li, Y., Wang, X., Xiao, J., Ji, W., and Chua, T.-S. (2022c). Invariant grounding for video question In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern answering. Recognition (CVPR), pages 2928–2937. Lin, J., Gan, C., and Han, S. (2019). Tsm: Temporal shift module for efficient video understanding. In ICCV, pages 7083–7093. Liu, F., Liu, J., Wang, W., and Lu, H. (2021). Hair: Hierarchical visual-semantic relational reasoning for video question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1698–1707. Liu, Y., Albanie, S., Nagrani, A., and Zisserman, A. (2019). Use what you have: Video retrieval using representations from collaborative experts. arXiv preprint arXiv:1907.13487. 29 Luo, H., Ji, L., Zhong, M., Chen, Y., Lei, W., Duan, N., and Li, T. (2021). Clip4clip: An empirical study of clip for end to end video clip retrieval. In arXiv preprint arXiv:2104.08860. Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S., and Ling, H. (2022). Expanding language-image pretrained models for general video recognition. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV, pages 1–18. Springer. Park, J., Lee, J., and Sohn, K. (2021). Bridge to answer: Structure-aware graph interaction network for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15526–15535. Peng, L., Yang, S., Bin, Y., and Wang, G. (2021). Progressive graph attention network for video question answering. In ACM MM, pages 2871–2879. Peng, M., Wang, C., Gao, Y., Shi, Y., and Zhou, X.-D. (2022). Multilevel hierarchical network with multiscale sampling for video question answering. IJCAI. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR. 
Seo, A., Kang, G.-C., Park, J., and Zhang, B.-T. (2021). Attend what you need: Motion-appearance synergistic networks for video question answering. In ACL, pages 6167–6177. Shao, D., Zhao, Y., Dai, B., and Lin, D. (2020). Finegym: A hierarchical video dataset for fine- grained action understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2616–2625. Soomro, K., Zamir, A. R., and Shah, M. (2012). Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., and Van Gool, L. (2016). Temporal segment networks: Towards good practices for deep action recognition. In ECCV. Wang, S., Li, Z., Zhao, Y., Xiong, Y., Wang, L., and Lin, D. (2020). denseflow. https://github.com/open-mmlab/denseflow. Xiao, J., Shang, X., Yao, A., and Chua, T.-S. (2021). Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9777–9786. Xiao, J., Yao, A., Liu, Z., Li, Y., Ji, W., and Chua, T.-S. (2022a). Video as conditional graph hierarchy for multi-granular question answering. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 2804–2812. 30 Xiao, J., Zhou, P., Chua, T.-S., and Yan, S. (2022b). Video graph transformer for video question answering. In European Conference on Computer Vision, pages 39–58. Springer. Xu, D., Zhao, Z., Xiao, J., Wu, F., Zhang, H., He, X., and Zhuang, Y. (2017). Video question In ACM MM, pages answering via gradually refined attention over appearance and motion. 1645–1653. Xu, J., Mei, T., Yao, T., and Rui, Y. (2016). Msr-vtt: A large video description dataset for bridging video and language. In CVPR. Yang, A., Miech, A., Sivic, J., Laptev, I., and Schmid, C. (2021a). Just ask: Learning to answer In Proceedings of the IEEE/CVF International questions from millions of narrated videos. Conference on Computer Vision (ICCV), pages 1686–1697. Yang, A., Miech, A., Sivic, J., Laptev, I., and Schmid, C. (2021b). Just ask: Learning to answer In Proceedings of the IEEE/CVF International questions from millions of narrated videos. Conference on Computer Vision (ICCV), pages 1686–1697. Yu, W., Zheng, H., Li, M., Ji, L., Wu, L., Xiao, N., and Duan, N. (2021). Learning from inside: Self-driven siamese sampling and reasoning for video question answering. 34. Yu, Z., Xu, D., Yu, J., Yu, T., Zhao, Z., Zhuang, Y., and Tao, D. (2019). Activitynet-qa: A dataset for understanding complex web videos via question answering. Zellers, R., Lu, X., Hessel, J., Yu, Y., Park, J. S., Cao, J., Farhadi, A., and Choi, Y. (2021). Merlot: In Advances in neural information processing Multimodal neural script knowledge models. systems (NeurIPS), volume 34. Zhou, H., Kadav, A., Shamsian, A., Geng, S., Lai, F., Zhao, L., Liu, T., Kapadia, M., and Graf, H. P. (2021). Composer: Compositional reasoning of group activity in videos with keypoint-only modality. arXiv preprint arXiv:2112.05892. Zhu, L. and Yang, Y. (2020). Actbert: Learning global-local video-text representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8746–8755. 31 CHAPTER 3 GROUP ACTIVITY PREDICTION WITH SEQUENTIALLY RELATIONAL ANTICIPATION MODEL 3.1 Background and Motivation Group activity prediction is to forecast an activity performed by a group of people before the activity ends. 
Different from group activity recognition, it only has access to the beginning frames of a video containing incomplete activity execution. It is useful in scenarios where the intelligent systems have to make prompt decisions, such as surveillance and traffic accident avoidance where multiple people are present. Unfortunately, existing action prediction methods Yao et al. (2018); Yan et al. (2017); Kong et al. (2017, 2014, 2018) are limited to actions performed by an individual. Even though some methods Kong et al. (2017, 2020); Wang et al. (2019) attempt to predict actions performed by multiple people in standard databases such as UCF101 Soomro et al. (2012), they simply model the multiple people as a single entity and ignore their relations. This would undoubtedly result in a low prediction performance. As shown in Kong et al. (2017), one of the major challenges in activity prediction is how to enhance the discriminative power of the features extracted from the partial observations. However, this is even more challenging to do so in group activity prediction as multiple people are present in the scene. Each person’s individual action may vary and people’s interactions frequently appear and change in a group activity. To this end, it is important to model the relations of multiple people in the observed frames and predict their future group representations. In addition, if only limited beginning frames are observed, it would be extremely difficult to directly anticipate the features of full observations at once. A temporally progressive anticipation model is desired for modeling activity evolution. To address these challenges, we propose a novel sequential relational anticipation model (SRAM) for group activity prediction by anticipating group activities and positions in the fu- ture (see Fig. 3.1). SRAM is developed as an encoder-decoder framework, in which an observation encoder summarizes the relational dynamics in these beginning observed frames and a sequential 32 decoder further anticipates the representations for group activities and positions occurring in the future. Specifically, the observation encoder naturally models the relational dynamics of people and complex interactions between people in the observed frames. To predict group activity, we propose a sequential decoder to anticipate the structured group representation in the future using several unrolling stages. Two graph auto-encoders are used in the sequential decoder to explicitly anticipate the activity and the position relations of people in the unobserved frames. We propose to make a sequential prediction that progressively anticipates the future group representation by performing multiple unrolling stages guided by three novel loss functions. This allows us to better capture complex group activity evolution. To our best knowledge, we are the first to investigate the challenging problem of group activity prediction. The benefit of our method is twofold. Firstly, it not only predicts people’s group activities but also predicts individuals’ positions in the future. Our experimental results show that predicting people’s future positions significantly helps predict their group activities. Secondly, the proposed method progressively anticipates structured group representations, which has shown to be very powerful in prediction especially when limited frames are observed. This idea could be generalized to other prediction tasks, e.g. human motion prediction Martinez et al. (2017) and video prediction Wichers et al. (2018). 
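The encoder-decoder flow described above can be summarized in a short skeleton, shown below. The module internals (observation encoder, the two auto-encoders, and the classifier) are placeholders that Chapter 3.3 specifies; the interfaces, and the reuse of a single auto-encoder module across stages, are simplifications for illustration only.

```python
import torch.nn as nn

class SRAMSkeleton(nn.Module):
    """High-level sketch of SRAM: encode the partial observation, then unroll
    the sequential decoder K times. Module internals are placeholders."""

    def __init__(self, encoder, activity_ae, position_ae, classifier, K=5):
        super().__init__()
        self.encoder = encoder          # observation encoder E (e.g., ST-GCN)
        self.activity_ae = activity_ae  # activity auto-encoder Ea-Da
        self.position_ae = position_ae  # position auto-encoder Ep-Dp
        self.classifier = classifier    # group activity classifier
        self.K = K                      # number of unrolling stages

    def forward(self, feats, boxes):    # feats: (t0, N, D), boxes: (t0, N, 2)
        z0 = self.encoder(feats, boxes)          # summary of the observed frames
        x_k, b_k = feats[-1], boxes[-1]          # start from the last observed frame
        for _ in range(self.K):                  # progressive anticipation
            x_k = self.activity_ae(x_k, b_k, z0)
            b_k = self.position_ae(x_k, b_k, z0)
        return self.classifier(x_k), b_k         # activity logits, predicted positions
```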
Our contributions can be summarized as follows: • We propose a novel sequential decoder to anticipate the representations for multiple people’s future positions and activity, aiming to learn a discriminative structural representation for group activity prediction. • We progressively anticipate the structured group representations at several unrolling stages guided by novel loss functions. This improves the performance when only few frames are observed. • Extensive experiments demonstrate that our method outperforms the existing state-of-the-arts by a large margin. 33 Figure 3.1 Given the beginning frames, our method models the relational dynamics of a group, and predicts a group activity by anticipating the group activity representation and their positions occurring in the future unobserved frames. 3.2 Related Work Action Prediction aims to recognize the label of an action before the action is fully executed. Existing work Lan et al. (2014); Cai et al. (2019); Kong et al. (2018); Hu et al. (2018); Sadegh Ali- akbarian et al. (2017); Ma et al. (2016); Zhao and Wildes (2019); Shi et al. (2018); Vondrick et al. (2016) focuses on predicting actions performed by an individual. Ryoo (2011) used integral and dynamic bag-of-words to represent features variations over time. DeepSCN Kong et al. (2017) and AAPNet Kong et al. (2020) make use of sequential context information by transferring knowledge in full videos to partial observations. Wang et al. (2019) developed a teacher-student learning framework to distill knowledge from the action recognition task, in order to enhance action predic- tion. Gammulle et al. (2019) presented a jointly learnt task for both action prediction and future motion representation inference. Prediction on interactions between two people was studied in Yan et al. (2017); Yao et al. (2018). Yan et al. (2017) developed a tri-coupled recurrent structure and an attention mechanism to address action prediction for two individuals’ interactions. Yao et al. (2018) predicted the motion of the interactions between two people, but did not predict their interaction labels. Different from them, we focus on the prediction of group activities involving multiple people. Our method elegantly captures complex relational dynamics between people for learning discriminative information. Group Activity Recognition has been extensively studied in previous work Amer et al. (2014); Shu et al. (2017); Ibrahim et al. (2016); Yan et al. (2018a). Early work applies graphical models on 34 observedfuturefuture group representationsobserved group representationsactivity and position anticipationEncoderDecoderSRAM the extracted hand-craft features Amer et al. (2014); Lan et al. (2012); Wang et al. (2013) as group representations. Deep learning methods for multi-people activity recognition have shown excellent performance Shu et al. (2017); Bagautdinov et al. (2017); Wang et al. (2017); Ibrahim and Mori (2018); Deng et al. (2016); Tang et al. (2018); Gavrilyuk et al. (2020). HDTM Ibrahim et al. (2016) develops a two-stage LSTM model to firstly extract features of temporal individual motions and then aggregate neighborhood information. SSU Bagautdinov et al. (2017) achieves the individual detection and group activity recognition in a unified framework. Recent work suggests that only part of people’s motions contribute to the entire group activity Yan et al. (2018a); Gammulle et al. (2018); Ramanathan et al. (2016); Azar et al. (2019); Hu et al. (2020), via suppressing the irrelevant actions. 
Previous work also shows that interactions between people are important in understanding group activity. For example, HRNIbrahim and Mori (2018) introduces a hierarchical spatial relational layer to learn the relational representations between two players. Other methods, including Stagnet Qi et al. (2018), S-RNN Biswas and Gall (2018), SBGAR Li and Choo Chuah (2017) apply structural-RNN to obtain spatiotemporal features. ARG Wu et al. (2019) explicitly models the interactions by employing graph convolution on a learnable graph. The main difference between our work and group activity recognition methods is that we aim at predicting the group activity label given incomplete activity execution, while these methods are given complete activity executions. This prompts us to develop novel model architecture and loss functions in this work. 3.3 Proposed Method – Sequential Relational Anticipation Model Problem formulation. Our goal is to predict the activity label 𝑦 of a group of people given a partial observation of a video containing incomplete activity execution. We define the observation ratio as the number of observed frames 𝑡0 in a streaming video divided by the total number of frames 𝑇 in the corresponding full video following Kong et al. (2017, 2020), i.e., 𝑡0/𝑇. For instance, if a partial video contains 30 frames and the corresponding full video contains 100 frames, then the observation ratio of this activity is 30%. During training, we have access to all full training videos containing complete group activity 35 Figure 3.2 Overall architecture. Our framework SRAM takes the beginning 𝑡0 observed frames as input and predicts the group activity label. An observation encoder first summarizes the relational dynamics in partial observation as a latent variable 𝑍0. Then, a sequential decoder takes over 𝑍0 and progressively anticipates the group representation through 𝐾 unrolling stages. The output of the last unrolling stage is expected to contain rich discriminative information for group activity prediction. Details can be seen in Fig. 3.3. executions. These full videos are supposed to contain all the discriminative information for classification. During test, given a partial observation of a group activity execution, we encourage our model to anticipate the group representations that contain similar amount of discriminative information as the corresponding full observation. Thus, its prediction power can be enhanced. Overall architecture. The overall architecture is shown in Fig. 3.2. We formulate our group activity prediction model as an encoder-decoder framework that contains an observation encoder and a sequential decoder. Given a partial observation containing 𝑡0 frames, the observation encoder summarizes the relational dynamics of the group from the partial observation and then the sequential decoder anticipates the group representation for activities and positions in the future unobserved frames. Due to the large motion variations between a partial and a full observation, a novel sequential decoder is proposed in this work to progressively anticipate the structured group representation for the future unobserved frames by several unrolling stages. This is useful for enhancing the discriminative power of the anticipated representation, especially if limited frames are observed. Moreover, for group activity, relations between multiple people are discriminative information and they vary as time. 
To predict group activity, our sequential decoder uses two graph auto-encoders to concurrently perform relational anticipation on both people's activity features and their positions.

3.3.1 Relation Modeling for Group Activity

Given t_0 observed frames, we first extract features of all the observed frames, and then apply ROIAlign He et al. (2017) to extract the feature vectors of multiple people based on their positions {B_1, B_2, ..., B_{t_0}}. The action features and position of the i-th individual in the t-th frame (t ∈ {1, ..., t_0}) are represented as x_t(i) and b_t(i), respectively. Afterwards, on top of the individual dynamics, we follow Wu et al. (2019) to explicitly model the pair-wise position relations and action relations of multiple people in the observed frames as two relation graphs G^a_t ∈ R^{N×N} and G^p_t ∈ R^{N×N}, respectively. Both graphs have N nodes representing the N people in the t-th frame. Given the i-th and j-th individuals, the edge of the action similarity graph G^a_t(i, j) is computed by the cosine similarity and normalized by a softmax function, and the edge of the position relation graph G^p_t(i, j) is computed from the normalized Euclidean distance (denoted by d(·, ·)):

G^a_t(i, j) = \frac{\exp(x_t(i)^\top x_t(j))}{\sum_{j=1}^{N} \exp(x_t(i)^\top x_t(j))}, \qquad G^p_t(i, j) = \frac{1 / d(b_t(i), b_t(j))}{\sum_{j=1}^{N} 1 / d(b_t(i), b_t(j))}.    (3.1)

Once the graphs are built, we obtain structured representations of the group activity in the observed frames. We will also perform anticipation on the two graphs representing the group activity in the unobserved frames, as discussed below.

3.3.2 Observation Encoder E

The observation encoder E is proposed to summarize the spatiotemporal information of the complex relational dynamics of multiple people in the partial observation containing t_0 frames. E learns to map G^a_{1:t_0}, G^p_{1:t_0}, and X_{1:t_0} to a latent variable Z_0 using the spatio-temporal graph convolutional network ST-GCN Yan et al. (2018b). Specifically, it first performs spatial graph convolution Kipf and Welling (2017) on the two graphs G^p_t and G^a_t for the t-th frame,

\sigma(G^p_t X_t W_p) + \sigma(G^a_t X_t W_a),    (3.2)

and then performs temporal convolution Lea et al. (2017) on every three consecutive frames to learn the latent variable Z_0. Here, σ is the ReLU activation, W_p and W_a are learnable weights, and X_t is the matrix of action features of the N people. The latent variable Z_0 will be integrated into the sequential decoder and guides its unrolling stages.
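A minimal sketch of the relation graphs in Eq. (3.1) and of a single spatial graph-convolution step from Eq. (3.2) is given below. It operates on one frame, handles the zero self-distance on the diagonal with a small epsilon, and omits the temporal convolution, so it is illustrative rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def relation_graphs(x_t, b_t, eps=1e-6):
    """Eq. (3.1): action-similarity and position-relation graphs for one frame.
    x_t: (N, D) individual action features; b_t: (N, 2) individual positions."""
    sim = x_t @ x_t.t()                          # pairwise feature similarity
    G_a = F.softmax(sim, dim=-1)                 # row-normalized action graph
    dist = torch.cdist(b_t, b_t) + eps           # eps avoids dividing by the zero self-distance
    inv = 1.0 / dist
    G_p = inv / inv.sum(dim=-1, keepdim=True)    # row-normalized position graph
    return G_a, G_p

def spatial_gcn(G_a, G_p, X_t, W_a, W_p):
    """Eq. (3.2): one spatial graph-convolution step of the observation encoder."""
    return F.relu(G_p @ X_t @ W_p) + F.relu(G_a @ X_t @ W_a)
```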
Figure 3.3 Sequential decoder D is formulated as two auto-encoders Ea-Da and Ep-Dp that progressively anticipate the group activity representation for future unobserved frames using multiple unrolling stages. At the k-th stage, D is fed with the summary of the partial observation encoded in the latent variable Z_0 as well as the action features X̂_k and the position features B̂_k from the previous stage. Then D anticipates the action features X̂_{k+1} and positions B̂_{k+1}. Different from ARG Wu et al. (2019), our model captures the temporal patterns of people's relations, which is useful for group activity prediction.

3.3.3 Sequential Decoder D

The performance of state-of-the-art action prediction methods Kong et al. (2020, 2017) is still limited, especially when only a few beginning frames are given. This is mainly because they map the partial observation to the corresponding full observation directly in one pass, which is not powerful enough to deal with the large visual variations between partial and full observations. In this chapter, we propose a sequential decoder that progressively anticipates a group representation expected to contain as much discriminative information as the fully observed activity, using K unrolling stages (see Fig. 3.3). This allows us to create a more powerful model for group activity prediction. Besides, different from individual action prediction methods Kong et al. (2020); Wang et al. (2019), people's relations, formulated as graphs using Eq. (3.1), are discriminative information for group activity. Moreover, the group activity varies over time. It is therefore necessary to predict group representations by anticipating relations in the unobserved stage. As described in Chapter 3.3.1, people's relations can be inferred from their action similarity and relative positions. For example, given a partial observation of a volleyball activity that contains the run-up of the ace spikers and the waiting gestures of their opponents, our model is supposed to predict it as "spiking" from the cue that the players are moving towards the net with their actions. Therefore, we develop the sequential decoder as a mixture of two graph auto-encoders: an activity auto-encoder Ea-Da for predicting activity representations and a position auto-encoder Ep-Dp for predicting the positions of multiple people. The two auto-encoders are coupled by the shared latent variable Z_0 learned from the partial observation.

Activity auto-encoder Ea-Da. Using K activity auto-encoders, the proposed sequential decoder progressively anticipates the activity representation over K unrolling stages. The activity auto-encoder at the k-th stage (k ∈ {1, 2, ..., K}) is fed with the output X̂_k of the activity auto-encoder at the previous (k−1)-th stage. We use the spatiotemporal action features at the last observed frame t_0 as the input of the activity auto-encoder at stage k = 1. We encode the input X̂_k of the current unrolling stage to a latent variable Z^a_k by

Z^a_k = \sigma(G^p_k \hat{X}_k U_{ep}) + \sigma(G^a_k \hat{X}_k U_{ea}),    (3.3)

and then decode the activity representation X̂_{k+1}:

\hat{X}_{k+1} = \sigma(G^a_k (Z_0 + Z^a_k) U_{da}) + \sigma(G^p_k (Z_0 + Z^a_k) U_{dp}),    (3.4)

where U_{ep}, U_{ea}, U_{dp}, U_{da} are learnable parameters. X̂_{k+1} is the anticipated group feature at the k-th stage and serves as the input to the activity auto-encoder at the (k+1)-th stage. The anticipation of X̂_{k+1} is conditioned on the latent variables Z^a_k and Z_0, in order to both keep track of the short-term information of the previous unrolling stage and use the long-term spatiotemporal information in the partial observation. G^a_k and G^p_k are computed from the generated activity features and positions at the k-th stage using the same functions as Eq. (3.1) (replacing the time step t with the stage k). The benefits of the progressive anticipation using K unrolling stages lie in two aspects. First, the temporal dependency of activity evolution is naturally built between successive stages. This allows us to naturally anticipate structured group activity representations for prediction purposes. Second, the prediction granularity can be controlled with the number of unrolling stages K. The case of K = 1 is equivalent to the existing one-pass solution used in Kong et al. (2020, 2017).
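A minimal sketch of one unrolling stage of the activity auto-encoder, following Eqs. (3.3)-(3.4), is shown below; tensor shapes and weight handling are simplified for illustration.

```python
import torch.nn.functional as F

def activity_stage(G_a, G_p, X_k, Z0, U_ea, U_ep, U_da, U_dp):
    """One unrolling stage of the activity auto-encoder (Eqs. 3.3-3.4).
    G_a, G_p: (N, N) relation graphs at stage k; X_k: (N, D) current activity
    features; Z0: latent summary of the partial observation (same shape as Z_a);
    U_*: learnable weight matrices."""
    # Encode (Eq. 3.3): project the current features through both relation graphs.
    Z_a = F.relu(G_p @ X_k @ U_ep) + F.relu(G_a @ X_k @ U_ea)
    # Decode (Eq. 3.4): condition on Z0 to reuse the long-term observed context.
    X_next = F.relu(G_a @ (Z0 + Z_a) @ U_da) + F.relu(G_p @ (Z0 + Z_a) @ U_dp)
    return X_next, Z_a
```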
Position auto-encoder Ep-Dp. As described in Chapter 3.3.1, the interactions between two people also depend on their relative positions. Thus, it is necessary to explicitly anticipate the positions of these people for group activity prediction. Similar to the activity auto-encoder, the proposed sequential decoder also performs K unrolling stages of position prediction for the group of people using K position auto-encoders. The position auto-encoder at stage k is fed with the output of the auto-encoder at stage k−1 and outputs the positions of the people. Experimental results in Chapter 3.4.4 show that the anticipated future positions of people help improve the performance of group activity prediction. The position auto-encoder first encodes the positions B̂_k of the multiple people to a latent variable Z^p_k at stage k through graph convolution Kipf and Welling (2017):

Z^p_k = \sigma(G^p_k \hat{B}_k V_{ep}) + \sigma(G^a_k \hat{B}_k V_{ea}),    (3.5)

and then decodes the positions B̂_{k+1} for the next stage by

\hat{B}_{k+1} = \sigma(G^a_k (Z_0 + Z^p_k) V_{da}) + \sigma(G^p_k (Z_0 + Z^p_k) V_{dp}),    (3.6)

where V_{ep}, V_{ea}, V_{dp}, V_{da} are learnable parameters. G^p_k and G^a_k are the same graphs used in the activity auto-encoder. The anticipation of B̂_{k+1} is conditioned on the latent variables Z^p_k and Z_0, in order to both keep track of the short-term position information of the previous unrolling stage and use the long-term spatiotemporal information in the partial observation. Position prediction also benefits from sequential prediction via several unrolling stages, since the prediction granularity can be controlled. Similar to the activity auto-encoder, the position auto-encoder at stage k = 1 takes the positions B_{t_0} of the people in the last observed frame as input. The activity auto-encoder and the position auto-encoder share the same graphs G^p_k and G^a_k and are both conditioned on the latent variable Z_0 (see Fig. 3.3).

3.3.4 Feature Aggregation for Prediction

SRAM returns both group activity and position representations at each of the K unrolling stages. The K-th stage corresponds to the full-observation status, which contains the most discriminative information about an activity. We disregard the outputs given by the activity auto-encoders from the 1st to the (K−1)-th stages and perform max-pooling on the output X̂_{K+1} given by the activity auto-encoder at the K-th stage as the group activity representation. The resulting feature vector is used for group activity prediction. Similarly, we directly use the output B̂_{K+1} given by the K-th position auto-encoder to perform position prediction.

3.3.5 Loss Functions and Model Learning

Adversarial loss. Inspired by Goodfellow et al. (2014), we encourage SRAM to generate representations corresponding to ground-truth full observations. We use two discriminators for the features generated by the sequential decoder. Discriminator D_1 is an activity classifier implemented by a softmax layer; L_cls is computed on the output of D_1. Discriminator D_2 is an adversarial regularizer that tells the difference between the generated group features X̂_{1:K} and the group features of full videos F_{1:K}(X). Using the adversarial loss, SRAM is encouraged to generate features that are indistinguishable from the group features of the corresponding full videos:

\mathcal{L}_{\mathrm{GAN}} = \mathbb{E}_{X_{1:T} \sim p_{\mathrm{data}}(X_{1:T})} \log D_2(F_{1:K}(X)) + \mathbb{E}_{X_{1:t_0} \sim p_{\mathrm{data}}(X_{1:t_0})} \log\big(1 - D_2(\hat{X}_{1:K})\big).    (3.7)

Note that the generated group representation X̂_{1:K} is computed by SRAM S from the partial observation X_{1:t_0}.
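A minimal sketch of how the two sides of Eq. (3.7) could be optimized is shown below; it uses the common non-saturating generator form and assumes D_2 outputs a probability via a sigmoid, which are implementation choices for illustration rather than necessarily the exact ones used by SRAM.

```python
import torch

def adversarial_losses(D2, real_feats, fake_feats, eps=1e-8):
    """Sketch of Eq. (3.7). real_feats: F_{1:K}(X) from full videos;
    fake_feats: X_hat_{1:K} anticipated by SRAM. D2 is assumed to output the
    probability that its input comes from a full video."""
    real_p = D2(real_feats)
    fake_p = D2(fake_feats.detach())                       # no gradient into SRAM here
    d_loss = -(torch.log(real_p + eps).mean()
               + torch.log(1.0 - fake_p + eps).mean())     # D2 maximizes Eq. (3.7)
    g_loss = -torch.log(D2(fake_feats) + eps).mean()       # SRAM tries to fool D2
    return d_loss, g_loss
```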
Sequential reconstruction loss. This loss is proposed to align the predicted activity representations with the ground-truth activity representations at each unrolling stage. Since our method performs K-stage sequential prediction, it is necessary to encourage the predicted group representations X̂_{1:K} at each of the K unrolling stages to become close to the ground-truth features at the corresponding timestamp. This is different from the adversarial loss, which only aligns the generated features with the ground truth at the full-observation stage. We train a separate ST-GCN F(·) as a recognition model to obtain the group activity representations of the full videos X for training. The resulting frame-wise group representations F(X) are used to encourage the activity features of the i-th person generated at the k-th unrolling stage to be similar to the ground-truth features using

\mathcal{L}_{\mathrm{rec}} = \frac{1}{K \times N} \sum_{k=1}^{K} \sum_{i=1}^{N} \| \hat{\mathbf{x}}_k(i) - F_k(X, i) \|_2.    (3.8)

Here, F_k(X, i) is the feature of the i-th person of the full video at the k-th stage. This loss function sequentially computes the difference between the predicted features x̂_k(i) (the i-th row of X̂_k) for the i-th person at unrolling stage k and the ground-truth features F_k(X, i), mimicking how a partial observation progressively approaches its corresponding full observation.

Position regression loss. We use the tracklets of individuals provided by Ibrahim and Mori (2018) as the ground truth of the individuals' positions. During training, we use the mean squared error between the predicted positions and the ground-truth positions over the K unrolling stages as the loss function:

\mathcal{L}_{\mathrm{reg}} = \frac{1}{K \times N} \sum_{k=1}^{K} \sum_{i=1}^{N} \| \hat{\mathbf{b}}_k(i) - \mathbf{b}_k(i) \|_2,    (3.9)

where the predicted position b̂_k(i) is the i-th row of B̂_k, i.e., the i-th person's position predicted by the sequential decoder at the k-th stage.

Model learning. During training, the overall objective function is written as the sum of the sequential reconstruction loss L_rec, the adversarial loss L_GAN, the classification loss L_cls implemented by a softmax loss, and the position regression loss L_reg:

\min_{\mathcal{E}, \mathcal{D}} \max_{D_1, D_2} \; \mathcal{L}_{\mathrm{rec}} + \mathcal{L}_{\mathrm{GAN}} + \mathcal{L}_{\mathrm{cls}} + \mathcal{L}_{\mathrm{reg}}.    (3.10)

The sequential relational anticipation model (E, D) and the two discriminators (D_1, D_2) are alternately trained until convergence.

3.3.6 Discussion

Group activity modeling and anticipation. Our SRAM captures the interactions of multiple people in the observation encoder and anticipates their future relations with a sequential decoder. This is different from existing action prediction methods Kong et al. (2020); Qi et al. (2018), which can only predict the action of an individual. We believe such a method will pave the way for future research on other structured visual prediction tasks.

Structured sequential prediction. Compared with group activity recognition methods Wu et al. (2019); Qi et al. (2018); Ibrahim and Mori (2018), our method performs sequential prediction of group activity, in the form of future positions and activity representations. Our activity prediction is also facilitated by explicitly predicting people's future positions.

Activity evolution over time. Our sequential decoder progressively predicts future representations through several unrolling stages, which boosts performance when only a few frames are observed.
It is guided by a sequential reconstruction loss, mimicking how a partial observation is sequentially approaching its full observation and an adversarial loss to make the generated full observation features to become indistinguishable from the real full observation features. 3.4 Group Activity Prediction Evaluation 3.4.1 Datasets Volleyball Dataset Ibrahim et al. (2016) consists of 4830 video clips distributed in 8 group activities, such as left spiking and right setting. Each clip has 41 frames. Ibrahim et al. (2016) provides the players’ tracklets and splits the dataset into training, validation and testing sets. Existing group activity recognition methods Wu et al. (2019); Ibrahim et al. (2016); Ibrahim and Mori (2018); Qi et al. (2018) use the middle 10 frames of each video. To generalize it to prediction task, we extend it to use the middle 20 frames as full observations, in order to model sequential dynamics. Note that the middle 20 frames contain complete group activity executions, because athletes generally move quickly to complete a group activity, such as direct spiking in a volleyball game. Collective Activity Dataset (CAD) Choi et al. (2009) contains 44 videos with 5 group activities, including crossing, queueing, walking, talking and waiting. The group activities in CAD are labeled as the majority of people’s individual actions. We use the existing tracklet information and training/testing splits following Wu et al. (2019). The number of the frames in videos ranges from 100 to 2000. Following Qi et al. (2018); Wu et al. (2019); Bagautdinov et al. (2017), we divide each video into 10-frame video clips. This expands training and testing data to 1746 and 765 clips, respectively. CAD mainly contains periodic activities such as walking, in which significant changes 43 can be seen in 10 frames. 3.4.2 Implementation Details Following Wu et al. (2019), we extract a 1024-dimensional feature vector for each individual with tracklets provided by Ibrahim et al. (2016), using Inception-v3 Szegedy et al. (2016) as backbone and ROIAlign He et al. (2017). We use three steps for training: First, Inception-v3 pretrained on ImageNet is fine-tuned on single frames by jointly predicting individual actions and group activities. Then, we freeze the backbone and finetune the recognition model 𝐹 (·) given full videos in the training set. The recognition model contain two ST-GCN layers Yan et al. (2018b), both with 256-d hidden units. After that, we train the proposed model. The observation encoder has two layers ST-GCN with both 256-d hidden units. The activity auto-encoder’s encoder has one graph convolution layer that encodes the input into 256-d latent feature space. The position auto-encoder has two-layer graph convolution by encoding the 2-d positions into 64-d space and then 256-d latent space. During training, SRAM plus classifier 𝐷1 and discriminator 𝐷2 are alternatively updated. The experiments are conducted with 10 different observation ratios ranging from 10% to 100% of full videos length. The number of unrolling stages 𝐾 is set to 5. We use stochastic gradient descent for optimization. For Volleyball dataset, the three steps are trained for 30 epochs, 10 epochs and 20 epochs with learning rate 0.001, 0.001, 0.0001 respectively. For Collective Activity Dataset, the three steps are trained for 20 epochs, 50 epochs and 10 epochs with learning rate 0.0001, 0.0001, 0.0005, respectively. 
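As a small illustration of the evaluation protocol, the sketch below forms a partial observation from a full clip for a given observation ratio; the rounding convention and the minimum of one frame are assumptions, not necessarily the exact protocol used in the experiments.

```python
def partial_observation(frames, ratio):
    """Return the beginning portion of a full clip for a given observation ratio.
    frames: a sequence of T frames (or per-frame features); ratio: value in (0, 1]."""
    T = len(frames)
    t0 = max(1, int(round(ratio * T)))   # rounding convention is an assumption
    return frames[:t0]

# Example: a 20-frame Volleyball clip at a 30% observation ratio keeps its first 6 frames;
# sweeping ratio over 0.1, 0.2, ..., 1.0 reproduces the ten evaluation settings.
```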
3.4.3 Comparison with State-of-the-art We compare our method with the state-of-the-art prediction methods LRCN Tran et al. (2015), DeepSCN Kong et al. (2017), IBoW and DBoW Ryoo (2011), KD Wang et al. (2019), AAPNet Kong et al. (2020) and state-of-the-art group activity recognition methods, including HRN Ibrahim and Mori (2018), HDTM Ibrahim et al. (2016) ARG Wu et al. (2019), SSU Bagautdinov et al. (2017). Following these methods’ original setting, LRCN and HDTM adopt the AlexNet Krizhevsky et al. (2012) as the backbone. HRN, IBow, DBow, DeepSCN and original AAPNet use VGG-19. Our 44 Tracklet Backbones Models LRCN HDTM IBoW DBoW DeepSCN HRN KD AAPNet e-AAPNet SSU ARG Ours No Yes No No No Yes No No Yes Yes Yes Yes AlexNet AlexNet VGG VGG VGG VGG VGG VGG 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 48.17 51.61 54.67 57.44 59.76 61.23 63.75 64.32 64.77 65.37 52.43 59.09 66.04 76.37 80.48 81.82 84.07 84.47 84.60 84.06 58.03 60.72 64.84 65.26 67.51 70.80 73.45 74.24 74.29 75.63 58.03 55.56 56.16 58.93 59.90 61.97 63.79 63.06 63.88 64.78 59.46 62.23 65.52 70.38 72.55 77.37 79.75 80.35 80.31 80.78 52.58 56.99 64.32 74.49 76.96 80.36 83.72 84.74 84.08 85.30 65.67 67.68 70.00 70.83 71.96 72.10 73.22 73.30 73.30 73.90 59.53 65.37 68.29 72.25 75.24 77.79 79.91 80.25 80.18 80.78 InceptionV3 62.98 70.31 77.64 83.55 84.91 85.86 87.54 87.23 87.92 89.01 InceptionV3 63.20 70.65 79.66 84.07 87.13 87.65 88.30 88.18 88.41 89.01 InceptionV3 64.82 69.41 76.07 79.43 82.70 83.99 85.04 85.19 85.86 85.94 InceptionV3 77.86 82.57 84.97 87.06 88.63 88.93 89.08 88.93 88.48 91.97 Table 3.1 Group activity prediction accuracy (%) on Volleyball dataset with observation ratios ranging from 10% to 100%. Group activity recognition results can be seen from the last column, in which 100% frames are observed. method follows ARG and SSU to use Inception-V3 method. HRN, HDTM, ARG, SSU and our method adopt the tracklets of players provided by Ibrahim et al. (2016). To make a fair comparison, we extend state-of-the-art action prediction method AAPNet (“e-AAPNet” for simplification) to make use of tracking information and use Inception-V3 as backbone. We train all of the comparison methods using the parameters described in their original papers. 3.4.3.1 Results on Volleyball dataset. Table 3.1 summarizes the prediction performance of the proposed method, existing action prediction methods and group activity recognition methods. Results demonstrate that our model outperforms the comparison methods. Existing action prediction methods, e.g., LRCN, IBoW, DBoW, DeepSCN, AAPNet, KD propose to improve the prediction performance by information transfer. However, they regard multiple people as a single entity and do not consider the interactions between multiple people. Thus, the extracted features do not contain informative cues of the interactions of people, resulting in a low prediction performance. The proposed method uses tracklets Ibrahim et al. (2016), while the existing predictors for individuals e.g. LRCN, IBoW, DBoW, DeepSCN, AAPNet, KD do not. To make a fair comparison, we extend AAPNet to use tracklet information. Experimental results show that our method can predict the dynamics of interactions and better enrich partial observations. 45 Models ARG Wu et al. (2019) DeepSCN Kong et al. (2017) AAPNet Kong et al. (2020) e-AAPNet Kong et al. 
(2020) Ours Tracklet Yes No No Yes Yes 50% 88.10 81.31 81.57 86.01 92.55 100% 88.37 82.22 82.75 86.67 92.81 Table 3.2 Prediction accuracy (%) on Collective Activity Dataset with observation ratios 50% and 100%. Group activity recognition methods such as HDTM, SSU, HRN, ARG do not have capability of gaining extra information from full activity executions. Thus, when the observation ratio is very low (10% or 20% observations), their performance is much lower than our method. Note that ARG applies random sampling strategy by sampling three frames from an entire video as input. In the comparison experiment, this strategy is applied in each of the partial observations as input. The proposed method consistently outperforms ARG, as our method captures the temporal dynamics of multiple people in the group, and sequentially generates features close to the corresponding full observations. It improves the representation power of the partial observations, and facilitates group activity prediction. 3.4.3.2 Results on Collective Activity Dataset. Comparison results are listed in Table 3.2. Our method outperforms existing methods ARG, DeepSCN, and AAPNet by a large margin. Given tracklets as input, our method is 6.54% higher than e-AAPNet at 50% observation ratio since the people’s actions and relations are predicted in our model. Group activities such as group walking are cyclic, and thus the prediction performance of our method at 50% observation ratio is close to the one at 100% observation ratio. 3.4.4 Ablation Study We perform detailed ablation studies on the Volleyball dataset to evaluate the contributions of the sequential prediction strategy, as well as the loss functions. 3.4.4.1 How much does the prediction loss help? The impacts of loss functions are analyzed on Volleyball dataset in detail. The evaluation results can also validate the contributions of the proposed sequential prediction strategy. We compare the following the variants, including: (1) without the position regression loss Lreg defined in Eq. (3.9). 46 In this variant, the position auto-encoder for predicting future positions is not used. During sequential prediction, we replace the individuals’ positions in the future frames by the ones given by the last observed frame’s. The positions are used for computing 𝐺p for each unrolling stage. (2) without adversarial loss LGAN. (3) without sequential reconstruction loss Lrec for generated features of unrolling stages. (4) The proposed full network. Compared with variant (1), the significant performance gains with all different observation ratios show that the prediction of people’s positions is of high importance for group activity prediction. Compared with variants (2) and (3), it shows that the adversarial loss LGAN and the reconstruction loss Lrec in our method improve the performance by 0.55% and 0.63% on average, respectively. Therefore, the proposed sequential decoder guided by LGAN and Lrec can generate more discriminative activity representations at each stage. 3.4.4.2 How much does the sequential prediction help? Our sequential decoder predicts group activity representations at 𝐾 unrolling stages. In this ex- periment, we evaluate the effect of the number of unrolling stages 𝐾 on the prediction performance. We set 𝐾 to 1, 2, 5, and 10, and compare the prediction performance. Table 3.3 indicates that the best overall prediction performance is achieved when 𝐾 = 5. 
The prediction performance is slightly affected when 𝐾 = 10, but the computational complexity of the prediction model is increased due to the extra unrolling stages. The average prediction performance drops to 81.47% if 𝐾 = 1. The variant with 𝐾 = 1 is the one that directly maps partial observation in one unrolling stage, similar to what Kong et al. (2020, 2017) do. The result demonstrates the superiority of our progressive prediction in anticipating discriminative group representations given partial observations. If more stages are allowed (𝐾 = 5 or 𝐾 = 10), the sequential decoder in our model can progressively generate discriminative features for group activity prediction even though it is given very limited frames. Therefore, its prediction performance is improved. 47 Loss (1)LGAN+Lrec+Lcls (2) Lrec+Lreg+Lcls (3) Lreg+LGAN+Lcls (4) Ours 10% 75.09 76.14 77.61 77.86 (a) 40% 85.59 85.79 83.22 87.67 70% 87.06 88.18 85.64 89.08 Average 84.85 86.30 86.22 86.85 𝐾 1 2 5 10 10% 70.38 72.36 77.86 77.93 40% 80.02 86.59 87.06 86.69 (b) 70% 86.14 89.07 89.08 89.23 Average 81.47 85.22 86.85 86.79 Table 3.3 Ablation studies on volleyball dataset. We show the accuracy(%) given videos of observation ratio at 10%, 40%, 70%. Figure 3.4 Visualization of position predictions. The blue and the yellow lines denote the prediction positions and ground-truth positions Ibrahim et al. (2016), respectively. “×” indicates the starting point of the movement. Best viewed in color. 3.4.5 Position Prediction Evaluation 3.4.5.1 Visualization of predicted positions As shown in Fig. 3.4, we visualize the movement of individuals learned by the position auto- encoder in SRAM. The position auto-encoder progressively predicts the positions of individuals at the unrolling stages. The visualization result shows our position auto-encoder can successfully predict the directions and step sizes of individuals in the future based on partial observations. Although Fig. 3.4 (bottom-right) shows the direction of the predicted movement is mostly accurate, the future position of a person is not accurate if the person moves very fast. 3.4.5.2 Quantitative evaluation We quantitatively evaluate our position prediction results compared to two popular trajectory prediction methods SocialGAN Gupta et al. (2018) and SocialLSTM Alahi et al. (2016). Following SocialGAN, Final Displacement Error (FDE) is used to compute the Euclidean distance between the predicted positions and ground-truth positions at the final timestamp and Average Displacement Error (ADE) is used to compute that at each unrolling stage. 48 left-spikingleft-winpointright-passingright-spiking Method SocialGAN Gupta et al. (2018) SocialLSTM Alahi et al. (2016) Ours FDE 5.32 6.44 3.62 ADE 3.05 4.44 2.44 Table 3.4 Final Displacement Error (FDE) and Average Displacement Error (ADE) for position prediction. As shown in Tab. 3.4, the results demonstrate that our method can accurately predict the future positions for a group of people, and our method outperforms the two trajectory prediction methods. This is mainly because we capture the relational action dynamics of multiple people while SocialGAN and SocialLSTM do not. 3.5 Summary We have proposed a novel sequential relational anticipation model (SRAM) to predict group activity given only the beginning frames of an activity execution. Our model captures complex relational dynamics of multiple people in the observed frames. It then anticipates the group representations including group activity features and position features. 
A novel sequential decoder is proposed to progressively anticipate the group representations through several unrolling stages. Extensive results on two datasets demonstrate that our method significantly outperforms the state- of-the-art methods. Results also validate that the progressive anticipation using multiple unrolling stages facilitates group activity prediction. Further experimental results show that the modeling and prediction of people’s positions improve our performance on group activity prediction. 49 BIBLIOGRAPHY Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., and Savarese, S. (2016). Social lstm: Human trajectory prediction in crowded spaces. In CVPR, pages 961–971. Amer, M. R., Lei, P., and Todorovic, S. (2014). Hirf: Hierarchical random field for collective activity recognition in videos. In ECCV, pages 572–585. Azar, S. M., Atigh, M. G., Nickabadi, A., and Alahi, A. (2019). Convolutional relational machine for group activity recognition. In CVPR, pages 7892–7901. Bagautdinov, T., Alahi, A., Fleuret, F., Fua, P., and Savarese, S. (2017). Social scene understanding: End-to-end multi-person action localization and collective activity recognition. In CVPR, pages 4315–4324. Biswas, S. and Gall, J. (2018). Structural recurrent neural network (srnn) for group activity analysis. In WACV, pages 1625–1632. Cai, Y., Li, H., Hu, J.-F., and Zheng, W.-S. (2019). Action knowledge transfer for action prediction with partial videos. pages 8118–8125. Choi, W., Shahid, K., and Savarese, S. (2009). What are they doing?: Collective activity classifica- tion using spatio-temporal relationship among people. In ICCV Workshops, pages 1282–1289. Deng, Z., Vahdat, A., Hu, H., and Mori, G. (2016). Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition. In CVPR, pages 4772–4781. Gammulle, H., Denman, S., Sridharan, S., and Fookes, C. (2018). Multi-level sequence gan for group activity recognition. In Asian Conference on Computer Vision, pages 331–346. Gammulle, H., Denman, S., Sridharan, S., and Fookes, C. (2019). Predicting the future: A jointly learnt model for action anticipation. In ICCV, pages 5562–5571. Gavrilyuk, K., Sanford, R., Javan, M., and Snoek, C. G. M. (2020). Actor-transformers for group activity recognition. In CVPR. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In NIPS, pages 2672–2680. Gupta, A., Johnson, J., Fei-Fei, L., Savarese, S., and Alahi, A. (2018). Social gan: Socially acceptable trajectories with generative adversarial networks. In CVPR, pages 2255–2264. He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In ICCV. Hu, G., Cui, B., He, Y., and Yu, S. (2020). Progressive relation learning for group activity 50 recognition. Hu, J.-F., Zheng, W.-S., Ma, L., Wang, G., Lai, J.-H., and Zhang, J. (2018). Early action prediction by soft regression. TPAMI. Ibrahim, M. S. and Mori, G. (2018). Hierarchical relational networks for group activity recognition and retrieval. In ECCV, pages 721–736. Ibrahim, M. S., Muralidharan, S., Deng, Z., Vahdat, A., and Mori, G. (2016). A hierarchical deep temporal model for group activity recognition. In CVPR, pages 1971–1980. Kipf, T. N. and Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In International Conference on Representation Learning (ICLR). Kong, Y., Gao, S., Sun, B., and Fu, Y. (2018). 
Action prediction from videos via memorizing hard-to-predict samples. In AAAI. Kong, Y., Kit, D., and Fu, Y. (2014). A discriminative model with multiple temporal scales for action prediction. In ECCV, pages 596–611. Kong, Y., Tao, Z., and Fu, Y. (2017). Deep sequential context networks for action prediction. In CVPR, pages 1473–1481. Kong, Y., Tao, Z., and Fu, Y. (2020). Adversarial action prediction networks. TPAMI. Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convo- lutional neural networks. In NIPS, pages 1097–1105. Lan, T., Chen, T.-C., and Savarese, S. (2014). A hierarchical representation for future action prediction. In ECCV, pages 689–704. Lan, T., Sigal, L., and Mori, G. (2012). Social roles in hierarchical models for human activity recognition. In CVPR, pages 1354–1361. Lea, C., Flynn, M. D., Vidal, R., Reiter, A., and Hager, G. D. (2017). Temporal convolutional networks for action segmentation and detection. In CVPR, pages 156–165. Li, X. and Choo Chuah, M. (2017). Sbgar: Semantics based group activity recognition. In ICCV, pages 2876–2885. Ma, S., Sigal, L., and Sclaroff, S. (2016). Learning activity progression in lstms for activity detection and early detection. In CVPR, pages 1942–1950. Martinez, J., Black, M. J., and Romero, J. (2017). On human motion prediction using recurrent neural networks. In CVPR, pages 2891–2900. 51 Qi, M., Qin, J., Li, A., Wang, Y., Luo, J., and Van Gool, L. (2018). stagnet: An attentive semantic rnn for group activity recognition. In ECCV, pages 101–117. Ramanathan, V., Huang, J., Abu-El-Haija, S., Gorban, A., Murphy, K., and Fei-Fei, L. (2016). Detecting events and key actors in multi-person videos. In CVPR, pages 3043–3053. Ryoo, M. S. (2011). Human activity prediction: Early recognition of ongoing activities from streaming videos. In ICCV, pages 1036–1043. Sadegh Aliakbarian, M., Sadat Saleh, F., Salzmann, M., Fernando, B., Petersson, L., and Andersson, L. (2017). Encouraging lstms to anticipate actions very early. In ICCV, pages 280–289. Shi, Y., Fernando, B., and Hartley, R. (2018). Action anticipation with rbf kernelized feature mapping rnn. In ECCV, pages 301–317. Shu, T., Todorovic, S., and Zhu, S.-C. (2017). Cern: confidence-energy recurrent network for group activity recognition. In CVPR, pages 5523–5531. Soomro, K., Zamir, A. R., and Shah, M. (2012). Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826. Tang, Y., Wang, Z., Li, P., Lu, J., Yang, M., and Zhou, J. (2018). Mining semantics-preserving attention for group activity recognition. In Multimedia, pages 1283–1291. Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015). Learning spatiotemporal features with 3d convolutional networks. In ICCV. Vondrick, C., Pirsiavash, H., and Torralba, A. (2016). Anticipating visual representations from unlabeled video. In CVPR, pages 98–106. Wang, M., Ni, B., and Yang, X. (2017). Recurrent modeling of interaction context for collective activity recognition. In CVPR, pages 3048–3056. Wang, X., Hu, J.-F., Lai, J.-H., Zhang, J., and Zheng, W.-S. (2019). Progressive teacher-student learning for early action prediction. In CVPR, pages 3556–3565. Wang, Z., Shi, Q., Shen, C., and Van Den Hengel, A. (2013). Bilinear programming for human activity recognition with unknown mrf graphs. 
In CVPR, pages 1690–1697. Wichers, N., Villegas, R., Erhan, D., and Lee, H. (2018). Hierarchical long-term video prediction without supervision. pages 6038–6046. 52 Wu, J., Wang, L., Wang, L., Guo, J., and Wu, G. (2019). Learning actor relation graphs for group activity recognition. In CVPR, pages 9964–9974. Yan, R., Tang, J., Shu, X., Li, Z., and Tian, Q. (2018a). Participation-contributed temporal dynamic model for group activity recognition. In Multimedia, pages 1292–1300. Yan, S., Xiong, Y., and Lin, D. (2018b). Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI. Yan, Y., Ni, B., and Yang, X. (2017). Predicting human interaction via relative attention model. IJCAI, pages 3245–3251. Yao, T., Wang, M., Ni, B., Wei, H., and Yang, X. (2018). Multiple granularity group interaction prediction. In CVPR, pages 2246–2254. Zhao, H. and Wildes, R. P. (2019). Spatiotemporal feature residual propagation for action prediction. In ICCV, pages 7003–7012. 53 CHAPTER 4 GATED HISTORY UNIT WITH BACKGROUND SUPPRESSION FOR ONLINE ACTION DETECTION 4.1 Introduction Online action detection is the task to predict actions in a streaming video as they unfold De Geest et al. (2016). It is critical to applications including autonomous driving, public safety, virtual and augmented reality. Unlike action detection in the offline setting, where the entire untrimmed video is observable at any given moment, a major challenge for online action detection is that the predictions are solely based on observations of history without access to video frames in the future. The model needs to build a causal reasoning of the present in correlation to what happened hitherto, and as efficiently as possible for the online setting. Prior work for online action detection Xu et al. (2019); Eun et al. (2020, 2021); Gao et al. (2021); Qu et al. (2020); Zhao et al. (2020) includes recurrent-based LSTMs Hochreiter and Schmidhuber (1997) and GRUs Cho et al. (2014) that are prone to forgetting informative history as sequential frame processing is ineffective in preserving long-range interactions. Emerging methods Wang et al. (2021b); Xu et al. (2021a) employ transformers Vaswani et al. (2017) to mitigate this by encoding sequential frames in parallel via self-attention. Some improve model efficiency by using cross-attention Xu et al. (2021a); Jaegle et al. (2021a) to compress the video sequence into a fixed-sized latent encoding for prediction. Fig. 4.1 shows an example video stream (middle row) where the latest (current) frame contains Cliff Diving action. It is worth noting that, as commonly observed in video sequences, not every history frame is informative for current frame prediction (e.g., frames showing people cheering or camera panning in Fig. 4.1). Existing transformer-based approaches Xu et al. (2021a) use vanilla cross-attention to learn attention weights for history frames that determine their contribution to the current frame prediction. Such attention weights do not correlate with how informative each history frame is to current frame prediction. As shown in Fig. 4.1 (top row), when history frames are ordered from lower to higher cross-attention weights for vanilla cross-attention, frames that are 54 Figure 4.1 We show an example video stream (middle row) where the current frame (magenta) contains Cliff Diving action. 
Weights from vanilla cross-attention (top row) do not correlate with how informative each history frame is to current frame prediction, leading to an incorrect prediction of Background. Our novel Gated History Unit (GHU) (bottom row) calibrates cross-attention weights using gating scores to enhance history frames that are informative to current frame prediction (green) and suppress uninformative ones (red), leading to an accurate prediction of Cliff Diving.
informative for current frame prediction may have lower weights while uninformative frames may have higher weights, leading to incorrect current frame prediction. Another common challenge for existing methods is false positive prediction for background frames that closely resemble action frames (e.g., the pre-shot routine before a golf swing). Existing methods also do not leverage the fact that, although future frames are not available for current frame prediction, subsequently observed frames that are future to the history can be leveraged to enhance history encoding, which in turn improves current frame prediction.
To address the above limitations, we propose GateHUB, Gated History Unit with Background suppression. GateHUB comprises a novel Gated History Unit (GHU), a position-guided gated cross-attention module that enhances informative history while suppressing uninformative frames via gated cross-attention (as shown in Fig. 4.1, bottom row). GHU enables GateHUB to encode more informative history into the latent encoding to better predict the current frame. GHU combines the benefit of an LSTM-inspired gating mechanism to filter uninformative history with the transformer's ability to effectively learn from long sequences.
GateHUB leverages future frames for history by introducing Future-augmented History (FaH). FaH extracts features for a history frame using its future, i.e., the subsequently observed frames. This makes a history frame aware of its future and helps it to be more informative for current frame prediction. To tackle the common false positives in prior art, GateHUB proposes a novel background suppression objective that has different treatments for low-confident action and background predictions. These novel approaches enable GateHUB to outperform all existing methods on the common benchmark datasets THUMOS Idrees et al. (2017), TVSeries De Geest et al. (2016), and HDD Ramanishka et al. (2018a). Keeping model efficiency in mind for the online setting, we also validate that GateHUB is more efficient than the existing best method Xu et al. (2021a) while being more accurate. Moreover, our proposed optical flow-free variant is 2.8× faster than all existing methods that require both RGB and optical flow data, with higher or close accuracy.
To summarize, our main contributions are:
1. Gated History Unit (GHU), a novel position-guided gated cross-attention that explicitly enhances or suppresses parts of the video history as per how informative they are to predicting the action for the current frame.
2. Future-augmented History (FaH) to extract features for a history frame using its subsequently observed frames to enhance history encoding.
3.
A background suppression objective to mitigate the false positive prediction of background frames that closely resemble the action frames. 4. GateHUB is more accurate than all existing methods and is also more efficient than the existing best work. Moreover, our proposed optical flow-free model is 2.8× faster compared to all existing methods that require both RGB and optical flow information while achieving higher or close accuracy. 4.2 Related Work Online Action Detection. Previous methods for online action detection include use 3D Con- 56 vNet De Geest et al. (2016), reinforcement learning Gao et al. (2017a), recurrent networks Xu et al. (2019); Eun et al. (2020); Qu et al. (2020); Gao et al. (2021); Zhao et al. (2020); Qu et al. (2020) and more recently, transformers Wang et al. (2021b); Xu et al. (2021a). The primary challenge in leveraging history is that for long untrimmed videos, its length becomes intractably long over time. To make it computationally feasible, some Eun et al. (2020); Wang et al. (2021b); Gao et al. (2021); Qu et al. (2020) make the online prediction conditioned only on the most recent frames spanning less than a minute. This way the history beyond this duration that might be informative to current frame predictions is left unused. TRN Xu et al. (2019) mitigates this by the hidden state in LSTMs Hochreiter and Schmidhuber (1997) to memorize the entire history during inference. But LSTM limits its ability to model long-range temporal interactions. More recently, Xu et al. (2021a) proposes to scale transformers to the history spanning longer duration. However, not every history frame is informative and useful. Xu et al. (2021a) lacks the forgetting mechanism of LSTM to filter uninformative history which causes it to encode uninformative history into the encoding leading to incorrect predictions. Our Gated History Unit (GHU) and Future-augmented History (FaH) combine the benefits of LSTM’s selective encoding and transformer’s long-range modeling to leverage long-duration history more informatively to outperform all previous methods. Transformers for Video Understanding. Transformers can achieve superior performance on video understanding tasks by effectively modeling the spatiotemporal context via attention. Most of the previous transformer-based methods Bertasius et al. (2021); Arnab et al. (2021); Fan et al. (2021); Neimark et al. (2021) focus on action recognition in trimmed videos Carreira and Zis- serman (2017) (videos spanning few seconds) due to the quadratic complexity w.r.t. video length. Untrimmed videos have a longer duration from a few minutes to hours and contain frames with irrelevant actions (labeled as background). Temporal action localization (TAL) Shou et al. (2016); Xu et al. (2017); Gao et al. (2017b); Shou et al. (2017); Zhao et al. (2017); Buch et al. (2017); Liu et al. (2019); Lin et al. (2019); Zhu et al. (2021); Zhang et al. (2021) and temporal action proposal generation (TAP) Lin et al. (2018, 2019); Tan et al. (2021) are two fundamental tasks in untrimmed video understanding. AGTNawhal and Mori (2021) proposes activity graph transformer for TAL 57 based on DETR Carion et al. (2020). TAPGWang et al. (2021a) applies transformer to predict the activity boundary for TAP. However, unlike TAL and TAP which are both offline tasks having access to the entire video, online action detection does not have access to the future and requires causal understanding from history to present. 
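To illustrate what this causal constraint looks like in practice, the short sketch below (our own illustration, not code from this dissertation) masks out future positions in an attention-score matrix so that each frame attends only to frames observed up to and including itself:

```python
import torch

def causal_attention_scores(scores: torch.Tensor) -> torch.Tensor:
    """Mask attention scores so frame t only attends to frames <= t.

    scores: (T, T) raw attention logits for T streamed frames,
    where row t holds the scores of frame t over all frames.
    """
    T = scores.size(-1)
    # True above the diagonal, i.e., positions in the (unobserved) future.
    future = torch.triu(torch.ones(T, T), diagonal=1).bool()
    # Setting future positions to -inf zeroes them out after the softmax.
    masked = scores.masked_fill(future, float("-inf"))
    return torch.softmax(masked, dim=-1)

# Example: 4 streamed frames; row 0 attends only to frame 0, row 3 to frames 0-3.
attn = causal_attention_scores(torch.randn(4, 4))
```

The Present Decoder described later in Sec. 4.3.3 applies the same kind of mask in its first self-attention layer.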
We follow the existing transformer-based streaming tasksGirdhar and Grauman (2021); Chen et al. (2021); Xu et al. (2021a) and apply a causal mask to address online action detection. Long Sequence Modeling. To model long input sequences, recent work Dosovitskiy et al. (2021) proposes to reduce complexity by factorizing Touvron et al. (2021) or subsampling the inputs Chen et al. (2020). Another group of work focuses on modifying the internal dense self-attention module to boost the efficiency Beltagy et al. (2020); Wang et al. (2020). More recently, Perceiver Jaegle et al. (2021b) and PerceiverIO Jaegle et al. (2021a) propose to cross-attend long-range inputs to a small fixed-sized latent encoding, adding further flexibility in terms of input and reducing the computational complexity. However, unlike our GHU, PerceiverIO lacks an explicit mechanism to enhance/suppress history frames making it sub-optimal for online action detection. Our method uses LSTM-inspired gating to calibrate cross-attention to enhance/suppress history frames per their informative-ness while employing transformers to learn from long history sequences effectively. 4.3 Methodology Given a streaming video sequence h = [ℎ𝑡]0 𝑡=−𝑇+1, our task is to identify if and what action 𝑦0 ∈ {0, 1, ..., 𝐶} occurs at the current frame ℎ0. We have a total of 𝐶 action classes and label 0 for background frames with no action. Since future frames ℎ1, ℎ2, ..., are NOT accessible, the model makes the 𝐶 + 1-way prediction for the current frame based on the recent 𝑇 frames, [ℎ𝑡]0 𝑡=−𝑇+1, observed up until the current frame. While 𝑇 may be large in an untrimmed video stream, as shown in the top row of Fig. 4.1, all frames observed in past history [ℎ𝑡]−1 𝑡=−𝑇+1 may not be equally informative to the prediction for the current frame. 4.3.1 Gated History Unit based History Encoder To make the 𝐶 + 1-way prediction accurately for current frame ℎ0 based on 𝑇 history frames, h = [ℎ𝑡]0 𝑡=−𝑇+1, we employ transformers to first encode the video sequence history and then 58 Figure 4.2 Model Overview. GateHUB comprises a novel Gated History Unit (GHU) (a) as part of History Encoder (b) to explicitly enhance or suppress history frames, i.e., streaming video frames observed so far, as per how informative they are to current frame prediction. GHU encodes them by cross-attending with a latent encoding (Q). GateHUB uses Future-augmented History features (FaH) (d) to encode each history frame using 𝑡 𝑓 subsequently observed future frames. The Present Decoder (c) correlates with history by cross-attending the encoded history with the present, i.e., , a small set of most recent frames, to make current frame prediction. We subject the prediction to a background suppression loss (e) to reduce false positives by effectively separating action frames from closely resembling background frames. associate the current frame with the encoding for prediction. Inspired by the recently introduced PerceiverIO Jaegle et al. (2021a), our method consists of a History Encoder (Fig. 4.2b) that uses cross-attention to project the variable length history to a fixed-length learned latent encoding. Using cross-attention is more efficient than using self-attention because its computational complexity is quadratic w.r.t. latent encoding size instead of the video sequence length which is typically orders of magnitude larger. This is crucial to developing a model for the online setting. However, as shown in Fig. 4.1, vanilla cross-attention, as used in PerceiverIO and LSTR Xu et al. 
(2021a), fails to learn attention weights for history frames that correlate with how informative each history frame is for $h_0$ prediction. We therefore introduce a novel Gated History Unit (GHU) (Fig. 4.2a) that has a position-guided gated cross-attention mechanism which learns a set of gating scores $G$ to calibrate the attention weights to effectively enhance or suppress history frames based on how informative they are to current frame prediction.
Specifically, given $\mathbf{h} = [h_t]_{t=-T+1}^{0}$ as the streaming sequence of $T$ history frames ending at current frame $h_0$, we encode $\mathbf{h}$ with a feature extraction backbone, $u$, followed by a linear encoding layer $E$. We then subject the output to a learnable position encoding, $E_{pos}$, relative to the current frame, $h_0$, to give $\mathbf{z}_h = u(\mathbf{h})E + E_{pos}$, where $u(\mathbf{h}) \in \mathbb{R}^{T \times M}$, $E \in \mathbb{R}^{M \times D}$, $\mathbf{z}_h \in \mathbb{R}^{T \times D}$ and $E_{pos} \in \mathbb{R}^{T \times D}$. $M$ and $D$ denote the dimensions of extracted features and post-linear encoding features, respectively. We also define a learnable latent query encoding, $\mathbf{q} \in \mathbb{R}^{L \times D}$, that we cross-attend with $\mathbf{h}$. Following the standard multi-headed cross-attention setup Jaegle et al. (2021b,a), let $N_{heads}$ be the number of heads in GHU such that $Q_i = \mathbf{q}W^{q}_i$, $K_i = \mathbf{z}_h W^{k}_i$, $V_i = \mathbf{z}_h W^{v}_i$ be the queries, keys and values, respectively, for each head $i \in \{1, \ldots, N_{heads}\}$ (Fig. 4.2a), where the projection matrices $W^{q}_i, W^{k}_i \in \mathbb{R}^{D \times d_k}$ and $W^{v}_i \in \mathbb{R}^{D \times d_v}$. We assign $d_k = d_v = D/N_{heads}$ in our setup Vaswani et al. (2017). Next, we obtain the position-guided gating scores, $G$, for $\mathbf{h}$ as,
$$\mathbf{z}_g = \sigma(\mathbf{z}_h W_g) \qquad (4.1)$$
$$G = \log(\mathbf{z}_g) + \mathbf{z}_g \qquad (4.2)$$
where $W_g \in \mathbb{R}^{D \times 1}$ is the matrix projecting each history frame to a scalar, $\mathbf{z}_g \in \mathbb{R}^{T \times 1}$ is a sequence of scalars for the history frames $\mathbf{h}$ after applying the sigmoid $\sigma$, and $G \in \mathbb{R}^{T \times 1}$ is the gating score sequence for history frames in GHU. By using $\mathbf{z}_h$, which already contains the position encoding, the gates are guided by the relative position of the history frame to the current frame $h_0$. As also shown in Fig. 4.2a, we now compute the gated cross-attention for each head, $GHU_i$, as,
$$GHU_i = \mathrm{Softmax}\!\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}} + G\right) V_i \qquad (4.3)$$
and the multi-headed gated cross-attention defined as,
$$\mathrm{MultiHeadGHU}(Q, K, V, G) = \mathrm{Concat}\big([GHU_i]_{i=1}^{N_{heads}}\big) W_o \qquad (4.4)$$
where $W_o \in \mathbb{R}^{D \times D}$ re-projects the attention output to $D$ dimensions. It is possible to define $G$ separately for each head, but in our method we find sharing $G$ across all heads to perform better (Sec. 4.4.4). From Eqns. 4.1 and 4.2, we can observe that each scalar in $\mathbf{z}_g$ lies in $[0, 1]$ due to the sigmoid, which implies that each gating score in $G$ lies in $(-\infty, 1]$. This enables the softmax function in Eqn. 4.3 to calibrate the attention weight for each history frame by a factor in $[0, e]$, such that a factor in $[0, 1)$ suppresses a given history frame and a factor in $(1, e]$ enhances a given history frame. This provides an explicit ability to GHU to learn to calibrate the attention weight of a history frame based on how informative the history frame is for prediction of $h_0$. Unlike previous methods with relative position bias Liu et al. (2021); Dai et al. (2019), $G$ is input-dependent and learns based on the history frame and its position w.r.t. the current frame.
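A minimal single-head sketch of this gated cross-attention (Eqns. 4.1–4.3) is given below. It is an illustration under the stated shapes, not the dissertation's implementation; the module and variable names are ours.

```python
import math
import torch
import torch.nn as nn

class GatedCrossAttentionHead(nn.Module):
    """One head of GHU-style gated cross-attention (cf. Eqns. 4.1-4.3)."""

    def __init__(self, dim: int, head_dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, head_dim, bias=False)
        self.w_k = nn.Linear(dim, head_dim, bias=False)
        self.w_v = nn.Linear(dim, head_dim, bias=False)
        self.w_g = nn.Linear(dim, 1, bias=False)  # projects each history frame to a scalar gate

    def forward(self, latent_q: torch.Tensor, z_h: torch.Tensor) -> torch.Tensor:
        # latent_q: (L, D) learnable latent queries; z_h: (T, D) position-encoded history.
        q, k, v = self.w_q(latent_q), self.w_k(z_h), self.w_v(z_h)
        z_g = torch.sigmoid(self.w_g(z_h))           # (T, 1), Eqn. 4.1
        gate = torch.log(z_g) + z_g                  # (T, 1), values in (-inf, 1], Eqn. 4.2
        scores = q @ k.t() / math.sqrt(k.size(-1))   # (L, T)
        # Adding the gate before the softmax rescales each frame's weight by a factor in [0, e].
        attn = torch.softmax(scores + gate.t(), dim=-1)
        return attn @ v                              # (L, head_dim)

# Usage: 16 latent slots attending over 1024 history frames of dimension 1024.
head = GatedCrossAttentionHead(dim=1024, head_dim=64)
out = head(torch.randn(16, 1024), torch.randn(1024, 1024))
```

Note that GateHUB shares the gate across all heads rather than learning one per head, which Sec. 4.4.4 reports to work better.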
The input-dependent gating thus enables GHU to assess how informative each history frame is based on both its feature representation and its relative position from the current frame $h_0$. We feed the output of GHU to a series of $N$ self-attention layers to obtain the final history encoding (Fig. 4.2b).
4.3.2 Hindsight is 2020: Future-augmented History
Existing methods Wang et al. (2021b); Xu et al. (2019, 2021a); Gao et al. (2021); Eun et al. (2020) extract features for each frame by feed-forwarding the frame and, optionally, a small set of past consecutive frames through pretrained networks like TSN Wang et al. (2016) and I3D Carreira and Zisserman (2017). It is worth noting that although the future of the current frame is not available, the future of each history frame is accessible, and this hindsight can potentially improve the encoding of history for current frame prediction. Existing methods do not have a mechanism to leverage this. This inspires us to propose a novel feature extraction scheme, Future-augmented History (FaH), which aggregates observed future information into the features of a history frame to make it aware of its so-far-observable future.
Fig. 4.2d illustrates the FaH feature extraction process. For a history frame $h_t$ and a feature extraction backbone $u$, when $t_f$ future frames of $h_t$ have already been observed, FaH extracts features for $h_t$ using the set of frames $[h_i]_{i=t}^{t+t_f}$ (i.e., the history frame itself and its subsequently observed $t_f$ future frames). Otherwise, FaH extracts features for $h_t$ using the set of frames $[h_i]_{i=t-t_{ps}}^{t}$ (i.e., the history frame itself and its past $t_{ps}$ frames),
$$u(h_t) = \begin{cases} u\big([h_i]_{i=t-t_{ps}}^{t}\big) & \text{if } t > -t_f \\ u\big([h_i]_{i=t}^{t+t_f}\big) & \text{if } t \le -t_f \end{cases} \qquad (4.5)$$
At each new time step, with one more frame observed, FaH feed-forwards through $u$ twice to extract features for (1) the new frame using the frames $[h_i]_{i=-t_{ps}}^{0}$ and (2) $h_{-t_f}$, which is now eligible to aggregate future information, using the frames $[h_i]_{i=-t_f}^{0}$ (shown as the purple and green cuboids in Fig. 4.2d, respectively). Thus, FaH has the same time complexity as existing feature extraction methods. FaH does not trivially incorporate all available subsequently observed frames. Instead, it encodes only the set of future frames that are the most relevant to a history frame (as we empirically explain in Sec. 4.4.4).
4.3.3 Present Decoder
In order to correlate the present with history to make the current frame prediction, we sample a subset of the $t_{pr}$ most recent history frames $[h_t]_{t=-t_{pr}+1}^{0}$ to model the present (i.e., the most immediate context) for $h_0$ using the Present Decoder (Fig. 4.2c). After extracting the features via FaH, we apply a learnable position encoding, $E^{pr}_{pos}$, to each of the $t_{pr}$ frame features and subject them to a multi-headed self-attention with a causal mask. The causal mask ensures that a given frame is influenced only by the frames preceding it. We then cross-attend the output from the self-attention with the history encoding from the History Encoder. Inspired by Perceiver Jaegle et al. (2021b), we repeat this process twice, and the self-attention does not need a causal mask the second time. Finally, we feed the output corresponding to each of the $t_{pr}$ frames to the classifier layer for prediction.
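Before turning to the training objective, the window selection of Eq. 4.5 can be made concrete. The sketch below is a simplified illustration with hypothetical names (`frames`, `backbone`, `t_f`, `t_ps`); it is not the dissertation's code.

```python
def fah_feature(frames, t, t_f, t_ps, backbone):
    """Future-augmented History feature for history frame h_t (cf. Eq. 4.5).

    frames: observed stream, where frames[-1] is the current frame h_0 and
            frames[0] is the oldest frame h_{-T+1}; t is a non-positive offset.
    backbone: feature extractor u(.) applied to a short list of frames.
    """
    T = len(frames)
    idx = lambda j: j + T - 1                 # map offset j (j <= 0) to a list index
    if t <= -t_f:
        # Enough of h_t's future is already observed: use [h_t, ..., h_{t+t_f}].
        window = [frames[idx(j)] for j in range(t, t + t_f + 1)]
    else:
        # h_t is too recent: fall back to its past [h_{t-t_ps}, ..., h_t],
        # clamped to the start of the observed stream.
        lo = max(t - t_ps, -(T - 1))
        window = [frames[idx(j)] for j in range(lo, t + 1)]
    return backbone(window)

# At each new time step, features are extracted twice: once for the new frame h_0
# from its past, and once for h_{-t_f}, which just became eligible to use its
# observed future (Sec. 4.3.2).
```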
4.3.4 Background Suppression Objective
Existing online action detection methods Wang et al. (2021b); Xu et al. (2019, 2021a); Gao et al. (2021); Eun et al. (2020) apply a standard cross-entropy loss for the $C+1$-way multi-label per-frame prediction. Standard cross-entropy loss does not consider that the “no action” background class does not belong to any specific action distribution and is semantically different from the $C$ action classes. This is because background frames can be anything from completely blank frames at the beginning of a video to frames that closely resemble action frames without actually being action frames (e.g., aiming before making a billiards shot). The latter is a common cause of false positives in online action detection. In addition to the complex distribution of background frames, untrimmed videos suffer from a sharp data imbalance where background frames significantly outnumber action frames.
To tackle these challenges, we design a novel background suppression objective that applies separate emphasis on low-confident action and background predictions during training to increase the margin between action and background frames (Fig. 4.2e). Inspired by focal loss Lin et al. (2017), our objective function, $\mathcal{L}_t$, for frame $h_t$ is defined as,
$$\mathcal{L}_t = \begin{cases} -\,y_t^{0}\,(1-p_t^{0})^{\gamma_b}\,\log(p_t^{0}) & \text{if } y_t^{0} = 1 \\ -\sum_{i=1}^{C} y_t^{i}\,(1-p_t^{i})^{\gamma_a}\,\log(p_t^{i}) & \text{otherwise} \end{cases} \qquad (4.6)$$
where $\gamma_a, \gamma_b > 0$ enable low-confident samples to contribute more to the overall loss, forcing the model to put more emphasis on correctly predicting these samples. Unlike the original focal loss Lin et al. (2017), our background suppression objective specializes for online action detection by applying separate $\gamma$ values to the action classes and the background. This separation is necessary to distinguish the action classes, which have a more constrained distribution, from the background class, whose distribution is more complex and unconstrained. Our objective is the first attempt in online action detection to put separate emphasis on low-confident hard action and background predictions.
4.3.5 Flow-free Online Action Detection
Existing methods Xu et al. (2019); Wang et al. (2021b); Eun et al. (2020) for online action detection use optical flow in addition to RGB to capture fine-grained motion among frames. Computing optical flow takes much more time than feature extraction or model inference and can be unrealistic for time-critical applications such as autonomous driving. This motivates us to develop an optical flow-free version of GateHUB that achieves higher or close accuracy compared to existing methods without time-consuming optical flow estimation. To capture motion without optical flow using only RGB frames, we leverage multiple temporal resolutions using a spatiotemporal backbone such as TimeSformer Bertasius et al. (2021). We extract two feature vectors for a frame $h_t$ by encoding one frame sequence sampled at a higher frame rate spanning a smaller time duration and another frame sequence sampled at a lower frame rate spanning a larger time duration. Similar to the setup using RGB and optical flow features, we concatenate the two feature vectors before feeding them to GateHUB.
4.4 Experiments
4.4.1 Datasets
Following existing online action detection work Wang et al. (2021b); Xu et al. (2019); Eun et al. (2020); Gao et al. (2017a); Xu et al. (2021a), we evaluate GateHUB on three common benchmark datasets: THUMOS'14, TVSeries, and HDD. THUMOS'14 Idrees et al. (2017) consists of over 20 hours of sports video and is annotated with 20 actions. We follow prior work Wang et al. (2021b); Xu et al. (2019) and train on the validation set (200 untrimmed videos) and evaluate on the test set (213 untrimmed videos). TVSeries De Geest et al.
(2016) includes 27 episodes of 6 popular TV shows with a total duration of 16 hours. It is annotated with 30 real-world everyday actions, e.g., open door, run, drink. HDD (Honda Research Institute Driving Dataset) Ramanishka et al. (2018b) includes 137 driving videos with a total duration of 104 hours. Following prior work Wang et al. (2021b), we use the vehicle sensor as input signal and divide data into 100 sessions for training and 37 sessions for testing. 4.4.2 Implementation Details For TVSeries and THUMOS’14, following Wang et al. (2021b); Xu et al. (2019); Eun et al. (2020); Gao et al. (2017a); Xu et al. (2021a), we resample the videos at 24 FPS (frames per second) and then extract frames at 4 FPS for training and evaluation. The sizes of history and present are set to 1024 and 8 most recently observed frames, respectively, spanning durations of 256s and 2s correspondingly at 4 FPS. For HDD, following OadTR Wang et al. (2021b), we extract the sensor data at 3 FPS for training and evaluation. The sizes of history and present are 48 and 6 most recently observed frames respectively, spanning durations of 16s and 2s correspondingly at 3 FPS. Feature Extraction. Following Xu et al. (2021a); Wang et al. (2021b), we use mmaction2 Con- tributors (2020)-based two-stream TSN Wang et al. (2016) pretrained on Kinetics-400 Carreira and Zisserman (2017) to extract frame-level RGB and optical flow features for THUMOS’14 and TVSeries. We concatenate the RGB and optical flow features along channel dimension before feeding to the linear encoding layer in GateHUB. For HDD, we directly feed the sensor data as input to GateHUB. To fully leverage the proposed FaH, the feature extraction backbone needs to support multi-frame input. Since TSN only supports single-frame input, we explore spatiotemporal TimeSformer Bertasius et al. (2021) (pretrained on Kinetics-600 using 96 × 4 frame sampling) that 64 supports multiple-frame input. We set the time duration for past 𝑡 𝑝𝑠 and future 𝑡 𝑓 frames under FaH to be 1s and 2s respectively. We use TimeSformer to extract RGB features and use TSN-based optical flow features as TimeSformer only supports RGB. For our flow-free version, we replace optical flow features with features obtained from an additional multi-frame input of RGB frames uniformly sampled from a duration of 2s. Training. We train GateHUB for 10 epochs using Adam optimizer Kingma and Ba (2015), weight decay of 5𝑒−5, batch size of 50, OneCycleLR learning rate schedule of PyTorch Paszke et al. (2017) with pct_start of 0.25, 𝐷 = 1024, latent encoding size 𝐿 = 16, number of self-attention layers in History Decoder 𝑁 = 2, 𝑁ℎ𝑒𝑎𝑑𝑠 = 16 for each attention layer and 𝛾𝑎 = 0.6, 𝛾𝑏 = 0.2 for background suppression. Evaluation Metrics We follow the protocol of per-frame mean average precision (mAP) for THUMOS and HDD and calibrated average precision (mcAP) De Geest et al. (2016) for TVSeries. 4.4.3 Comparison with State-of-the-Art Method FATS Kim et al. (2021) IDN Eun et al. (2020) TRN Xu et al. (2019) PKD Zhao et al. (2020) OadTR Wang et al. (2021b) WOAD Gao et al. (2021) LSTR Xu et al. (2021a) GateHUB (Ours) TRN Xu et al. (2019) OadTR Wang et al. (2021b) LSTR Xu et al. (2021a) GateHUB (Ours) Feature Backbone THUMOS14 RGB Optical Flow mAP (%) TSN TSN TimeSformer TSN 59.0 60.3 62.1 64.5 65.2 67.1 69.5 70.7 68.5 65.5 69.6 72.5 Table 4.1 Online action detection results on THUMOS’14 comparing GateHUB with SoTA methods on mAP (%) when the RGB-based features are extracted from either TSN or TimeSformer. 
Optical flow-based features are extracted from TSN in all settings. Table 4.1 compares GateHUB with existing state-of-the-art (SoTA) online action detection methods on THUMOS’14 for two different setups, one using RGB features from TSN Wang et al. (2016) and the other using RGB features from TimeSformer Bertasius et al. (2021). Both setups use optical flow features from TSN. WOAD Gao et al. (2021) uses RGB features from 65 I3D (equivalent to TSN). For TSN RGB features, all mAP in Table 4.1 are as reported in the references. For TimeSformer RGB features, we use the official code for TRN, OadTR and LSTR for fair comparison. From the table, we can observe that GateHUB outperforms all existing methods by at least 1.2% when using RGB features from TSN. Moreover, GateHUB outperforms existing methods by a larger margin of at least 2.9% using RGB features from TimeSformer. GateHUB is also the first approach to surpass 70% on THUMOS’14 benchmark. This validates that GateHUB, comprising GHU, Background Suppression and FaH to holistically leverage the long history more informatively, outperforms all SoTA on THUMOS’14. We further compare GateHUB with SoTA on TVSeries and HDD in Table 4.2a and 4.2b, respectively. Following protocol, we use RGB and optical flow features from TSN for TVSeries and sensor data for HDD. All results from SoTA are as reported in the references. We can observe that GateHUB outperforms all SoTA on both TVSeries and HDD. The large improvement on HDD using sensor data validates that GateHUB is also effective on data modalities other than RGB or optical flow. Method FATS Kim et al. (2021) IDN Eun et al. (2020) TRN Xu et al. (2019) PKD Zhao et al. (2020) OadTR Wang et al. (2021b) LSTR Xu et al. (2021a) GateHUB (Ours) (a) mcAP (%) 84.6 86.1 86.2 86.4 87.2 89.1 89.6 Method CNN De Geest et al. (2016) LSTM Ramanishka et al. (2018a) RED Gao et al. (2017a) TRN Xu et al. (2019) OadTR Wang et al. (2021b) GateHUB (Ours) mAP (%) 22.7 23.8 27.4 29.2 29.8 32.1 (b) Table 4.2 Online action detection results comparing GateHUB with state-of-the-art methods on (a) TVSeries using RGB + Optical Flow data as input on mcAP metric and (b) HDD using sensor data as input on mAP metric. 4.4.4 GateHUB: Ablation Study In this section, we conduct an ablation study to highlight the impacts of the novel components of GateHUB. Unless stated otherwise, all experiments are on THUMOS’14 using RGB and optical flow features from TSN. Impact of Gated History Unit (GHU). We conduct an experiment where we test different variants of our Gated History Unit (GHU) by removing one or more of its design elements. 66 Method w/ GHU (Ours) w/o GHU w/ GHU suppress only w/ GHU enhance only w/ GHU w/o position-guidance w/ GHU per head mAP (%) 70.7 69.6 70.5 70.5 70.3 68.0 mAP (%) Method Ours 𝛾𝑎 > 𝛾𝑏 70.7 70.2 Ours 𝛾𝑎 < 𝛾𝑏 69.9 w/ cross-entropy w/ standard focal loss 70.2 Method w/o FaH w/ FaH (a) (b) Future Duration mAP (%) - 0.5 1s 2s 4s (c) 71.5 71.1 72.0 72.5 71.4 Table 4.3 Ablation study comparing different variants of (a) Gated History Unit (GHU), (b) back- ground suppression objective and (c) Future-augmented History (FAH). Ablation in (a) and (b) is conducted with RGB features from TSN and in (c) are conducted with RGB features from TimeS- former. Optical flow features from TSN are used in all settings. Table 4.3a summarizes the results of this experiment. In the table, ‘w/o GHU’ refers to replacing GHU with vanilla cross-attention from Perceiver IO Jaegle et al. (2021a) and LSTR Xu et al. 
(2021a), i.e., , CrossAttention(𝑄, 𝐾, 𝑉) = SoftMax(𝑄𝐾 ⊺/ √ 𝑑). In ‘w/ GHU enhance only’, we remove log(zg) from Eqn. 4.2 that suppresses history frames, i.e., 𝐺 = zg. Conversely, in ‘w/ GHU suppress only’, we remove zg from Eqn. 4.2 that enhances history frames, i.e., 𝐺 = log(zg). In ‘w/ GHU w/o position guidance’, we operate on frame features before subjecting them to learned position encoding, i.e., 𝐺 = log(z˜g) + z˜g where z˜g = 𝑞(h)E. We also compare with ‘w/ GHU per head’ where G is learned separately for each cross-attention head. Table 4.3a shows that our implementation of GHU significantly outperforms all other variants of GHU and cross-attention. We can observe that ‘w/o GHU’ performs 1.1% worse than ‘w/ GHU’. This is because, without explicit gating, vanilla cross-attention fails to learn attention weights for history frames that correlate with how informative history frames are to current frame prediction (also depicted in Figure 4.1). Moreover, the lower performances of ‘w/ GHU suppress only’ and ‘w/ GHU enhance only’ validate that we need to both enhance the informative history frames and suppress the uninformative ones to achieve the best performance. Without the ability to both enhance and suppress, the model may encode uninformative history frames into the latent encoding or inadequately emphasize the informative ones, leading to worse performance. The performance is also lower when using history frame features without position encoding (‘w/ GHU w/o position guidance’). This is because without position guidance, the model cannot assess the relative position of a particular history frame w.r.t. the current frame which is an important factor 67 in deciding how informative a history frame is to current frame prediction. We also find having separate G per head (‘w/ GHU per head) performs much worse than sharing G across heads due to overfitting from 𝑁ℎ𝑒𝑎𝑑𝑠 times more parameters. Impact of Background Suppression. We compare our background suppression objective with standard cross-entropy loss (i.e., , 𝛾𝑎 = 𝛾𝑏 = 0) and standard focal loss(i.e., , 𝛾𝑎 = 𝛾𝑏 ≠ 0) Lin et al. (2017) as shown in Table 4.3b. First, compared to our background suppression objective, both standard cross-entropy and focal loss achieve lower accuracy. This validates that it is important to put separate emphasis on the low-confident action and background predictions to effectively differentiate action frames and closely resembling background frames. Furthermore, we find that across different combinations of 𝛾𝑎 and 𝛾𝑏, choosing a pair where 𝛾𝑎 > 𝛾𝑏 leads to higher accuracy. Specifically, we find 𝛾𝑎 = 0.05 and 𝛾𝑏 = 0.025 to give the highest accuracy. This can be attributed to the high data imbalance. Action frames are much lower in number than background frames and therefore require a stronger emphasis than the background. Impact of Future-augmented History (FaH). Table 4.3c shows the ablation on FaH. Since the TSN backbone is not compatible with multi-frame input, we conduct this study using RGB features from TimeSformer. The table shows that with 2s of future information incorporated into history features, we achieve the best accuracy which is 1% higher than without future-augmented history (‘w/o FaH’). The accuracy is also improved with 1s of future information incorporated into history features. We further observe that the accuracy drops when future duration is much longer e.g., 4s or much shorter e.g., 0.5s. 
This shows that making a history frame aware of its future enables it to be more informative for current frame prediction. At the same time, future duration up to a certain extent (in our case, 2s) can encode meaningful future into history frames. Much beyond that, the future changes enough to be of little use for a given history frame, while much shorter future duration may also add noise rather than information. We wish to emphasize that all future duration are bound by the frames observed so far and do not extend into inaccessible future frames. GateHUB Present Decoder. Table 4.4 shows the ablation study on our Present Decoder by 68 Method Ours w/o self-attention w/ cross-attention only at layer 1 w/ disjoint history and present mAP (%) 70.7 67.7 68.6 69.4 Table 4.4 Ablation study for Present Decoder by altering different aspects of the design. altering different aspects of the design. Unlike the original PerceiverIO Jaegle et al. (2021a), where the output queries are independent, we model the present (equivalent of output queries in our method) via a causal self-attention and cross-attend it with history encoding multiple times (inspired by Perceiver Jaegle et al. (2021b)). We can observe in Table 4.4 that treating present frames independently (‘i.e., w/o self-attention’) and having only one cross-attention (‘i.e., w/ cross- attention only at first layer’) both reduce the accuracy considerably. Unlike LSTR Xu et al. (2021a) that uses a FIFO queue with disjoint long-term and short-term memory, in our design, the sequences of history and present frames fully overlap. Table 4.4 shows that having disjoint history and present frames (i.e., , ‘w/ disjoint history and present’) leads to a 1.3% lower performance, further validating our design of Present Decoder and GateHUB overall. 4.4.5 GateHUB Efficiency For online action detection setting, model efficiency is an important metric. We compare GateHUB with existing methods w.r.t. parameter count, GFLOPs, and inference speed in terms of FPS as shown in Table 4.5. We first observe that GateHUB achieves the highest accuracy with the least number of model parameters compared to all existing methods. We also note that while methods like OadTR Wang et al. (2021b) and TRN Xu et al. (2019) are more efficient in terms of GFLOPs, their accuracy is much lower. GateHUB achieve a more favorable accuracy- efficiency trade-off with fewer GFLOPs than the existing best method LSTR Xu et al. (2021a) while obtaining a higher accuracy. All aforementioned methods require optical flow computation which is time-consuming, therefore the inference speed of these methods is governed by the optical flow computation speed of 8.1 FPS. Meanwhile, our flow-free model obviates optical flow computation by using RGB features from TimeSformer at two different frame rates and attains higher or close accuracy compared to existing work at 2.8× faster inference speed. When compared with flow-free 69 Method TRN Xu et al. (2021b) OadTR Wang et al. (2021b) LSTR Xu et al. (2021a)(Flow-free) LSTR Xu et al. 
(2021a) Ours (Flow-free) Ours Model Inference Speed (FPS) GFLOPs Parameter Count 402.9M 1.46 2.54 75.8M 6.33 54.2M 7.53 58.0M 3.47 41.8M 6.98 45.2M Optical Flow Computation 8.1 8.1 - 8.1 - 8.1 RGB Feature Extraction 70.5 70.5 22.7 70.5 22.7 70.5 Flow Feature Extraction 14.6 14.6 - 14.6 - 14.6 Model Overall 123.3 110.0 99.2 91.6 83.3 71.2 8.1 8.1 22.7 8.1 22.7 8.1 mAP(%) 62.1 65.2 63.5 69.5 66.5 70.7 Table 4.5 Efficiency comparison of GateHUB using RGB and optical flow features and our optical flow-free version with existing methods. GateHUB using RGB and optical flow has the least parameter count compared to existing methods, and higher accuracy and lower GFLOPs than the existing best method. Moreover, our flow-free version attains higher or close accuracy compared to existing methods that require RGB and optical flow features at 2.8× faster inference speed. LSTR, GateHUB achieves 3% higher mAP, thus providing a significantly better speed-accuracy tradeoff than the existing best method. 4.4.6 Qualitative Evaluation Gated History Unit (GHU). We qualitatively assess the effect of GHU by visualizing examples of the most suppressed and most enhanced history frames in a streaming video when ordered as per the gating scores 𝐺 learned by GHU in Eqn. 4.2. Fig. 4.3 shows examples from three videos where frames in the same row belong to the same video. From the figure, we can observe that GHU learns to suppress frames that exhibit no discernible action from the 𝐶 action classes. The suppressed frames either have people arbitrarily moving or are uninformative background frames (e.g., crowd cheering) that convey no useful information to predict action for the current frame. On the other hand, GHU learns to maximize emphasis on history frames with action from the 𝐶 classes and on background frames that provide meaningful context to determine the current frame action (e.g., long jump athlete running toward the pit). Current Frame Prediction. We visualize GateHUB’s current frame prediction in Fig. 4.4. The confidence in the range [0, 1] on y-axis denotes the probability of predicting the correct action (i.e., High Jump in Fig. 4.4). We can observe that GateHUB with GHU (red) is effective in reducing false positives for background frames that closely resemble action frames compared to without GHU (orange). 70 Figure 4.3 Examples of the most suppressed and most enhanced history frames as per the gating score learned by GHU. Frames in the same row belong to the same video. Figure 4.4 Visualization of GateHUB’s online prediction. The curves indicate the predicted confidence of the ground-truth class (High Jump) using TSN backbone with and without GHU. 4.5 Summary We present GateHUB for online action detection in untrimmed streaming videos. It consists of novel designs including Gated History Unit (GHU), Future-augmented History (FaH), and a background suppression loss to more informatively leverage history and reduce false positives for current frame prediction. GateHUB achieves higher accuracy than all existing methods for online action detection, and is more efficient than the existing best method. Moreover, its optical flow-free variant is 2.8× faster than previous methods that require both RGB and optical flow while obtaining higher or close accuracy. While GateHUB outperforms all existing methods, there is ample room for improvement. Although GateHUB can leverage long history, the length is still finite and may not be adequate when actions occur infrequently over long duration. 
It would be worthwhile to investigate ways to leverage history sequences of any length. Another challenge is slow motion action which is uncommon and can have considerably different temporal distribution, making it difficult to predict as accurately as common actions. 71 Enhanced History FramesSuppressed History FramesHighJumpBackgroundBackgroundw/o GHU TSNw/ GHU TSN BIBLIOGRAPHY Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., and Schmid, C. (2021). Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6836–6846. Beltagy, I., Peters, M. E., and Cohan, A. (2020). Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150. Bertasius, G., Wang, H., and Torresani, L. (2021). Is space-time attention all you need for video understanding? Buch, S., Escorcia, V., Shen, C., Ghanem, B., and Niebles, J. C. (2017). SST: Single-stream temporal action proposals. In CVPR. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). End-to- end object detection with transformers. In ECCV. Carreira, J. and Zisserman, A. (2017). Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR. Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., and Sutskever, I. (2020). Generative pretraining from pixels. In ICML. Chen, X., Wu, Y., Wang, Z., Liu, S., and Li, J. (2021). Developing real-time streaming trans- former transducer for speech recognition on large-scale dataset. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using rnn encoder-decoder for statistical machine translation. Contributors, M. (2020). Openmmlab’s next generation video understanding toolbox and bench- mark. https://github.com/open-mmlab/mmaction2. Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., and Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. De Geest, R., Gavves, E., Ghodrati, A., Li, Z., Snoek, C., and Tuytelaars, T. (2016). Online action detection. In ECCV. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. 72 Eun, H., Moon, J., Park, J., Jung, C., and Kim, C. (2020). Learning to discriminate information for online action detection. In CVPR. Eun, H., Moon, J., Park, J., Jung, C., and Kim, C. (2021). Temporal filtering networks for online action detection. Pattern Recognition. Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., and Feichtenhofer, C. (2021). Multiscale vision transformers. Gao, J., Yang, Z., and Nevatia, R. (2017a). RED: Reinforced encoder-decoder networks for action anticipation. In BMVC. Gao, J., Yang, Z., Sun, C., Chen, K., and Nevatia, R. (2017b). TURN TAP: Temporal unit regression network for temporal action proposals. In ICCV. Gao, M., Zhou, Y., Xu, R., Socher, R., and Xiong, C. (2021). WOAD: Weakly supervised online action detection in untrimmed videos. In CVPR. Girdhar, R. and Grauman, K. (2021). Anticipative video transformer. Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780. Idrees, H., Zamir, A. R., Jiang, Y.-G., Gorban, A., Laptev, I., Sukthankar, R., and Shah, M. (2017). 
The THUMOS challenge on action recognition for videos “in the wild”. CVIU. Jaegle, A., Borgeaud, S., Alayrac, J.-B., Doersch, C., Ionescu, C., Ding, D., Koppula, S., Zoran, D., Brock, A., Shelhamer, E., et al. (2021a). Perceiver io: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795. Jaegle, A., Gimeno, F., Brock, A., Zisserman, A., Vinyals, O., and Carreira, J. (2021b). Perceiver: General perception with iterative attention. arXiv:2103.03206. Kim, Y. H., Nam, S., and Kim, S. J. (2021). Temporally smooth online action detection using cycle-consistent future anticipation. Pattern Recognition. Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In ICLR. Lin, T., Liu, X., Li, X., Ding, E., and Wen, S. (2019). BMN: Boundary-matching network for temporal action proposal generation. In ICCV. Lin, T., Zhao, X., Su, H., Wang, C., and Yang, M. (2018). BSN: Boundary sensitive network for temporal action proposal generation. In ECCV. Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal loss for dense object 73 detection. In ICCV, pages 2980–2988. Liu, X., He, P., Chen, W., and Gao, J. (2019). Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. Nawhal, M. and Mori, G. (2021). Activity graph transformer for temporal action localization. arXiv:2101.08540. Neimark, D., Bar, O., Zohar, M., and Asselmann, D. (2021). Video transformer network. arXiv:2102.00719. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in pytorch. Qu, S., Chen, G., Xu, D., Dong, J., Lu, F., and Knoll, A. (2020). LAP-Net: Adaptive features sampling via learning action progression for online action detection. arXiv:2011.07915. Ramanishka, V., Chen, Y.-T., Misu, T., and Saenko, K. (2018a). Toward driving scene understand- ing: A dataset for learning driver behavior and causal reasoning. In CVPR, pages 7699–7707. Ramanishka, V., Chen, Y.-T., Misu, T., and Saenko, K. (2018b). Toward driving scene understand- ing: A dataset for learning driver behavior and causal reasoning. In CVPR. Shou, Z., Chan, J., Zareian, A., Miyazawa, K., and Chang, S.-F. (2017). CDC: Convolutional- de-convolutional networks for precise temporal action localization in untrimmed videos. In CVPR. Shou, Z., Wang, D., and Chang, S.-F. (2016). Temporal action localization in untrimmed videos via multi-stage cnns. In CVPR. Tan, J., Tang, J., Wang, L., and Wu, G. (2021). Relaxed transformer decoders for direct action proposal generation. arXiv:2102.01894. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. In ICML. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (NeurIPS), volume 30. Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., and Van Gool, L. (2016). Temporal segment networks: Towards good practices for deep action recognition. In ECCV. 74 Wang, L., Yang, H., Wu, W., Yao, H., and Huang, H. (2021a). Temporal action proposal generation with transformers. arXiv preprint arXiv:2105.12043. 
Wang, S., Li, B., Khabsa, M., Fang, H., and Ma, H. (2020). Linformer: Self-attention with linear complexity. arXiv:2006.04768. Wang, X., Zhang, S., Qing, Z., Shao, Y., Zuo, Z., Gao, C., and Sang, N. (2021b). Oadtr: Online action detection with transformers. Xu, H., Das, A., and Saenko, K. (2017). R-C3D: Region convolutional 3d network for temporal activity detection. In ICCV. Xu, M., Gao, M., Chen, Y.-T., Davis, L. S., and Crandall, D. J. (2019). Temporal recurrent networks for online action detection. In ICCV. Xu, M., Xiong, Y., Chen, H., Li, X., Xia, W., Tu, Z., and Soatto, S. (2021a). Long short-term transformer for online action detection. Xu, W., Xu, Y., Chang, T., and Tu, Z. (2021b). Co-scale conv-attentional image transformers. arXiv:2104.06399. Zhang, C., Gupta, A., and Zisserman, A. (2021). Temporal query networks for fine-grained video understanding. In CVPR, pages 4486–4496. Zhao, P., Wang, J., Xie, L., Zhang, Y., Wang, Y., and Tian, Q. (2020). Privileged knowledge distillation for online action detection. arXiv:2011.09158. Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., and Lin, D. (2017). Temporal action detection with structured segment networks. In ICCV. Zhu, Z., Tang, W., Wang, L., Zheng, N., and Hua, G. (2021). Enriching local and global contexts for temporal action localization. In ICCV, pages 13516–13525. 75 CHAPTER 5 ACTIVITY-DRIVEN WEAKLY-SUPERVISED SPATIAL-TEMPORAL VIDEO OBJECT GROUNDING 5.1 Introduction Grounding natural language in visual data is a fundamental task in the multimedia and computer vision communities with a variety of applications, including image/video retrieval Karpathy and Fei-Fei (2015), robotics Alomari et al. (2017) and human-computer interactions Shridhar and Hsu (2018). Given an image/video and its description sentence, for example “break the eggs”, visual grounding aims at localizing the query objects described in the sentence on the given image or video. Recently, great progress has been made on image grounding Yang et al. (2019b); Karpathy and Fei-Fei (2015); Chen et al. (2018); Yang et al. (2019a). On the basis of this, researchers started to explore grounding in the video domain Zhou et al. (2018a); Shi et al. (2019); Chen et al. (2019b,a); Huang et al. (2018). Nevertheless, in video grounding, it is labor-intensive to annotate a considerable number of bounding boxes for queries in videos. To address this challenge, multiple instance learning (MIL) methods Zhou et al. (2018a); Shi et al. (2019); Chen et al. (2019b,a); Huang et al. (2018) were proposed, which do not require bounding box annotations in the training videos. Video object grounding is achieved in a weakly-supervised fashion, where only a video and its description sentence are required during training. However, these methods are only able to infer the spatial occurrence of the query objects, and cannot tell the temporal occurrence of the objects. This problem was later addressed in Chen et al. (2019b) by generating region proposal tubes using object tracking methods. But their method is only applicable to trimmed videos without camera shot cuts. We argue that a successful video grounding method should infer both the spatial and temporal occurrence of a query object without the need of expensive annotations. In addition, the method is expected to work in untrimmed videos, which can be of long duration and contain frequent visual inconsistency mainly caused by frame flickerings and camera shot cuts (see Fig. 5.1). 
A 76 Figure 5.1 Given a video and its description sentence, our goal is to achieve the spatio-temporal grounding of the described queries on challenging untrimmed videos, where camera shot cuts frequently appear. Spatio-temporal grounding grounds each query to specific spatial regions and the frames of a video where the query object appears. query object may appear discontinuously (it frequently appears and disappears) across frames in an untrimmed video. Existing video grounding methods Chen et al. (2019b) that rely on visual trackers Wang et al. (2019b,a) would undoubtedly fail as the trackers can be distracted if a camera shot cut appears. We propose a novel multiple instance learning method for spatio-temporal grounding on untrimmed videos. Our method does not require extensive annotations of spatial and temporal occurrence1 of the query object in training. At the spatial level, we assign each textual query to one of the region proposals in a frame, while at the temporal level, we represent each frame by query-specific region and ground each query to its relevant frames. We formulate spatio-temporal grounding as two MIL problems. The spatial MIL aims at selecting the best instance (top-ranked region) from a bag (frame). The temporal MIL aims at selecting the multiple instances (query occurring frames) from a bag (video). Two MILs are mutually guided to achieve the optimal spatio-temporal grounding results. We also propose to model human activity operating on the query object. This allows us to capture the physical states of the object as well as the spatial relations between human and the 1Temporal occurrence of an object means the object appear in some frames of a video. 77 Description: add lemon zest mayonnaise basil and black pepper into a blender.Description: break the eggs and separate the egg white and add water.Temporal occurrenceTemporal occurrence object. Most of existing visual grounding methods Karpathy and Fei-Fei (2015); Shi et al. (2019); Zhou et al. (2018a); Chen et al. (2019a) simply compute the similarity between the visual and the textual features of the query object as a measurement for selecting candidate regions for the query. However, there is a granularity gap between the coarse textual and rich visual modalities. For example, the text-level query object “potatoes” might correspond to “potatoes” in different physical states: “mashed potatoes” means visually paste-like potatoes, while “peel potatoes and cut” corresponds to cube-shape potatoes. Directly computing the feature similarity as in Karpathy and Fei-Fei (2015); Shi et al. (2019); Zhou et al. (2018a) leads to a large discrepancy between the text and visual features. To address this, first, we propose to enrich the textual representation by incorporating the activity performed on the query to better align the text feature with the diverse visual features. Second, we propose an activity-driven region proposal refinement to find high- quality region proposals. Most of existing visual grounding methods Shi et al. (2019); Zhou et al. (2018a); Chen et al. (2019a); Karpathy and Fei-Fei (2015) build a candidate pool of top-𝑁 region proposals, in which query objects could be missed. Proposals with a high recall rate by increasing 𝑁 typically lead to a large search space for grounding a query object. To tackle this dilemma, we exploit the intrinsic spatial relations between human and object using human activity to refine the search space for region proposal generation. Our work is different from Zhou et al. 
(2018a); Shi et al. (2019), which also focus on grounding untrimmed videos. However, they ground query objects every frame even if the frame does not contain the query. This could result in a lot of false positives in the frames where the queried object does not appear due to its sparse existence in a long untrimmed video. On the contrary, we infer both the bounding box and the temporal occurrence of the query object. Therefore, our method can be used in more realistic scenarios. Our main contribution can be summarized as follows: 1). We propose a spatio-temporal Multiple Instance Learning method to learn a spatio-temporal video grounding model for the challenging untrimmed videos in a weakly-supervised fashion; 2). We exploit the activity cues in the description sentence of the video, including enriching the query representation with activity 78 effect and refine the object proposal generation; 3). Extensive results demonstrate that our method outperforms state-of-the-art weakly-supervised object grounding model in untrimmed videos by a large margin. 5.2 Related Work Weakly-supervised Visual Grounding. Weakly-supervised image grounding Karpathy and Fei-Fei (2015); Chen et al. (2018); Rohrbach et al. (2016) has been extended to video domain Zhou et al. (2018a); Shi et al. (2019); Huang et al. (2018); Chen et al. (2019b,a), but they are only applicable to constrained scenarios. Early work Yu and Siskind (2013) grounded sentences to objects in the constrained videos that are recorded in lab. A reference grounding model Huang et al. (2018) extends proposal ranking Karpathy and Fei-Fei (2015) to video domain and further enhances the performance by modeling the reference relationships between video segments. Following Karpathy and Fei-Fei (2015), the work in Zhou et al. (2018a) extends proposal ranking to video domain via a frame-wise weighting strategy. They also introduce an object grounding dataset based on YouCookII Zhou et al. (2018b). The work in Shi et al. (2019); Chen et al. (2019a) follow the same problem setup as Zhou et al. (2018a) and boost grounding performance by using contextual similarity and cross-modal context reasoning. However, during inference Shi et al. (2019); Zhou et al. (2018a); Chen et al. (2019a) only ground query in the frames where the objects occur without grounding frame occurrence in the temporal domain. Thus, the output of their methods contain a lot of false positives in the frames without the presence of query objects. The VID-sentence dataset is introduced in Chen et al. (2019b), which first grounds spatio-temporal tubes for a query. But their method and dataset are only for trimmed videos. In this work, we aim at object grounding on untrimmed video streams by localizing the query objects in both spatial region and frame-level occurrence. Our method does not rely on tracking tubes due to frame flickering and camera shot cut in untrimmed videos. Fully-supervised Spatio-Temporal Grounding has been developed by combining with other tasks, such as object tracking Yang et al. (2019c), video captioning Zhou et al. (2019) and visual question answering Cadene et al. (2019). Yang et al. (2019c) add language description on an 79 Figure 5.2 Overview of our framework. Given a video and its description sentence as input, we first extract the region features via a pre-trained region proposal network (RPN) and find the high-quality proposals by the proposed activity-driven object proposal refinement module. 
Given the text data, we first encode the sentence by BERT and propose an activity-driven object state encoding module to enrich the query representation by incorporating the activity effect. Then, through a similarity alignment between the two modalities, spatio-temporal grounding is achieved by retrieving the frames that contain the queries and selecting the best-matched spatial region in the frames where the queries appear. During training, the spatial and temporal levels are mutually guided. Training details can be found in Section 5.3.3.

Yang et al. (2019c) add language descriptions to an object tracking dataset Fan et al. (2019) to adapt it to the grounding task and propose a grounding-and-tracking integration model. However, this dataset contains only a single object per video, which is much easier than our goal, where multiple objects in a query need to be grounded. Zhou et al. (2019) augment the challenging ActivityNet Captions dataset with 158K bounding box annotations and provide a framework that not only generates video captions but also links the sentence to the evidence in the video. However, these methods require dense spatio-temporal tube annotations for training, which are especially expensive to obtain. Our work aims at solving spatio-temporal grounding in a weakly-supervised setting without bounding box annotations.

Weakly-supervised Video Object Localization localizes an object class or a video tag in the visual content. The object class or video tag comes from human labeling, while the descriptive sentence in visual grounding can be obtained from existing web video descriptions uploaded by users or from YouTube Automatic Speech Recognition transcripts, which requires less human effort. Existing work on weakly-supervised object localization also formulates it as an MIL problem. Kwak et al. (2015) integrate object tracking and frame-wise object detection to achieve video object localization. Prest et al. (2012) extract spatio-temporal tubes as proposals to be ranked and selected. However, similar to Yang et al. (2019c); Chen et al. (2019b), these methods heavily rely on tracking, which is not applicable to long untrimmed videos due to camera shot cuts.

Weakly-supervised Temporal Grounding focuses on identifying relevant frames in a video from text descriptions without annotations of temporal boundaries. Existing work Mithun et al. (2019); Gao et al. (2019); Duan et al. (2018); Chen et al. (2020) extracts a set of pre-defined temporal segment proposals and selects the one that semantically best matches the description. Our task is more challenging, as we ground the query not only to its occurring frames but also to specific locations. Moreover, the query appears discontinuously in the temporal domain due to frequent camera shot cuts, which may not be addressed by finding the best-matched temporal proposal.

5.3 Our Approach

5.3.1 Problem Setup

Given an untrimmed video and its description sentence, the video grounding task grounds each query object described in the sentence to spatio-temporal visual regions in the video. The query can be either a noun, e.g.,
the word “potato”, or a pronoun with referential meaning, e.g., “they”. Video grounding on untrimmed videos is of great practical significance but is more challenging than on trimmed videos, since an untrimmed video contains large temporal incoherence caused by camera motion and camera shot cuts². We propose a spatio-temporal grounding model that can be applied to untrimmed videos with significant camera motion. Our model is trained in a weakly-supervised fashion, where only the video-description pairs are given during training; the spatial and temporal occurrences of the query objects are not given.

² A camera shot cut is the view change from one shot to another, e.g., from a distant-view shot to a close-view shot.

As shown in Fig. 5.2, our grounding model takes a video $V$ and its description sentence $S$ as a pairwise input, and predicts the temporal occurrence of the query object across frames (i.e., which frames contain the object) and the spatial location (a bounding box) of the object in the frames where it appears. The video contains $T$ frames and is denoted as $V = \{V_t\}_{t=1}^{T}$. Following Shi et al. (2019); Zhou et al. (2018a), each frame consists of $N$ region proposals, denoted as $V_t = \{v_t^n\}_{n=1}^{N}$, where $n$ indexes the proposals in the $t$-th frame. The description sentence $S$ includes $K$ queries, e.g., the queries “lettuce” and “pepper” in the description sentence “season the lettuce with salt and pepper”. Each query $s_k$ corresponds to a word or a phrase in $S$, and all queries in a sentence are denoted as $\{s_k\}_{k=1}^{K}$. The visual feature $v_t^n$ and the query feature $Q_k$ are encoded into a joint feature space, and their similarity is computed for grounding.

We formulate spatio-temporal grounding as a multiple instance learning (MIL) problem for untrimmed videos. We propose two ranking losses (one at the spatial level and one at the temporal level) that mutually guide each other to learn a shared metric space for grounding. We consider a weakly-supervised learning scenario without annotations of bounding boxes or temporal occurrences. An activity-driven encoder is proposed to better align the visual and text modalities by considering object state variations and a spatial location prior over region proposals.

5.3.2 Activity-driven Encoding

Activity cues in both the text and visual modalities are informative for grounding objects in an untrimmed video. For example, as shown in Fig. 5.3, the activity in the description sentence “mash the potatoes” results in paste-like potatoes in the visual data, while “peel potatoes and cut” results in cube-shaped potatoes. By modeling activities, the various physical states of an object can be captured at a fine-grained representation level, which allows us to accurately ground the object. In addition, the activity provides a spatial location prior for the object to be grounded. For example, “cut potatoes” indicates that the query “potatoes” should appear close to a human hand. This spatial location prior can be exploited to refine the candidate region proposals of the visual data.

5.3.2.1 Activity-driven Object State Encoding

To encode the query into a representative feature, previous work Shi et al. (2019); Zhou et al. (2018a) extracts each query word (e.g., “potatoes”) from the description sentence and then represents it with GloVe features Pennington et al. (2014). This is ineffective because there is a semantic granularity gap between the text modality and the visual modality (see Fig. 5.3).
Existing methods simply attach the same textual representation to diverse visual representations, which results in a text-visual misalignment problem. We propose to enrich the textual representation and align it with the diverse visual objects, which allows us to capture rich cues about object physical states.

Figure 5.3 Examples of an object in different states. For example, “potatoes” that are “mashed” and “peeled” differ in appearance. Blue words indicate query objects and red words indicate the activities applied on the objects. Best viewed in color.

Specifically, we introduce an activity-driven object state encoding module to enrich the query textual representation. We consider each query together with the predicate performed on it and reformulate each query as an object-activity pair. We use the Stanford CoreNLP parser Manning et al. (2014) to parse each noun or pronoun and its predicate from a sentence $S$. Meanwhile, each sentence is encoded by a pre-trained BERT model Devlin et al. (2019). Then, we crop the features of the $k$-th query and its predicate from the sentence representation as $q_k$ and $p_k$, respectively. The textual representation of the $k$-th query $s_k$ in the sentence is thus enriched to the object-activity pair $(q_k, p_k)$. The query set $Q = \{(q_1, p_1), \ldots, (q_K, p_K)\}$ denotes the textual representations of all queries in the sentence. With activity-driven object state encoding, the textual and visual modalities can be better aligned without the large granularity gap.

5.3.2.2 Activity-driven Object Proposal Refinement

Most existing visual grounding methods Karpathy and Fei-Fei (2015); Shi et al. (2019); Zhou et al. (2018a); Chen et al. (2019a) select the best-matched proposal out of a candidate pool that contains the top-$N$ region proposals as the grounded object. However, query objects may not be included in these proposals and thus are unlikely to be grounded. A naive solution is to increase $N$, but this leads to a large search space, especially for long untrimmed videos.

Human activity can provide a spatial location prior to refine region proposal generation. Intuitively, there is a spatial dependency between the activity performer and the activity receiver. For example, if a video is describing “peel a potato”, the “potato” tends to occur around the human's hands, which indicates the potential of a spatial location prior between the activity performer (human) and receivers (objects). We propose to model this spatial prior as a truncated normal distribution $\mathcal{N}(\mu_a, \nu_a)$ to mask out irrelevant region proposals, where $\mu_a$ is the pixel coordinate of an activity-relevant joint and $\nu_a$ is a hyperparameter. We apply the normal distribution to the activity-relevant key joints to form a mixture of Gaussians (see the heatmap in Fig. 5.2), and select the top-$N$ proposals by considering the densities. We use a pre-trained human detector He et al. (2017) and OpenPose Cao et al. (2019) to extract the human's joints. If no human is detected in a frame, as in Fig. 5.3, the frame is most likely captured in a close view; in that case we simply keep the top-$N$ region proposals without refinement. Using this method, we include more query-related proposals and make the query more likely to be grounded.
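To make the refinement step concrete, the following is a minimal sketch, not the released implementation, of how activity-relevant keypoints can be turned into a mixture-of-Gaussians spatial prior for selecting proposals. The function name, the proposal and joint formats, the omission of the truncation, and the way the density is combined with the RPN confidence are illustrative assumptions.

```python
import numpy as np

def refine_proposals(proposal_boxes, proposal_scores, joints, nu_a=40.0, top_n=20):
    """Re-rank region proposals with a keypoint-centered spatial prior.

    proposal_boxes:  (M, 4) array of [x1, y1, x2, y2] RPN candidates.
    proposal_scores: (M,) RPN confidence scores.
    joints:          (J, 2) pixel coordinates of activity-relevant joints
                     (e.g., wrists from OpenPose); empty if no person is detected.
    nu_a:            spread of the spatial prior (40 in Section 5.4.3).
    """
    if len(joints) == 0:
        # Close-up view with no detected person: keep the top-N RPN proposals unchanged.
        keep = np.argsort(-proposal_scores)[:top_n]
        return proposal_boxes[keep]

    # Centers of all candidate proposals.
    centers = np.stack([(proposal_boxes[:, 0] + proposal_boxes[:, 2]) / 2.0,
                        (proposal_boxes[:, 1] + proposal_boxes[:, 3]) / 2.0], axis=1)

    # Mixture-of-Gaussians density: a sum of isotropic Gaussians centered at each
    # activity-relevant joint (truncation of the normal is omitted for brevity).
    diff = centers[:, None, :] - joints[None, :, :]          # (M, J, 2)
    sq_dist = (diff ** 2).sum(axis=-1)                       # (M, J)
    density = np.exp(-sq_dist / (2.0 * nu_a ** 2)).sum(-1)   # (M,)

    # Keep the top-N proposals after weighting RPN confidence by the spatial prior.
    keep = np.argsort(-(proposal_scores * density))[:top_n]
    return proposal_boxes[keep]
```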
5.3.3 Spatio-Temporal MIL for Video Grounding

We consider a weakly-supervised learning scenario, where the only supervision is the sentence description of the video. Existing work Chen et al. (2019b) addresses weakly-supervised spatio-temporal grounding for trimmed videos using object tracking to generate proposal tubes. Different from Chen et al. (2019b), our goal is to achieve untrimmed video grounding. This is more challenging, as an object does not necessarily appear in every frame and usually occurs discontinuously due to frequent camera shot cuts in untrimmed videos. In this case, tracking-based grounding methods would undoubtedly fail. To address this problem, we resort to MIL and propose a novel spatio-temporal MIL framework. At the spatial level, we aim at grounding each textual query to one of the $N$ object proposals $v_t^n$ ($n \in [1, N]$) extracted from a frame. At the temporal level, we aim at grounding each textual query to the frames of the video where it occurs.

Both the temporal and spatial MILs are formulated with pair-wise ranking losses. The losses encourage the correct matching of an aligned video-sentence pair and discourage the matching of an unaligned pair, i.e., a pair in which the sentence does not belong to the video³. The spatial and temporal MILs are mutually guided to learn a spatio-temporal grounding model.

³ We consider a video and its descriptive sentence as an aligned video-sentence pair, and define a video and a query in its descriptive sentence as an aligned video-query pair. Similarly, an unaligned video-sentence pair is one in which the sentence does not describe the video but describes another video in the current batch.

5.3.3.1 Spatial-level MIL

The goal of spatial grounding is to ground each referred query to one of the top-$N$ region proposals in a frame. To obtain the query-specific region, we use the normalized cosine distance as the metric of region-query similarity $a(v_t^n, Q_k)$:

$$a(v_t^n, Q_k) = \frac{{v_t^n}^{\top} (q_k + p_k)}{\left\| v_t^n \right\| \left\| q_k + p_k \right\|}, \qquad (5.1)$$

where $(q_k + p_k)$ and $v_t^n$ are the feature embeddings of the textual object-activity pair of the $k$-th query and of the region proposal, respectively, in a joint $d$-dimensional feature space; $k$, $t$, and $n$ index queries, frames, and region proposals, respectively, and $\top$ denotes the transpose.

The spatial-level MIL regards each frame as a bag and all region proposals in the frame as instances in the bag. The instance score w.r.t. query $Q_k$ is the region-query similarity $a(v_t^n, Q_k)$ computed by Eq. 5.1. Following MIL, a bag is represented by its most positive instance, which is achieved by a $\max$ operation. Thus, the bag-level score, which denotes the frame-query similarity, is computed as $S(V_t, Q_k) = \max_{n} a(v_t^n, Q_k)$. Following Shi et al. (2019); Chen et al. (2019b); Zhou et al. (2018a), the spatial MIL is formulated as a pair-wise ranking:

$$S(V_t, Q_k) > \max\left( S(V_t', Q_k),\; S(V_t, Q_j') \right), \qquad (5.2)$$

where $(V_t, Q_j')$ and $(V_t', Q_k)$ are the two cases of unaligned frame-query pairs. $Q_j'$ is the $j$-th unaligned query w.r.t. the region proposals $V_t$, while $V_t'$ consists of the region proposals in a video frame unaligned with the current query $Q_k$. Eq. 5.2 encourages the correct proposal matching for a query $Q_k$ by $S(V_t, Q_k) > S(V_t', Q_k)$, and the correct query matching for a frame $V_t$ by $S(V_t, Q_k) > S(V_t, Q_j')$.
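As a concrete reading of Eq. 5.1 and the spatial bag score, the sketch below computes the normalized cosine similarity between every region proposal and the activity-enriched query, then max-pools over proposals to obtain the frame-query similarity. Tensor shapes and variable names are assumed for illustration and do not reflect the released implementation.

```python
import torch
import torch.nn.functional as F

def region_query_similarity(v, q, p):
    """Eq. 5.1: cosine similarity between region proposals and an object-activity query.

    v: (T, N, d) region proposal embeddings for T frames with N proposals each.
    q: (d,) query (object) embedding.   p: (d,) predicate (activity) embedding.
    Returns a: (T, N) region-query similarities a(v_t^n, Q_k).
    """
    query = F.normalize(q + p, dim=-1)          # activity-enriched query, unit norm
    regions = F.normalize(v, dim=-1)            # unit-norm region embeddings
    return torch.einsum('tnd,d->tn', regions, query)

def frame_bag_score(a):
    """Spatial MIL bag score S(V_t, Q_k): each frame is represented by its best proposal."""
    return a.max(dim=-1).values                 # (T,)

# Toy usage with random features.
T, N, d = 16, 20, 512
a = region_query_similarity(torch.randn(T, N, d), torch.randn(d), torch.randn(d))
S = frame_bag_score(a)                          # per-frame frame-query similarity
```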
To achieve the pair-wise ranking in Eq. 5.2, the following frame-query ranking loss with margin $\Delta_s$ is minimized during training:

$$\mathcal{L}_{rank}^{t,(k,j)} = \max\left( 0,\; \max\left( S(V_t', Q_k),\; S(V_t, Q_j') \right) - S(V_t, Q_k) + \Delta_s \right). \qquad (5.3)$$

This objective encourages the similarities of aligned pairs to be larger than those of unaligned pairs by a gap of $\Delta_s$. Furthermore, by aggregating over every query in the unaligned description sentence, the spatial-level ranking loss is defined as:

$$\mathcal{L}_{rank}^{t,k} = \frac{1}{K'} \sum_{j=1}^{K'} \mathbb{I}\left( Q_j' \neq Q_k \right) \mathcal{L}_{rank}^{t,(k,j)}, \qquad (5.4)$$

where the negative query set $Q'$ contains $K'$ queries. Note that if $Q_k$ meets the same query $Q_j'$ in the negative query set $Q'$, that pair does not contribute to the ranking loss. However, if the queries in $Q$ and $Q'$ only share the object and have different activities, such as “mash the potato” and “peel the potato”, the pair still contributes to the ranking loss, because of the large discrepancy in appearance.

The spatial MIL considers each frame as a bag. However, in untrimmed videos, a query appears only in a subset of frames; the frames without the query are actually noisy positive bags. In Sections 5.3.3.2 and 5.3.3.3, we discuss how to alleviate these false positive bags with the guidance of temporal grounding.

5.3.3.2 Temporal-level MIL

In temporal grounding, we aim at predicting the temporal occurrence of the queries across frames. In our weakly-supervised setting, we do not have access to temporal occurrence annotations, so we again formulate the problem as MIL. In this case, each video is considered a bag and the frames of the video are the instances in the bag. The instance score is the frame-query similarity. However, a query object only occupies a small region of a frame, so it is not effective to align the query with the entire frame. We therefore resort to the spatial-level MIL results as guidance and represent each instance by its best-matched region proposal, such that the instance score is $S(V_t, Q_k) = \max_{n} a(v_t^n, Q_k)$.

In an untrimmed video, the query object may appear discontinuously across frames. Thus, it is not appropriate to represent the bag by the best-matched instance, as this ignores other positive instances that contain the query. This is different from the spatial-level MIL, where an object tends to appear in a concentrated region of the frame. Hence, in the temporal-level MIL, the bag score, i.e., the video-query pair-wise similarity, should be the overall score of all positive instances in the bag, $\frac{1}{T}\sum_{t=1}^{T} S(V_t, Q_k)$, instead of the best-matched instance. Since we have the video-query pair as the bag-level annotation, the temporal-level MIL is formulated as a ranking problem:

$$\frac{1}{T}\sum_{t=1}^{T} S(V_t, Q_k) > \max\left( \frac{1}{T}\sum_{t=1}^{T} S(V_t', Q_k),\; \frac{1}{T}\sum_{t=1}^{T} S(V_t, Q_j') \right), \qquad (5.5)$$

which is also a pair-wise ranking. $V_t'$ and $Q_j'$ indicate a negative video frame and the $j$-th query in the negative query set, respectively. Eq. 5.5 encourages an aligned video-query pair $(V, Q_k)$ to be better matched than the two types of unaligned video-query pairs $(V', Q_k)$ and $(V, Q_j')$. The number of instances in each bag is $T$. Compared with a $\max$ operation, using the average instance score as the bag score helps avoid a degenerate solution in which most frames are predicted to be irrelevant to the queries.

Moreover, consecutive frames in a video are correlated, but their visual context is not necessarily continuous due to frequent camera shot cuts. Therefore, the simple arithmetic mean in Eq. 5.5 is not adaptive enough for temporal grounding in untrimmed videos.
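A minimal sketch of the two hinge ranking losses, Eq. 5.3 at the spatial level and Eq. 5.5 at the temporal level, is given below. The optional frame weights anticipate the attention module introduced next; the shapes, the function names, the reuse of the same weights for negative pairs, and the default margins (taken from Section 5.4.3) are simplifying assumptions.

```python
import torch

def spatial_rank_loss(S_pos, S_neg_frame, S_neg_query, delta_s=5.0):
    """Eq. 5.3: frame-level hinge ranking loss for one (frame, query) pair.

    S_pos:       scalar S(V_t, Q_k) for the aligned frame-query pair.
    S_neg_frame: scalar S(V'_t, Q_k) for an unaligned frame with the same query.
    S_neg_query: scalar S(V_t, Q'_j) for the frame with an unaligned query.
    """
    hardest = torch.max(S_neg_frame, S_neg_query)
    return torch.clamp(hardest - S_pos + delta_s, min=0.0)

def temporal_rank_loss(S_pos, S_neg_video, S_neg_query, delta_t=10.0, weights=None):
    """Eq. 5.5: video-level hinge ranking loss over per-frame bag scores.

    S_pos, S_neg_video, S_neg_query: (T,) frame-query similarities for the aligned
    video, an unaligned video, and an unaligned query, respectively.
    weights: optional (T,) frame weights from the attention module introduced next;
             leaving them unset recovers the plain arithmetic mean of Eq. 5.5.
    """
    if weights is None:
        weights = torch.ones_like(S_pos)
    gamma_pos = (weights * S_pos).mean()
    gamma_neg = torch.max((weights * S_neg_video).mean(),
                          (weights * S_neg_query).mean())
    return torch.clamp(gamma_neg - gamma_pos + delta_t, min=0.0)
```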
In this chapter, we propose an attention module to learn the weight of each frame. Specifically, we extract each frame's features from the last layer of a VGG-16 backbone and then encode the frame features with a self-attention layer as $f_t$, $t \in [1, T]$. Based on these, we compute the weight $w_t$ of each frame by a linear layer and a sigmoid activation applied to $f_t$. Note that the weight of each frame is agnostic to the different queries, while the continuity of the video content is addressed by the temporal consistency of the frame weights. To achieve the goal of Eq. 5.5 while considering the temporal context, we propose to minimize the following temporal ranking loss:

$$\mathcal{L}_{tem}^{k,j} = \max\left( 0,\; \max\left( \Gamma(V', Q_k),\; \Gamma(V, Q_j') \right) - \Gamma(V, Q_k) + \Delta_t \right), \qquad (5.6)$$

where $\Gamma(V, Q_k) = \frac{1}{T}\sum_{t=1}^{T} w_t\, S(V_t, Q_k)$ is the query-specific bag score computed as a weighted sum of the query-specific frame scores. $\mathcal{L}_{tem}^{k,j}$ encourages the video $V$ and its paired query $Q_k$ to be better aligned than a query $Q_j'$ in the negative query set $Q'$, and $\Delta_t$ serves as the similarity margin for temporal-level grounding. Finally, the temporal video-query ranking loss is defined as the average over the entire unaligned query set:

$$\mathcal{L}_{tem}^{k} = \frac{1}{K'} \sum_{j=1}^{K'} \mathbb{I}\left( Q_k \neq Q_j' \right) \mathcal{L}_{tem}^{k,j}. \qquad (5.7)$$

The temporal MIL allows the queries to find the frames where they occur in a given video.

5.3.3.3 Overall objective function

In an untrimmed raw video, a query does not necessarily occur in every frame. Previous work Shi et al. (2019) proposes a contextual similarity to weight the importance of frames corresponding to a query. In our work, we have the temporal MIL to learn a query-specific attention over the temporal domain, so only the query-related frames should contribute to spatial grounding. We propose to utilize our temporal-level grounding results as guidance to mask out the contributions of query-irrelevant frames in the spatial-level MIL ranking loss:

$$\mathcal{L}_{spatio}^{k} = \sum_{t=1}^{T} \mathbb{I}\left( S(V_t, Q_k) > 0 \right) \mathcal{L}_{rank}^{t,k}, \qquad (5.8)$$

where $\mathcal{L}_{rank}^{t,k}$ is computed by Eq. 5.4. The temporal grounding result $\mathbb{I}(S(V_t, Q_k) > 0)$ is incorporated into the spatial ranking loss so that the spatial and temporal MILs are mutually guided. We add a penalty term to avoid the trivial solution $S(V_t, Q_k) = 0$. The final objective function of our model for query $k$ is formulated as:

$$\mathcal{L}^{k} = \mathcal{L}_{spatio}^{k} + \mathcal{L}_{tem}^{k} + \frac{\lambda}{T} \sum_{t=1}^{T} -w_t\, S(V_t, Q_k), \qquad (5.9)$$

where $\lambda$ is the weight of the sparsity constraint. The final objective is the average of the ranking loss over all queries, summarized as $\mathcal{L} = \frac{1}{K}\sum_{k=1}^{K} \mathcal{L}^{k}$.
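The sketch below assembles the mutually guided objective of Eqs. 5.8 and 5.9 for a single query; it assumes the per-frame spatial losses, the temporal loss, and the attention weights have already been computed, and the variable names are illustrative rather than taken from the released code.

```python
import torch

def query_objective(S, spatial_losses, temporal_loss, w, lam=0.9):
    """Eqs. 5.8-5.9 for a single query k (names and shapes are illustrative).

    S:              (T,) frame-query similarities S(V_t, Q_k).
    spatial_losses: (T,) per-frame spatial ranking losses L_rank^{t,k} from Eq. 5.4.
    temporal_loss:  scalar temporal ranking loss L_tem^k from Eq. 5.7.
    w:              (T,) frame attention weights.
    lam:            sparsity weight (0.9 in Section 5.4.3).
    """
    # Eq. 5.8: temporal grounding masks out query-irrelevant frames in the spatial loss.
    mask = (S > 0).float()
    loss_spatio = (mask * spatial_losses).sum()

    # Eq. 5.9: mutually guided losses plus the sparsity penalty that discourages
    # the trivial solution S(V_t, Q_k) = 0 for all frames.
    sparsity = -(w * S).mean()
    return loss_spatio + temporal_loss + lam * sparsity
```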
5.4 Experiments

Following Shi et al. (2019), we train and evaluate our model on the YouCookII dataset Zhou et al. (2018b) in a weakly-supervised setting. In addition, we validate the generalization ability of our model on the RoboWatch dataset Sener et al. (2015).

5.4.1 Dataset

YouCookII Zhou et al. (2018b) contains 2,000 cooking videos from 89 recipes. Each video recipe consists of 3 to 15 steps, and each step is described by a sentence including multiple queries. We follow Zhou et al. (2018a); Shi et al. (2019) to extract 15K video-description pairs from the steps. The training, validation, and testing splits contain 5161, 3483, and 1560 pairs, respectively. The average duration of each step is 19.6 s. Bounding box annotations Zhou et al. (2018a) for the 67 most frequently appearing objects in the description sentences of the validation and testing splits are used. The presence and bounding boxes of the objects are labeled every second of a video, which can be used to evaluate spatio-temporal grounding models.

RoboWatch Sener et al. (2015) contains 255 YouTube instructional videos, each of which also contains multiple steps. Huang et al. (2018) provide bounding box annotations for a part of these videos, and the query can be either a word or a phrase. We follow Shi et al. (2019) to evaluate the generalization ability of our model trained on the YouCookII dataset Zhou et al. (2018a), and we evaluate on the aligned pairs of video and query in RoboWatch. Since each query appears in all of the annotated frames of its video, we only evaluate spatial grounding on the RoboWatch dataset.

5.4.2 Evaluation Metric

We follow Shi et al. (2019); Zhou et al. (2018a); Chen et al. (2019a) to evaluate spatial grounding performance using box accuracy and query accuracy. The box accuracy is defined as the ratio of correctly grounded boxes to all grounded boxes, using a threshold (i.e., 50%) on the Intersection-over-Union (IoU) between the grounded box and its corresponding ground truth. Query accuracy is defined as the ratio of correctly grounded queries to all queries. Following Shi et al. (2019), we report the average of the per-class accuracies and the global accuracy that disregards the class, denoted as macro-accuracy and micro-accuracy, respectively. In addition, we follow an existing temporal grounding method Mithun et al. (2019) and compute the temporal IoU (tIoU) between the grounded and ground-truth temporal occurrences as the temporal grounding metric.

However, in previous work Shi et al. (2019); Zhou et al. (2018a), the box accuracy and query accuracy consider only the frames in which the query occurs. These two evaluation metrics ignore the frames in which no query object appears, and thus are not suitable for evaluating spatio-temporal grounding on untrimmed videos. We propose the following metric to evaluate spatio-temporal grounding models for untrimmed videos:

$$\mathrm{stACC} = \frac{1}{\left| \mathcal{S}^{(U)} \right|} \sum_{t \in \mathcal{S}^{(I)}} \mathbb{I}\left( \mathrm{IoU}(\hat{r}_t, r_t) > R \right), \qquad (5.10)$$

where $\mathcal{S}^{(U)}$ is the union set of frames in which either the ground-truth or the grounded bounding box is located for a query in the entire video, and $\mathcal{S}^{(I)}$ is the intersection set of frames in which both the ground-truth and the grounded bounding boxes occur simultaneously for the query. To compute the intersection, each grounded box is counted as correct if the IoU between the grounded box $\hat{r}_t$ and its corresponding ground truth $r_t$ exceeds the threshold $R$. $\mathbb{I}(\cdot)$ is an indicator function. Similar to the existing grounding metrics, the proposed stACC can be used to compute box accuracy and query accuracy by considering the class of each query. The metric in Eq. 5.10 will be used to evaluate spatio-temporal grounding performance.
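The proposed stACC metric in Eq. 5.10 can be computed per query as in the following sketch; the dictionary-based frame indexing and the helper name are assumptions for illustration.

```python
def st_accuracy(pred_boxes, gt_boxes, iou_threshold=0.5):
    """Eq. 5.10: stACC for one query in one video (dictionary indexing is illustrative).

    pred_boxes / gt_boxes: dicts mapping frame index -> [x1, y1, x2, y2]; a frame is
    absent from a dict when the query is not grounded / not annotated in that frame.
    """
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-8)

    union_frames = set(pred_boxes) | set(gt_boxes)          # S^(U)
    inter_frames = set(pred_boxes) & set(gt_boxes)          # S^(I)
    if not union_frames:
        return 0.0
    hits = sum(iou(pred_boxes[t], gt_boxes[t]) > iou_threshold for t in inter_frames)
    return hits / len(union_frames)
```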
5.4.3 Implementation Details

Following Shi et al. (2019), the description sentence is parsed by the Stanford CoreNLP parser Manning et al. (2014) into nouns and pronouns. We also parse the predicates of the nouns/pronouns in the description sentence with SpaCy Honnibal and Johnson (2015). A pre-trained BERT model Devlin et al. (2019) is applied to encode the sentence. For the visual modality, a Faster R-CNN framework Ren et al. (2015) with a VGG-Net backbone Simonyan and Zisserman (2014) pre-trained on Visual Genome Krishna et al. (2017) is applied to extract the top-20 confident region proposals for each frame, which is the same setting as Shi et al. (2019); Zhou et al. (2018a). We uniformly sample 16 frames from each video. The hyperparameter $\nu_a$ in the spatial prior is set to 40. Visual and textual features are embedded into a joint 512-dimensional feature space, and $\tanh$ is used as the activation function for both the visual and text embeddings. We implement the network in PyTorch and train on a TITAN Xp GPU. Adam with a learning rate of 0.001 is used for optimization. The ranking margins $\Delta_t$ and $\Delta_s$ are set to 10 and 5, respectively, and the constraint weight $\lambda$ is set to 0.9. We use a batch size of 8 in all experiments; thus, each positive sample is coupled with 7 negative samples. Following Shi et al. (2019); Zhou et al. (2018a); Chen et al. (2019a), we report grounding results on both the validation and test splits.

5.4.4 Comparison

5.4.4.1 Spatial Grounding on YouCookII Dataset

We compare our method with the state-of-the-art weakly-supervised video grounding methods Chen et al. (2019b); Shi et al. (2019); Zhou et al. (2018a) and with two extensions of the image grounding methods DVSA Karpathy and Fei-Fei (2015) and GroundR Rohrbach et al. (2016). Following Shi et al. (2019); Zhou et al. (2018a), we utilize an RPN to extract region proposals. Existing work Shi et al. (2019); Zhou et al. (2018a) only evaluates spatial grounding accuracy on the frames where the query occurs; the frames without the query are disregarded in evaluation. We follow this evaluation setting and report results under the four metrics used in Shi et al. (2019). As shown in Table 5.1, the proposed method consistently outperforms the comparison methods on spatial grounding. This is because we bridge the granularity gap between the text and visual domains by considering the activity effect. In addition, our spatio-temporal MIL framework ensures that a spatial grounding model is learned from the frames in which the query appears, even without temporal annotations.

Table 5.1 Weakly-supervised spatial grounding results on YouCookII. "Pre-trained" indicates the dataset on which the RPN is pre-trained. Zhou et al., extended GroundR, and Chen et al. only report macro box accuracy in their papers. Each cell reports val / test accuracy (%).

Methods | Pre-trained | Box macro | Box micro | Query macro | Query micro
Extended GroundR | MSCOCO | 19.63 / 19.94 | - / - | - / - | - / -
Zhou et al. | MSCOCO | 30.31 / 31.73 | - / - | - / - | - / -
Chen et al. | MSCOCO | 33.24 / 34.90 | - / - | - / - | - / -
Extended DVSA | VisualGenome | 36.90 / 37.55 | 44.26 / 44.16 | 38.48 / 39.31 | 46.27 / 46.41
Shi et al. | VisualGenome | 39.54 / 40.71 | 46.41 / 46.33 | 41.29 / 42.45 | 48.52 / 48.41
Ours (BERT) | VisualGenome | 37.40 / 38.88 | 48.12 / 45.20 | 39.00 / 40.55 | 46.10 / 47.23
Ours (Glove+Activity) | VisualGenome | 38.50 / 39.28 | 46.57 / 45.85 | 40.11 / 41.07 | 48.59 / 47.91
Our full model | VisualGenome | 40.66 / 41.67 | 49.11 / 48.22 | 41.43 / 42.55 | 49.71 / 48.91

5.4.4.2 Temporal Grounding on YouCookII Dataset

We compare our method with an existing weakly-supervised video temporal grounding method, TGA Mithun et al. (2019), and with an extension of Shi et al. (2019) to temporal grounding. The extension of Shi et al. (2019) is obtained by extending its frame-query contextual similarity module, which performs 0-1 normalization of the frame-query importance across frames during training.
We use it at test time to mask out the frame-query pairs whose contextual similarity score is less than 0.5.

Table 5.2 Weakly-supervised temporal grounding results on YouCookII. tIoU is used as the evaluation metric.

Methods | tIoU
TGA Mithun et al. (2019) | 29.43
Extension of Shi et al. (2019) | 27.12
Ours | 39.51

As shown in Table 5.2, our approach significantly outperforms the extension of Shi et al. (2019) and TGA Mithun et al. (2019), by 12% and 10%, respectively. This is because these temporal grounding methods ground a query to the relevant frames based on the similarity between the query and the entire frame. In our method, spatial grounding guides the temporal grounding to focus on the query-specific region, which allows us to represent the visual data more accurately.

5.4.4.3 Spatio-temporal Grounding on YouCookII Dataset

Since there is no existing weakly-supervised spatio-temporal grounding method for untrimmed videos, we extend Shi et al. (2019) to ground temporal occurrences (Extension of Shi et al. (2019)) using the method described above. We also compare with the weakly-supervised spatio-temporal grounding method Chen et al. (2019b), which was originally developed for trimmed videos. The performance of the spatio-temporal grounding methods is evaluated using the stACC metric defined in Eq. 5.10. As shown in Table 5.3, our approach significantly outperforms the Extension of Shi et al. (2019) by 5∼8%. This shows that a direct extension to spatio-temporal grounding is far from solving this challenging problem; our method is more effective since we solve the problem with a mutually guided MIL. When generalized to untrimmed videos, Chen et al. (2019b) shows inferior results to ours, because it relies heavily on a visual tracker that easily fails due to the frequent camera shot cuts in untrimmed videos.

Table 5.3 Weakly-supervised spatio-temporal grounding results on YouCookII. stACC is used as the evaluation metric. Each cell reports val / test accuracy (%).

Methods | Macro | Micro
Extension of Shi et al. (2019) | 15.89 / 19.10 | 17.35 / 18.54
Chen et al. (2019b) | 7.31 / 7.70 | 8.02 / 8.79
Ours-max | 5.17 / 5.25 | 5.93 / 6.11
Ours w/o attention | 18.94 / 20.98 | 22.31 / 21.41
Our full model | 21.73 / 24.25 | 25.50 / 25.65

Table 5.4 Generalization results on RoboWatch using query micro-accuracy (%). "*" indicates results obtained by running the authors' code on our side; all other comparison results are from the original papers.

Methods | All | Unseen split
Extended DVSA Karpathy and Fei-Fei (2015) | 28.25 | 25.12*
Shi et al. (2019) | 31.68 | 26.79*
Ours w/o activity | 30.11 | 26.56
Our full model | 34.21 | 35.97

5.4.4.4 Generalizing the Grounding Model to the RoboWatch Dataset

Following Shi et al. (2019), we conduct a generalization experiment with the grounding model trained on the YouCookII dataset. We train our grounding model using the nouns and pronouns parsed from the sentences of YouCookII and directly test the grounding model on the RoboWatch dataset. We compare our method with two existing methods, Shi et al. (2019) and extended DVSA Karpathy and Fei-Fei (2015), and with a variant of our method that does not contain the activity-driven object state encoding module. The comparison is conducted on two data splits: the entire test set of RoboWatch and its unseen split, which only consists of objects that never occur in YouCookII, such as "oreo", "flesh", "alcohol", "hanger", "tie", etc. As shown in Table 5.4, the proposed activity-driven model outperforms the variant Ours w/o activity and the two existing methods, Shi et al. (2019) and extended DVSA Karpathy and Fei-Fei (2015), by a large margin. Moreover, our full model's performance on the unseen split is even better than its performance on the entire test set, denoted as "all".
This is because we model the activity effect on objects' physical states. Thus, even though our model has never seen the query during training, it can utilize the seen activity information to ground the unseen query on which the activity is performed.

5.4.5 Ablation Studies

5.4.5.1 Activity-driven Object-State Encoding

We conduct an ablation study on the activity-driven object state encoding module with the following two variants: 1) "BERT", which first encodes the entire description sentence with a pre-trained BERT model Devlin et al. (2019) and then extracts the query embeddings of the objects without considering the activity; 2) "Glove+Activity", which first extracts the predicates and nouns/pronouns from the description sentence with Manning et al. (2014) and then encodes each predicate-object pair into 200-dimensional GloVe embeddings Pennington et al. (2014). Note that GloVe is used as the word embedding in Shi et al. (2019); Zhou et al. (2018a). As shown in Table 5.1, the superiority of our full model over the variant "BERT" shows that encoding the activity effect on object states benefits grounding. As expected, the activity-driven object state encoding module bridges the granularity gap between the coarse text modality and the rich visual modality by incorporating the underlying activity effect on object states into the text representation. Moreover, the performance gain from the activity cue is significantly larger than the gain from better query embeddings, i.e., GloVe versus BERT features. This further demonstrates the effectiveness of the proposed activity-driven encoding.

5.4.5.2 Temporal MIL Loss

We conduct an ablation study on temporal grounding with the following variants: 1) "Ours-max", which replaces the average-instance operation in Eq. 5.5 by a max operation, selecting the top-ranked query-specific frame to represent the bag; 2) "Ours w/o attention", which uses Eq. 5.5 as the temporal ranking loss but removes the guidance of the frame-consistency attention block. Table 5.3 shows that our full model achieves the best performance. Its superiority over "Ours w/o attention" demonstrates that temporal context information w.r.t. frame similarity plays an important role in grounding videos with large visual inconsistency. The variant "Ours-max" is inferior to the others, indicating that selecting the top-ranked frame as the video representation is not appropriate for temporal grounding. This is because, in an untrimmed video, the query may appear discontinuously across frames, leading to multiple relevant frames for the query in the video.

Table 5.5 Ablation study on YouCookII for our activity-driven region proposal refinement module. The results are box/query micro-accuracy (%). Results on the distant-view frame split of the test set and on the entire test set are reported.

Methods | Distant-view split (box / query) | All (box / query)
Ours w/o region refine | 21.35 / 20.54 | 48.07 / 48.72
Our full model | 21.85 / 21.03 | 48.22 / 48.91

Figure 5.4 Qualitative results of our spatio-temporal video grounding model. The yellow and cyan boxes are the grounded results of the corresponding queries in the description sentences. The white boxes are the ground truth. Best viewed in color.

5.4.5.3 Region Proposal Refinement

We conduct an ablation study for the activity-driven region proposal refinement module. In the YouCookII dataset, only 22% of the frames are captured in a distant view. Thus, we evaluate this module both on the distant-view split, which only contains frames with a visible human, and on the entire test set.
In the variant "Ours w/o region refine", we simply keep the top-$N$ proposals without refinement. As shown in Table 5.5, our method outperforms the variant without region refinement, which demonstrates the effectiveness of the region refinement module. Our full model refines the proposals to include more query-related proposals, which makes the query more likely to be grounded.

5.4.6 Qualitative Results

Qualitative results on the YouCookII dataset are shown in Fig. 5.4. Each row depicts 6 frames sampled from a video. Camera shot cuts appear frequently in these videos; but even though large visual inconsistency appears, our method is able to ground each query in terms of both its temporal occurrence and its spatial locations.

5.5 Summary

In this chapter, we investigate spatio-temporal grounding in untrimmed videos with frequent visual inconsistency in a weakly-supervised manner. We develop two novel MIL ranking losses for the spatial and temporal domains. Furthermore, to bridge the granularity gap between the coarse text information and the detailed visual information, we introduce an activity-driven object state encoding module to enhance the textual representation. Experiments on two popular datasets demonstrate the superiority of our method and its generalization ability to other datasets with unseen queries.

BIBLIOGRAPHY

Alomari, M., Duckworth, P., Hogg, D. C., and Cohn, A. G. (2017). Natural language acquisition and grounding for embodied robotic systems. In Thirty-First AAAI Conference on Artificial Intelligence.

Cadene, R., Ben-Younes, H., Cord, M., and Thome, N. (2019). Murel: Multimodal relational reasoning for visual question answering. In CVPR.

Cao, Z., Hidalgo, G. M., Simon, T., Wei, S., and Sheikh, Y. (2019). Openpose: Realtime multi-person 2d pose estimation using part affinity fields. TPAMI.

Chen, K., Gao, J., and Nevatia, R. (2018). Knowledge aided consistency for weakly supervised phrase grounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4042–4050.

Chen, L., Zhai, M., He, J., and Mori, G. (2019a). Object grounding via iterative context reasoning. In Proceedings of the IEEE International Conference on Computer Vision Workshops.

Chen, Z., Ma, L., Luo, W., Tang, P., and Wong, K.-Y. K. (2020). Look closer to ground better: Weakly-supervised temporal grounding of sentence in video. arXiv preprint arXiv:2001.09308.

Chen, Z., Ma, L., Luo, W., and Wong, K.-Y. K. (2019b). Weakly-supervised spatio-temporally grounding natural sentence in video.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In ACL, pages 4171–4186.

Duan, X., Huang, W., Gan, C., Wang, J., Zhu, W., and Huang, J. (2018). Weakly supervised dense event captioning in videos. In Advances in Neural Information Processing Systems, pages 3059–3069.

Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., and Ling, H. (2019). Lasot: A high-quality benchmark for large-scale single object tracking. In CVPR.

Gao, M., Davis, L., Socher, R., and Xiong, C. (2019). Wslln: Weakly supervised natural language localization networks. In EMNLP-IJCNLP.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In ICCV.

Honnibal, M. and Johnson, M. (2015). An improved non-monotonic transition system for dependency parsing. In EMNLP.

Huang, D.-A., Buch, S., Dery, L., Garg, A., Fei-Fei, L., and Niebles, J. C. (2018). Finding "it": Weakly-supervised, reference-aware visual grounding in instructional videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Karpathy, A. and Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.

Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., et al. (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.

Kwak, S., Cho, M., Laptev, I., Ponce, J., and Schmid, C. (2015). Unsupervised object discovery and tracking in video collections. In ICCV.

Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., and McClosky, D. (2014). The stanford corenlp natural language processing toolkit. In ACL.

Mithun, N. C., Paul, S., and Roy-Chowdhury, A. K. (2019). Weakly supervised video moment retrieval from text queries. In CVPR.

Pennington, J., Socher, R., and Manning, C. D. (2014). Glove: Global vectors for word representation. In EMNLP.

Prest, A., Leistner, C., Civera, J., Schmid, C., and Ferrari, V. (2012). Learning object class detectors from weakly annotated video. In CVPR.

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS.

Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., and Schiele, B. (2016). Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pages 817–834. Springer.

Sener, O., Zamir, A. R., Savarese, S., and Saxena, A. (2015). Unsupervised semantic parsing of video collections. In ICCV.

Shi, J., Xu, J., Gong, B., and Xu, C. (2019). Not all frames are equal: Weakly-supervised video grounding with contextual similarity and visual clustering losses. In CVPR, pages 10444–10452.

Shridhar, M. and Hsu, D. (2018). Interactive visual grounding of referring expressions for human-robot interaction. arXiv preprint arXiv:1806.03831.

Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Wang, N., Song, Y., Ma, C., Zhou, W., Liu, W., and Li, H. (2019a). Unsupervised deep tracking. In CVPR.

Wang, X., Jabri, A., and Efros, A. A. (2019b). Learning correspondence from the cycle-consistency of time. In CVPR.

Yang, S., Li, G., and Yu, Y. (2019a). Cross-modal relationship inference for grounding referring expressions. In CVPR.

Yang, Z., Gong, B., Wang, L., Huang, W., Yu, D., and Luo, J. (2019b). A fast and accurate one-stage approach to visual grounding. In ICCV.

Yang, Z., Kumar, T., Chen, T., and Luo, J. (2019c). Grounding-tracking-integration. arXiv preprint arXiv:1912.06316.

Yu, H. and Siskind, J. M. (2013). Grounded language learning from video described with sentences. In ACL.

Zhou, L., Kalantidis, Y., Chen, X., Corso, J. J., and Rohrbach, M. (2019). Grounded video description. In CVPR.

Zhou, L., Louis, N., and Corso, J. J. (2018a). Weakly-supervised video object grounding from text by loss weighting and object interaction.

Zhou, L., Xu, C., and Corso, J. J.
(2018b). Towards automatic learning of procedures from web instructional videos. In Thirty-Second AAAI Conference on Artificial Intelligence.

CHAPTER 6
EXPLAINABLE VIDEO ENTAILMENT WITH VISUALLY GROUNDED EVIDENCE

6.1 Introduction

Bridging the gap between computer vision and natural language processing is a rapidly growing research area spanning tasks such as visual captioning Zhou et al. (2020); Vinyals et al. (2015), VQA Lei et al. (2018); Antol et al. (2015); Tapaswi et al. (2016), and visual-textual retrieval Lei et al. (2020); Li et al. (2019). Liu et al. (2020) introduced a new video entailment problem to infer the semantic entailment between a premise video and a textual hypothesis. As shown in Fig. 6.1, the video entailment task Liu et al. (2020) aims at determining whether a textual statement is entailed or contradicted by a video. In Fig. 6.1, the label for the first statement with the premise is entailment, because the statement can be concluded from the dialog of the first clip, in which "the woman wearing jeans" appears. On the contrary, the second statement is labeled as contradiction, because the premise does not have evidence to support the statement. In this chapter, we aim to address video entailment with a faithful explanation.

The main challenge of video entailment is that it requires fine-grained reasoning to understand complex story-based videos and then make a correct judgment. The story-based videos are also accompanied by a textual dialog (subtitles) (see Fig. 6.1). In the existing method for video entailment Liu et al. (2020), video frames are exploited less than the dialog, because the model lacks a fine-grained understanding of the video and does not know which frames in the long video are related to the statement. However, the entities in the textual statement are usually people with their attributes, e.g., "A woman wearing jeans" (see Fig. 6.1), which should be inferred from the video frames rather than the dialog. To this end, we propose to enhance the entailment judgment by introducing a visual grounding model that links the entity described in a statement to the evidence in the video. This is motivated by the fact that a statement is usually related to only a small subset of the long, untrimmed video. Based on this, a visual grounding module for the entities described in the statement is developed to localize the clips where the entity appears and to guide the judgment to focus on the entity's occurring clips as well as the aligned sentences in the dialog.

Figure 6.1 Video entailment aims at judging whether a statement is entailed or contradicted by a video and its aligned textual dialog. A pair of real and fake statements has a similar structure and a subtle difference (marked by the red dotted line). We incorporate visual grounding into the entailment judgment. The entity grounding, e.g., "A woman wearing jeans", guides the entailment judgment module to focus on the entity-relevant frames and the corresponding sentences in the dialog (marked in blue on the temporal axis) to make a correct judgment. Best viewed in color.

For example, the statements in Fig. 6.1 are linked to the first and fourth clips and sentences, considering the entity "The woman wearing jeans". By highlighting the relevant clips and sentences, the details can be better understood compared to Liu et al. (2020), which lacks grounding guidance and considers all frames equally.
Visual grounding has been attempted in many video+language tasks, such as image captioning Zhou et al. (2019) and VQA Lei et al. (2019). However, it cannot be directly generalized to the entailment task, because bounding box annotations for grounding are not provided in the entailment dataset. Therefore, we resort to existing weakly-supervised object grounding methods Karpathy and Fei-Fei (2015); Chen et al. (2018) to address the training of the grounding module. But these methods are limited to explicit natural objects (e.g., "apple", "river"). Our grounding is more demanding, as we target described entities with fine-grained attributes, such as hair, clothes, and gender, to be grounded in challenging story-telling videos.

Furthermore, we aim at improving the faithfulness of the entailment model by evaluating whether the entailment is judged based on the correct evidence. A faithful entailment model should tell not only whether the statement is contradicted by the video, but also which words or phrases in the statement make it contradictory to the video. A pair of real/fake statements usually has a similar structure with only very subtle differences, involving the replacement of only a few words, e.g., "pony lessons in hour" versus "school tomorrow", marked by the red dotted line in Fig. 6.1. Thus, we propose to regularize the training of the entailment judgment module by encouraging the local explanation of the words' contributions in the statement to conform to this subtle difference.

Our main contribution is threefold. First, we propose a novel approach to address video entailment with visually grounded evidence. Second, we exploit the pairwise real/fake statements to add explainability to the entailment model, which can tell the specific words or phrases that make the statement contradictory to the video. Third, extensive results demonstrate that our method outperforms the state-of-the-art video entailment method.

6.2 Related Work

6.2.1 Visual Entailment

Natural language inference Dagan et al. (2005); Condoravdi et al. (2003); MacCartney and Manning (2009); Camburu et al. (2018) is the task of determining whether a hypothesis sentence is entailed or contradicted by a premise sentence, which is a fundamental task in natural language understanding. Inspired by textual entailment, visual entailment has recently been proposed to extend NLI to the visual domain. In visual entailment, the premise is an image or a video, and the goal is to predict whether the textual hypothesis can be confirmed from the visual premise. Researchers have begun to solve visual entailment mainly on image premises. SNLI-VE Xie et al. (2019) is a visual entailment dataset combining textual entailment Bowman et al. (2015) and the Flickr30k image captions Young et al. (2014). It also provides a solution model that utilizes ROI generation and models fine-grained cross-modal information. However, the hypotheses (e.g., "The two women are holding packages") are much more straightforward than the hypotheses in our video entailment.
e-SNLI-VE-2.0 Do et al. (2020) extends and corrects SNLI-VE Xie et al. (2019) with human-written language hypotheses. It also provides ground-truth explanations of why a hypothesis is entailed or contradicted by the premise. NLVR2 Suhr et al. (2019) is another image entailment dataset that requires quantitative and comparative reasoning, but similar to SNLI-VE Xie et al. (2019), it also mainly focuses on objects in natural images. Recently, Liu et al. (2020) proposed the Violin dataset, which focuses on video entailment. Video entailment is a challenging task because complex temporal dynamics occur in the video; fine-grained reasoning about social relations, human motions, and intentions is necessary to understand the story-based content and make a correct judgment.

6.2.2 Grounding for Video+Language Reasoning

Recently, many video+language tasks have tried to explicitly link the language sentence to the evidence in the video. Zhou et al. (2019) proposed a video description dataset with bounding box annotations of the referred objects; with this dataset, a good captioning model is expected to attend to the appropriate video regions. For video question answering, Lei et al. (2019) built a dataset with spatio-temporal grounding annotations, which requires the model to localize the temporal moments, detect the referred objects, and answer the questions. Different from captioning and VQA, video entailment needs a fine-grained understanding of entities with detailed attributes. Meanwhile, the existing video entailment dataset does not provide grounding annotations. Thus, we propose to achieve entity grounding in a weakly-supervised manner.

6.2.3 Weakly-supervised Entity Grounding

Visual grounding localizes a described entity to the regions where it visually occurs. Since bounding box annotations are very expensive, many efforts have been made to achieve object grounding in a weakly-supervised manner Karpathy and Fei-Fei (2015); Chen et al. (2018); Rohrbach et al. (2016), mainly based on multiple instance learning. It has also been extended to the video domain Zhou et al. (2018); Shi et al. (2019); Huang et al. (2018); Chen et al. (2019b,a, 2020) to achieve spatio-temporal grounding of entities in an untrimmed video. In the video entailment task, the visually related entities are mainly characters, while existing grounding methods aim at grounding natural objects. Our grounding requires a fine-grained understanding of human gender, dress, hair, and other attributes. Therefore, we cannot directly generalize existing grounding methods to video entailment.

Figure 6.2 Given a video, its aligned dialog in text, and a textual statement for the video as input, our goal is to predict whether the statement is entailed or contradicted by the video and dialog. Our model consists of three sub-networks: Entity Grounding, Entailment Judgment with Grounded Evidence, and Contradiction Local Explanation. The entity grounding module helps to find whether the described entity occurs in the video clips. Moreover, entity grounding guides the judgment module to focus on the entity-relevant clips and the corresponding sentences in the dialog (marked as "Key") to make a correct judgment. If a statement is judged as "contradiction", our model can also explain which words or phrases in the statement make it contradictory to the video by generating an explanation heatmap.
6.2.4 Multi-modal VQA

Different from image entailment, video entailment must understand story-based video content, such as movies. This is more challenging than plain videos, as multiple factors such as human interactions, emotions, motivations, and scenes come into play. Similar to existing VideoQA datasets Lei et al. (2019, 2020), the input to our entailment task is multi-modal, including both videos and textual subtitles. For multi-modal VQA, early fusion was commonly used to merge different modalities Na et al. (2017), while recent methods mainly leverage late fusion approaches Kim et al. (2018, 2019). Another line of work Kim et al. (2020) utilizes the content of the QA pairs to shift attention to the relevant modality and constrain the contribution of irrelevant ones. Video entailment requires a fine-grained understanding: a statement may relate only to details in a long, untrimmed video. Thus, we propose to ground the described entities to their occurring clips and highlight the dialog sentences aligned to those clips for entailment judgment.

6.3 Our Approach

Given a story-like video aligned with a textual dialog (subtitles) and a hypothesis statement, the entailment task is to predict whether the hypothesis statement is entailed or contradicted by the premise video (see the left of Fig. 6.2). The right part of Fig. 6.2 shows the overall pipeline of the proposed method. We decompose our model into three sub-networks, namely entity grounding, entailment judgment with grounded evidence, and contradiction local explanation, to address entailment in a modularized manner.

The motivation for grounding the entities described in the statement (e.g., "a woman wearing a red cape") to frames comes from the observation that the video modality is not well exploited compared to the dialog modality in the existing method Liu et al. (2020). However, many contradictory statements, such as those with incorrect attributes (e.g., "a woman wearing a blue cape" in Fig. 6.3), should be determined from the frames instead of the dialog. Moreover, the statements are written about different aspects of a video Liu et al. (2020), and a statement is usually related to a small subset of the video frames. The entity grounding helps to find the entity-relevant frames and then guides the entailment judgment module to highlight these frames. To learn a credible entailment judgment model, we propose to not only judge the semantic entailment but also explain which words or phrases make the statement contradictory to the video, via a heatmap that indicates the contribution of each word in the statement to the model prediction.

6.3.1 Preliminaries

Text Representation. Following Violin Liu et al. (2020), we use the BERT encoder Devlin et al.
(2019a) provided by Violin to represent the statement and the dialog, resulting in a 768-dimensional vector for each word. Each word is then further embedded into $d$ dimensions using a bi-directional LSTM, applied to both the statement and the dialog. A statement is tokenized into a word sequence of length $N_l$; a textual dialog is also tokenized and represented as a word sequence. After encoding, the statement is represented as $R = \{r_i\}_{i=1}^{N_l}$, in which $r_i$ indicates the $i$-th word's representation. The dialog is represented as $H = \{h_j\}_{j=1}^{N_s}$, in which $h_j$ indicates the $j$-th word's representation and $N_s$ denotes the number of words in the long dialog. The starting time $t_s^j$ and the ending time $t_e^j$ of the $j$-th sentence are also provided, which can be aligned with the video frames.

Video Representation. Following Violin Liu et al. (2020), we extract a sequence of visual features from the video frames and then encode the visual features with a bi-directional LSTM layer. The video is then represented as $C \in \mathbb{R}^{T \times d}$, where $T$ is the number of frames and $d$ is the feature dimension of each frame.

Figure 6.3 Training of the entity grounding module. We extract the positive entity $e_n$ from the statement aligned with the video and the negative entity $e'_n$ from a statement unaligned with the video. Moreover, real and fake statements come in pairs; thus, the entity in the fake statement can be utilized as a golden negative entity $e^*_n$, which is slightly different from $e_n$ and serves as a hard sample to enhance the training of the grounding model. The training process encourages the matching score of the positive entity to be larger than that of any negative entity. Best viewed in color.

To realize grounding, we first detect the people in the input video. Specifically, we extract the frames at the middle timestamp $(t_s^j + t_e^j)/2$ of each sentence and apply Faster R-CNN Ren et al. (2015) pretrained on COCO Lin et al. (2014) to detect all of the people in each frame and extract their features. Each person is represented by a 4096-dimensional vector, denoted as $v_k$. Each video is then formed as a set of persons $V = \{v_k\}_{k=1}^{K}$, where $v_k$ encodes the $k$-th person.

6.3.2 Entity Grounding Module

In the existing video entailment method Liu et al. (2020), the performance gain from the video modality is limited compared to the dialog modality. Visual information needs fine-grained understanding, but the existing work considers all frames equally even when they are not relevant to the statement. The video modality should be responsible for much of the information described in the statement, such as entity attributes (e.g., gender and clothes). We propose to leverage entity grounding in the video modality to improve the entailment judgment in a modularized manner (see Fig. 6.2). First, our grounding module achieves spatio-temporal grounding of the subject entity described in the statement. The predicted temporal occurrences of the entity are then used to guide the subsequent cross-modal entailment judgment. However, two technical challenges need to be handled to leverage visual grounding for the entailment task.
6.3.2 Entity Grounding Module

In the existing video entailment method Liu et al. (2020), the performance gain from the video modality is limited compared to the dialog modality. Visual information requires fine-grained understanding, but the existing work considers all frames equally even when they are not relevant to the statement. The video modality should account for much of the information described in the statement, such as entity attributes (e.g., gender and clothes). We therefore propose to leverage entity grounding in the video modality to improve entailment judgment in a modularized manner (see Fig. 6.2). Our grounding module is developed to achieve spatio-temporal grounding of the subject entity described in the statement, and the predicted temporal occurrences of the entity are used to guide the subsequent cross-modal entailment judgment. However, two technical challenges need to be handled to leverage visual grounding for the entailment task.

First, spatial-temporal annotations of entities are typically not available for the entailment task, so existing fully-supervised grounding-based VideoQA methods Lei et al. (2019) cannot be directly leveraged. We resort to multiple instance learning Zhou et al. (2018) to achieve entity grounding in a weakly-supervised fashion. Second, detailed visual attributes of entities (e.g., clothes and hair) are essential for the entailment task but are typically ignored by existing object grounding methods Shi et al. (2019); Zhou et al. (2018); Chen et al. (2020).

To extract the entity and its attributes from a textual statement, we employ a constituency parsing method Kitaev and Klein (2018). For example, in Fig. 6.2, "The woman wearing jeans" is an entity extracted from the corresponding statement "The woman wearing jeans has kids that have pony lessons in an hour". The extracted entities in a statement are denoted as E = {e_n}_{n=1}^{N_e}, where N_e is the total number of entities and e_n indicates the n-th entity. To ground an entity to the frames in which it occurs, we compute the matching score s(V, e_n) between video V and entity e_n as:

s(V, e_n) = \frac{1}{K} \sum_{k=1}^{K} \sigma(FC_1(v_k \| e_n)),   (6.1)

where FC_1 is a fully-connected layer, \sigma is the sigmoid activation, and \| denotes concatenation. We take the average of the scores over the K people as the entity-video matching score s(V, e_n). Following existing visual-textual matching work Li et al. (2019); Chen et al. (2020); Zhou et al. (2018), we formulate the weakly-supervised learning of grounding as:

L_{ga} = -\log(1 - s(V, e'_n)) - \log(s(V, e_n)),   (6.2)

where e'_n is a "negative entity" extracted from a randomly sampled statement of another video, which is different from e_n. Eq. 6.2 encourages the aligned video-entity pair (V, e_n) to be better matched and the unaligned pair (V, e'_n) to be less matched.

Different from weakly-supervised video grounding Chen et al. (2020); Zhou et al. (2018), the entailment task provides real/fake statements in pairs. Thus, we have the opportunity to obtain hard negative samples: the entity described in the fake statement but NOT described in the real statement. As shown in Fig. 6.3, the negative version is "a man wearing a blue cape", which is very similar to the positive one "a man wearing a red cape" but is contradicted by the video. We name such samples "golden negative entities" e^*_n and use them in training the grounding module:

L_{gb} = -\log(1 - s(V, e^*_n)) - \log(s(V, e_n)),   (6.3)

L_{gb} encourages the video V to match more to its aligned entity e_n and less to the golden negative entity e^*_n. To sum up, we train the grounding model with the grounding loss L_g, which balances the negative entities and the golden negative entities by \beta:

L_g = L_{ga} + \beta L_{gb},   (6.4)

During inference, if the matching score s(v_k, e_n) = \sigma(FC_1(v_k \| e_n)) between a person v_k and an entity e_n exceeds a threshold, we consider the k-th person to be e_n. The temporal grounding result is then used to guide the entailment judgment in Sec. 6.3.3.
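As a concrete illustration of Eqs. 6.1-6.4, the following minimal PyTorch-style sketch implements the matching score and the weakly-supervised grounding losses; the class and variable names are ours, and the feature dimensions are assumptions rather than the thesis implementation.

```python
import torch
import torch.nn as nn

class EntityGroundingSketch(nn.Module):
    """Matching score (Eq. 6.1) and grounding losses (Eqs. 6.2-6.4), illustrative only."""

    def __init__(self, person_dim=4096, entity_dim=256):
        super().__init__()
        self.fc1 = nn.Linear(person_dim + entity_dim, 1)  # FC_1 in Eq. 6.1

    def match(self, persons, entity):
        # persons: (K, person_dim) detected-person features; entity: (entity_dim,) entity embedding
        K = persons.size(0)
        pairs = torch.cat([persons, entity.unsqueeze(0).expand(K, -1)], dim=-1)
        per_person = torch.sigmoid(self.fc1(pairs)).squeeze(-1)  # s(v_k, e_n), thresholded at inference
        return per_person.mean(), per_person                     # s(V, e_n) is the average (Eq. 6.1)

    def grounding_loss(self, persons, pos_e, neg_e, gold_neg_e, beta=1.0, eps=1e-8):
        s_pos, _ = self.match(persons, pos_e)        # aligned entity e_n
        s_neg, _ = self.match(persons, neg_e)        # entity from another video's statement
        s_gold, _ = self.match(persons, gold_neg_e)  # entity from the paired fake statement
        l_ga = -torch.log(1 - s_neg + eps) - torch.log(s_pos + eps)   # Eq. 6.2
        l_gb = -torch.log(1 - s_gold + eps) - torch.log(s_pos + eps)  # Eq. 6.3
        return l_ga + beta * l_gb                                     # Eq. 6.4
```

In training, `neg_e` would be drawn from a statement of another video in the batch and `gold_neg_e` from the paired fake statement, as described above.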
6.3.3 Entailment Judgment with Grounded Evidence

Statements are usually related to a small subset of the video rather than the entire video. For example, in Fig. 6.2, the clause "kids that have school tomorrow" should be judged from the first sentence in the dialog. Thus, we utilize the entity grounding result to highlight the frames and the corresponding dialog in the temporal range where the entity occurs, since the frames and dialog are aligned by temporal boundaries. The highlighted frame and dialog embeddings are concatenated and marked as the key embeddings C_O and H_O. The model takes three streams in different modalities as input: video frames, dialog, and statement. We leverage the visually grounded evidence to make the model fixate its attention on the frames where the entity appears. Then, we fuse the multi-modal data and predict whether the statement is entailed or contradicted by the video.

To bridge the modality discrepancy between the video frames and the textual content, we use heterogeneous reasoning Zhang et al. (2019) to fuse the statement representation R with the different context embeddings, namely the video embeddings C, the dialog embeddings H, and the key embeddings C_O, H_O (see Fig. 6.4).

Figure 6.4 Our multi-task learning framework for entailment judgment and its explanation. Given the video and dialog embeddings, we use heterogeneous reasoning to fuse them and update the statement representation. The statement representation is then fed into two branches: the judgment branch, which predicts whether the statement is entailed or contradicted, and the explanation branch, which generates a heatmap showing the contribution of each word in the statement to making it fake. GT abbreviates ground-truth.

The heterogeneous reasoning is based on a graph convolution layer Chen et al. (2019b):

P_* = A_{* \to s} X_* W_{*s},   (6.5)

where * denotes one of the contexts among the video C, the dialog H, and the keys C_O, H_O, and the adjacency matrix A_{* \to s} contains the similarity between the statement R and the context embedding X_*. Eq. 6.5 projects the context X_* to an R-shaped embedding P_* through a learnable linear layer W_{*s}. Then, to avoid forgetting, we learn a gating function z_* via a linear operation W_*, b_* and a constrained (sigmoid) activation,

z_* = \mathrm{sigmoid}(W_* [R, P_*] + b_*),   (6.6)

and incorporate the projected embedding P_* of each context into the statement representation by:

Q_{*s} = z_* \odot R + (1 - z_*) \odot P_*.   (6.7)

Eq. 6.7 results in four statement representations Q_{cs}, Q_{hs}, Q_{cos}, Q_{hos}, specific to the video, dialog, and key contexts, respectively. \odot indicates the element-wise product.

Method                     Visual   Accuracy   Real    Fake    Human-written   Adv-sampled
Violin Liu et al. (2020)   C3D      67.23      74.66   57.73   67.60           61.99
Ours                       C3D      68.15      79.21   57.08   79.43           61.33
Violin Liu et al. (2020)   Resnet   67.60      79.10   56.10   84.49           59.15
Ours                       Resnet   68.39      79.52   57.25   84.94           60.11

Table 6.1 Entailment Accuracy Comparison. We report the Accuracy (%) on all statements, real statements, fake statements, human-written statements, and adversarially sampled statements. 2/3 of the fake statements are human-written and the remaining 1/3 are adversarially sampled. Note that the "Visual" column denotes the visual features used in the entailment judgment stage.

We concatenate them and update the statement representation as:

Q = [R; Q_{hs}; Q_{cs}; Q_{hos}; Q_{cos}],   (6.8)

The updated statement representation Q is passed through a function f, which contains a linear layer with one-dimensional output and a sigmoid activation, to predict the score of the statement being real.
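The gated fusion of Eqs. 6.5-6.7 can be sketched as follows. This is a minimal PyTorch-style illustration; the row-normalized similarity used for A_{*->s}, the single shared module across contexts, and the tensor shapes are our assumptions rather than the thesis implementation.

```python
import torch
import torch.nn as nn

class GatedContextFusion(nn.Module):
    """One heterogeneous-reasoning block (Eqs. 6.5-6.7), illustrative only."""

    def __init__(self, d=256):
        super().__init__()
        self.proj = nn.Linear(d, d, bias=False)  # W_{*s} in Eq. 6.5
        self.gate = nn.Linear(2 * d, d)          # W_*, b_* in Eq. 6.6

    def forward(self, R, X):
        # R: (N_l, d) statement word embeddings; X: (T or N_s, d) context (video, dialog, or key)
        A = torch.softmax(R @ X.t(), dim=-1)     # similarity-based adjacency A_{*->s} (normalization assumed)
        P = A @ self.proj(X)                     # Eq. 6.5: context projected onto the statement words
        z = torch.sigmoid(self.gate(torch.cat([R, P], dim=-1)))  # Eq. 6.6: gate
        return z * R + (1 - z) * P               # Eq. 6.7: gated update Q_{*s}

# Eq. 6.8 then concatenates R with the context-specific updates (separate modules per context in practice):
# fuse = GatedContextFusion(256)
# Q = torch.cat([R, fuse(R, H), fuse(R, C), fuse(R, H_O), fuse(R, C_O)], dim=-1)
```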
6.3.4 Explainable Entailment

The local explanation for judging a textual statement is defined as the contribution of each word, given in the form of a heatmap over the sentence. Our method regularizes the training of entailment judgment with its local explanation to promote the model's faithfulness and generalization ability (see the explanation branch in Fig. 6.4) Du et al. (2019). We encourage the entailment model to focus on the words that actually make the statement contradictory to the video, instead of memorizing dataset-specific artifacts. In the VIOLIN dataset Liu et al. (2020), more than half of the fake statements were collected by modifying a small subset of the corresponding real statement so that it is contradicted by the video, which makes the difference between real and fake statements subtle and alleviates the bias. We propose to exploit this subtle difference as a supervision signal for the local explanation.

During training, we have access to real/fake statements that come in pairs. For example, a pair of real and fake statements is: "A man in a black jacket gets off his white motorcycle" and "A man in a black jacket gets off the bell towel", respectively. By a simple "diff" operation between them, the contradictory items are "the bell towel". The indexes of the words that differ between the real and fake statements, obtained by this "diff" operation, are defined as the ground truth of the local explanation. We mark it as a binary vector o_e ∈ R^{N_l×1} whose length equals that of the statement.

Specifically, we formulate entailment judgment (Sec. 6.3.3) and its explanation as multi-task learning. The explanation branch in Fig. 6.4 takes the updated statement representation Q as input and generates a heatmap u_e ∈ R^{N_l} that indicates the contribution of each word to the model prediction f(Q). The explanation loss L_r is defined as:

L_r = \sum_{i=1}^{N_l} o_e^i (-\log u_e^i) + (1 - o_e^i)(-\log(1 - u_e^i)),   (6.9)

which aligns the generated heatmap u_e with the local explanation ground truth o_e. The overall objective function L_e is defined as:

L_e = L_{cls} + \lambda L_r,   (6.10)

in which L_{cls} is the binary cross-entropy loss for entailment judgment and \lambda balances entailment judgment against its explanation. If a statement is real, each word should be entailed by the premise; thus, during training, we only regularize the fake statements. During inference, if a statement is predicted as "contradiction", the explanation module is triggered to generate the heatmap for the statement.
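A minimal sketch of the diff-derived supervision and the explanation loss of Eq. 6.9 is given below; the positional word-by-word comparison is only one plausible way to implement the "diff" operation, and the function names are not from the thesis.

```python
import torch

def diff_ground_truth(real_tokens, fake_tokens):
    """Label the words of a fake statement that differ from its paired real statement (o_e)."""
    o_e = torch.zeros(len(fake_tokens))
    for i, tok in enumerate(fake_tokens):
        # Positional comparison; a mismatched (or extra) word is treated as a contradictory item.
        if i >= len(real_tokens) or tok != real_tokens[i]:
            o_e[i] = 1.0
    return o_e

def explanation_loss(u_e, o_e, eps=1e-8):
    """Eq. 6.9: per-word binary cross-entropy between the heatmap u_e and the diff labels o_e."""
    return (-(o_e * torch.log(u_e + eps) + (1 - o_e) * torch.log(1 - u_e + eps))).sum()

# Example on the pair from the text (the labels fall on "the", "bell", "towel"):
real = "A man in a black jacket gets off his white motorcycle".split()
fake = "A man in a black jacket gets off the bell towel".split()
o_e = diff_ground_truth(real, fake)
```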
6.4 Experiments

6.4.1 Dataset

To the best of our knowledge, Violin Liu et al. (2020) is the only dataset for the video entailment task. Violin contains 15,887 video clips, and each video clip is annotated with 3 pairs of real/fake statements, resulting in 95,322 statements in total. Statements vary in length and have 18 words on average. The first two fake statements of each video are human-written by modifying a small portion of the corresponding real statements; thus, the human-written real/fake statements have very subtle differences, such as a one- or two-word replacement. The third negative statement is adversarially sampled and differs more from the real statement. Following the original paper, we split the Violin dataset into 80% for training, 10% for validation, and 10% for testing.

6.4.2 Implementation Details

We use the pre-trained BERT Devlin et al. (2019b) features of both the dialog subtitles and the statements provided by Liu et al. (2020). For grounding, a Faster R-CNN framework Ren et al. (2015) with a VGG-Net Simonyan and Zisserman (2014) backbone pre-trained on COCO Lin et al. (2014) is applied to detect persons and extract their features across frames. The entity grounding threshold is set to 0.5. Both the visual and textual inputs are embedded into d dimensions for fusion, and d is set to 256. We sample the frames corresponding to the middle timestamp of each sentence for grounding. Adam with a learning rate of 1e-3 is used for optimization, and the constraint weight of the grounding module \beta is set to 1. We set the batch size to 8 in training. The entities in the statements of the other videos in the batch are sampled as negative samples for training the entity grounding module. For the contradiction explanation module, we only use the human-written samples for training. Adam with a learning rate of 1e-4 is used for optimization, and the constraint weight of multi-task learning \lambda is set to 1.

6.4.3 Comparison Methods

We compare our method with the only existing method proposed for the video entailment task, to the best of our knowledge. The Violin Liu et al. (2020) dataset provides a visual/language fusion model to address entailment judgment, in which the statement representation is jointly modeled with the video and subtitle by an attention-based fusion module. Experimental results on the Violin dataset are shown in Table 6.1. Our explainable entailment model, along with the grounded evidence given by our method, outperforms the previous video entailment method, because we precisely model the alignment between the video frames and the dialog based on grounded evidence. We also evaluate the influence of different visual features following Violin Liu et al. (2020). The results demonstrate that our method works for both image-based features ("Resnet") and motion-based features ("C3D").

Method   Accuracy   Real Accuracy   Fake Accuracy
v1       66.72      73.60           59.83
v2       67.60      75.50           59.71
v3       66.53      77.78           48.01
Ours     68.39      79.52           57.25

Table 6.2 Ablation Study of Entity Grounding for Entailment (%). v1: Removing the first contradiction judgment from the entity grounding module. v2: Removing the temporal grounding guidance on entailment judgment. v3: Removing L_{gb}.

6.4.4 Ablation Study

6.4.4.1 How does grounding help in entailment?

To exhibit the effectiveness of entity grounding in entailment judgment, we compare our proposed method with the following variants. (1) v1: Removing the first contradiction judgment from the entity grounding module; entity grounding is then only used to provide temporal guidance. (2) v2: Removing the temporal grounding guidance on entailment judgment; we substitute Eq. 6.8 with Q = [R; Q_{hs}; Q_{cs}], so each frame contributes to the statement without being highlighted. (3) v3: Removing L_{gb}; the grounding module is trained without golden negative statements.

Table 6.2 summarizes the results of these variants. Comparing "Ours" and v3, adding golden negative entities brings more than 1% performance improvement, as it improves the grounding quality. Comparing "Ours" and v2, the guidance from temporal grounding is necessary for making an accurate judgment. The contradiction judgment from the entity grounding module also brings a performance gain, as seen by comparing "Ours" to v1.

6.4.4.2 How does explanation help in entailment?

To explore the contribution of the add-on entailment explanation module, we conduct an ablation study with the following variants: (1) v4: Using both the adversarial statements and the human-written statements to train the explanation model. (2) v5: Removing the explanation regularizer L_r and using only L_{cls}. Table 6.3 shows the results of this ablation study. The proposed method outperforms the variant v5 without the explanation module by 0.83%, which shows that the multi-task learning boosts the performance of entailment judgment.
The outperformance over variant v4 shows that it is wise to train the explanation model with only the human-written samples instead of the adversarial samples, since an adversarial sample differs greatly from its paired real statement in sentence structure.

Method   Accuracy   Real Accuracy   Fake Accuracy
v4       67.65      78.75           56.54
v5       67.32      80.63           54.02
Ours     68.39      79.52           57.25

Table 6.3 Ablation Study of the Add-on Explanation Module for Entailment (%).

Method   Explanation Accuracy
v6       72.42
Ours     75.20

Table 6.4 Quantitative Result for Contradiction Explanation (%).

6.4.5 Contradiction Explanation Result

Since the real and fake statements are formed in pairs, we have access to the ground truth of the items (words or phrases) that make a statement contradictory to the video. For human-written fake statements, the annotators manually change a small portion of the words or phrases in the real statement, so the paired real and fake statements share similar grammar and have very tiny differences. Thus, the ground truth of the contradictory items can be obtained by a simple "diff" operation between a real/fake pair. In the adversarially sampled pairs, however, the real and fake statements mostly differ in structure. Thus, we only use the human-written pairs for training the explanation module, but we test on all statements, whether human-written or adversarially sampled.

We quantitatively evaluate the local explanation on the fake statements that are human-written. The evaluation metric is defined as the percentage of words that are correctly explained over the total number of words in the statement. The explanation results are shown in Table 6.4. We achieve 75.2% accuracy in contradiction explanation, which indicates that more than three-quarters of the fake words can be found by our explanation model. We also compare the proposed explanation method with a variant v6, which explains the entailment of the statement by finding the contradictory constituents instead of the contradictory words. The constituency parsing method Kitaev and Klein (2018) used to obtain entities in Sec. 6.3.2 is applied to extract constituents from the statement. The result demonstrates that a plain word-level explanation is better than using constituents.
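One plain reading of this word-level metric is sketched below; thresholding the heatmap at 0.5 is our assumption rather than a reported setting.

```python
import torch

def explanation_accuracy(u_e, o_e, thresh=0.5):
    """Fraction of words in a statement whose predicted explanation label
    (heatmap value above `thresh`) matches the diff-based ground truth o_e."""
    pred = (u_e > thresh).float()
    return (pred == o_e).float().mean().item()
```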
Figure 6.5 Visualization of the entailment judgment and its explanation with grounded evidence. Strikethrough indicates that the video does not contain the described entity, so the statement is judged as "contradiction". The contradictory items are marked by underlines with the predicted scores.

6.4.6 Explainable Entailment Result

Fig. 6.5 presents several entailment judgment examples produced by our method. Our model successfully grounds the described entities to specific regions and the relevant frames, even though no grounding annotation is provided during training. Our model is also resilient to entities in fake statements that are absent from the video: the two fake statements contain missing entities (e.g., "the blonde girl", "the woman wearing the golden dress"), marked by strikethrough, and are judged as fake in the grounding stage. The predicted fake items are marked by underlines with their explanation scores. We find that when a statement is correctly judged as fake, the explanation result is more reliable.

6.5 Summary

In this paper, we present a novel approach for video entailment and its local explanation. Entity grounding is incorporated into our task in two ways. First, we train a weakly-supervised entity video grounding module to judge a statement as "contradiction" if the statement contains an entity absent from the video. Then, if the entity is present in the video, we infer its temporal occurrence to guide the entailment judgment module to focus on the entity-relevant clips. In addition to entailment judgment, our method explains which words or phrases make the statement contradictory to the video. We formulate the local explanation as a regularizer on the decision-making of entailment to improve the model's faithfulness.
Extensive results on the Violin dataset demonstrate that the resulting model consistently outperforms the existing methods.

BIBLIOGRAPHY

Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., and Parikh, D. (2015). Vqa: Visual question answering. In CVPR, pages 2425–2433.
Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In EMNLP.
Camburu, O.-M., Rocktäschel, T., Lukasiewicz, T., and Blunsom, P. (2018). e-snli: Natural language inference with natural language explanations. In NIPS.
Chen, J., Bao, W., and Kong, Y. (2020). Activity-driven weakly-supervised spatio-temporal grounding from untrimmed videos. In ACM Multimedia, pages 3789–3797.
Chen, K., Gao, J., and Nevatia, R. (2018). Knowledge aided consistency for weakly supervised phrase grounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4042–4050.
Chen, L., Zhai, M., He, J., and Mori, G. (2019a). Object grounding via iterative context reasoning. In Proceedings of the IEEE International Conference on Computer Vision Workshops.
Chen, Z., Ma, L., Luo, W., and Wong, K.-Y. K. (2019b). Weakly-supervised spatio-temporally grounding natural sentence in video.
Condoravdi, C., Crouch, D., De Paiva, V., Stolle, R., and Bobrow, D. (2003). Entailment, intensionality and text understanding. In NAACL.
Dagan, I., Glickman, O., and Magnini, B. (2005). The pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop. Springer.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019a). BERT: Pre-training of deep bidirectional transformers for language understanding.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019b). BERT: Pre-training of deep bidirectional transformers for language understanding. In ACL, pages 4171–4186.
Do, V., Camburu, O.-M., Akata, Z., and Lukasiewicz, T. (2020). e-snli-ve-2.0: Corrected visual-textual entailment with natural language explanations. arXiv preprint arXiv:2004.03744.
Du, M., Liu, N., Yang, F., and Hu, X. (2019). Learning credible deep neural networks with rationale regularization. In ICDM.
Huang, D.-A., Buch, S., Dery, L., Garg, A., Fei-Fei, L., and Niebles, J. C. (2018). Finding "it": Weakly-supervised, reference-aware visual grounding in instructional videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Karpathy, A. and Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.
Kim, J., Ma, M., Kim, K., Kim, S., and Yoo, C. D. (2019). Progressive attention memory network for movie story question answering. In CVPR.
Kim, J., Ma, M., Pham, T., Kim, K., and Yoo, C. D. (2020). Modality shifting attention network for multi-modal video question answering. In CVPR.
Kim, K.-M., Choi, S.-H., Kim, J.-H., and Zhang, B.-T. (2018). Multimodal dual attention memory for video story question answering. In ECCV.
Kitaev, N. and Klein, D. (2018). Constituency parsing with a self-attentive encoder. In ACL, pages 2676–2686.
Lei, J., Yu, L., Bansal, M., and Berg, T. L. (2018). Tvqa: Localized, compositional video question answering.
Lei, J., Yu, L., Berg, T. L., and Bansal, M. (2019). Tvqa+: Spatio-temporal grounding for video question answering.
Lei, J., Yu, L., Berg, T. L., and Bansal, M. (2020). Tvr: A large-scale dataset for video-subtitle moment retrieval.
arXiv preprint arXiv:2001.09099.
Li, K., Zhang, Y., Li, K., Li, Y., and Fu, Y. (2019). Visual semantic reasoning for image-text matching. In ICCV, pages 4654–4662.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In ECCV.
Liu, J., Chen, W., Cheng, Y., Gan, Z., Yu, L., Yang, Y., and Liu, J. (2020). Violin: A large-scale dataset for video-and-language inference. In CVPR.
MacCartney, B. and Manning, C. D. (2009). An extended model of natural logic. In Proceedings of the Eighth International Conference on Computational Semantics.
Na, S., Lee, S., Kim, J., and Kim, G. (2017). A read-write memory network for movie story understanding. In ICCV.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS.
Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., and Schiele, B. (2016). Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pages 817–834. Springer.
Shi, J., Xu, J., Gong, B., and Xu, C. (2019). Not all frames are equal: Weakly-supervised video grounding with contextual similarity and visual clustering losses. In CVPR, pages 10444–10452.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Suhr, A., Zhou, S., Zhang, A., Zhang, I., Bai, H., and Artzi, Y. (2019). A corpus for reasoning about natural language grounded in photographs. In ACL.
Tapaswi, M., Zhu, Y., Stiefelhagen, R., Torralba, A., Urtasun, R., and Fidler, S. (2016). Movieqa: Understanding stories in movies through question-answering. In CVPR, pages 4631–4640.
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: A neural image caption generator. In CVPR.
Xie, N., Lai, F., Doran, D., and Kadav, A. (2019). Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706.
Young, P., Lai, A., Hodosh, M., and Hockenmaier, J. (2014). From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL.
Zhang, C., Song, D., Huang, C., Swami, A., and Chawla, N. V. (2019). Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
Zhou, L., Kalantidis, Y., Chen, X., Corso, J. J., and Rohrbach, M. (2019). Grounded video description. In CVPR.
Zhou, L., Louis, N., and Corso, J. J. (2018). Weakly-supervised video object grounding from text by loss weighting and object interaction.
Zhou, Y., Wang, M., Liu, D., Hu, Z., and Zhang, H. (2020). More grounded image captioning by distilling image-text matching model. In CVPR.

CHAPTER 7
CONCLUSIONS AND FUTURE WORK

7.1 Summary of the thesis

In this thesis, we have posed and investigated a range of fundamental problems in long-form video understanding. We developed a series of learning frameworks that address representation learning, temporal dependency modeling, and trustworthy video understanding. Through extensive investigations, we can draw the following conclusions.

First, we propose a novel framework to solve VideoQA. Our method emphasizes the importance of temporality reasoning. To this end, we find it worth revisiting optical flow: although flow has become less common in atomic action recognition, it is still effective for long-horizon temporality.
Then, we propose action-centric contrastive learning that makes both the video and text representations informative about actions, and we fine-tune the VideoQA model via a novel temporal sensitivity-aware confusion loss to mitigate potential static bias. Our ATM method is demonstrated to be superior to all existing VideoQA methods on multiple benchmarks and shows faithful temporality reasoning under a new metric.

Second, we have proposed a novel sequential relational anticipation model (SRAM) to predict group activity given only the beginning frames of an activity execution. Our model captures the complex relational dynamics of multiple people in the observed frames and then anticipates the group representations, including group activity features and position features. A novel sequential decoder progressively anticipates the group representations through several unrolling stages. Extensive results on two datasets demonstrate that our method significantly outperforms the state-of-the-art methods. The results also validate that progressive anticipation over multiple unrolling stages facilitates group activity prediction, and that modeling and predicting people's positions further improves performance.

Third, we present GateHUB for online action detection in untrimmed streaming videos. It consists of novel designs, including the Gated History Unit (GHU), Future-augmented History (FaH), and a background suppression loss, to leverage history more informatively and reduce false positives in current-frame prediction. GateHUB achieves higher accuracy than all existing online action detection methods and is more efficient than the previous best method. Moreover, its optical-flow-free variant is 2.8× faster than previous methods that require both RGB and optical flow, while obtaining higher or comparable accuracy.

Fourth, we investigate spatio-temporal grounding in untrimmed videos with frequent visual inconsistency in a weakly-supervised manner. We develop two novel MIL ranking losses for the spatial and temporal domains. Furthermore, to bridge the granularity gap between the coarse text information and the detailed visual information, we introduce an activity-driven object state encoding module to enhance the textual representation. Experiments on two popular datasets demonstrate the superiority of our method and its ability to generalize to other datasets with unseen queries.

Fifth, we present a novel approach for video entailment and its local explanation. Entity grounding is incorporated into our task in two ways. First, we train a weakly-supervised entity video grounding module to judge a statement as "contradiction" if the statement contains an entity absent from the video. Then, if the entity is present in the video, we infer its temporal occurrence to guide the entailment judgment module to focus on the entity-relevant clips. In addition to entailment judgment, our method explains which words or phrases make the statement contradictory to the video. We formulate the local explanation as a regularizer on the decision-making of entailment to improve the model's faithfulness. Extensive results on the Violin dataset demonstrate that the resulting model consistently outperforms the existing methods.

7.2 Future work

7.2.1 Leveraging temporal reasoning for social good

In addition to videos, temporal structure is also present in many specialized domains.
For example, in biomedicine, doctors can infer that "a tumor is decreasing in size under chemotherapy treatment" by comparing a prior and a current radiology image. I am interested in enhancing downstream applications such as report generation by leveraging temporal structure. Broadly speaking, I am committed to grounding this technique in contexts of social good, such as applications in healthcare, education, social inclusion, and biomedicine. I am passionate about working with front-line researchers in the medical field, computational social science, art, and data mining to improve these applications by integrating temporal semantic priors, as well as by learning in a self-supervised manner with complementary temporal signals.

7.2.2 Human-AI interactive systems to perceive accessibility

I believe video understanding has a natural role in assisting people with disabilities. A significant case is adapting computer vision techniques to overcome visual challenges for blind people, so that a blind person can perceive the surrounding physical world by wearing a camera with on-device computation. Deaf people use visual language as a means of communication, which could benefit greatly from video understanding techniques: achieving automatic sign language recognition and localization enables the construction of sign language dictionaries and real-time communication with one's surroundings. To this end, I will investigate video+language techniques to advance the availability of these applications. I am excited to work with researchers in Human-Computer Interaction and NGOs to understand the real challenges faced by people from neglected groups.

7.2.3 Holistic trustworthy computer vision

In the era of multi-modal foundation models and large language models, my vision is to attain holistic trustworthiness in machine learning models. In addition to the bias mitigation and explainability that we have studied, trustworthiness includes privacy, inclusiveness, security, fairness, and other core targets. For example, recognizing human actions requires the model to preserve the privacy of sensitive attributes such as gait and identity in the training data, be inclusive of different data nodes, and make fair decisions. Although videos constitute a substantial portion of computer vision data, developing trustworthy video models is much more understudied than for their image or graph counterparts. I plan to train researchers to think about the risks and challenges of real-world AI systems, and I want to collaborate with people from cybersecurity and industry in these domains.