GROUNDED COMPOSITIONAL CONCEPT LEARNING

By

Guangyue Xu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2024

ABSTRACT

Humans learn concepts in a grounded and compositional manner. Such compositional and grounding abilities enable humans to understand an endless variety of scenarios and expressions. Although deep learning models have pushed performance to new limits on many Natural Language Processing and Computer Vision tasks, we still know little about how these models process compositional structures and about their potential to accomplish human-like meaning composition. The goal of this thesis is to advance the current compositional generalization research on both the evaluation and the design of learning models. In this direction, we make the following contributions.

Firstly, we introduce a transductive learning method that utilizes unlabeled data to learn the distribution of both seen and novel compositions. Moreover, we utilize the cross-attention mechanism to align and ground the linguistic concepts into specific regions of the image to tackle the grounding challenge. Unlike traditional learning, we use episodic training, where each training item consists of one image and sampled positive and negative compositional labels. We select the image's compositional label by computing their matching scores. Our empirical results show that combining episodic training and transductive learning does help compositional learning.

Secondly, we develop a new prompting technique for compositional learning that considers the interaction between element concepts. In our proposed technique, called GIPCOL, we construct a textual input that contains rich compositional information when prompting the foundation vision-language model. We use the CLIP model as the pre-trained backbone vision-language model and improve its compositional zero-shot learning ability with our novel soft-prompting approach. GIPCOL freezes the majority of CLIP's parameters and only learns CLIP's word embedding layer through a graph neural network. By concatenating the learnable soft prompt and the updated word embeddings, GIPCOL achieves better results compared with other prompting-based methods.

Thirdly, since retrieval plays a critical role in human learning, our work studies how retrieval can help compositional learning. We propose MetaReVision, a new retrieval-enhanced meta-learning model to address the visually grounded compositional concept learning problem. Given an image with a novel compositional concept, MetaReVision first uses a retrieval module to find relevant items from the training set. Then it constructs an episode in which the retrieved items form the support set and the test item forms the query set. The retrieved support set mimics the primitive concept learning scenario, while the query set encourages compositional strategy learning through meta-learning's bi-level optimization objective. The experimental results show that such a retrieval-enhanced meta-learning framework helps the vision-language model's compositional learning. Moreover, we create two new benchmarks, CompCOCO and CompFlickr, for the evaluation of grounded compositional concept learning.

Finally, we evaluate large generative vision-and-language models in solving compositional zero-shot learning within the in-context learning framework.
We highlight their shortcomings and propose retriever and ranker modules to improve their performance in addressing this challenging problem. These two modules select the most informative in-context examples in their most effective order to guide the backbone generative model. Our approach is novel in the context of grounded compositional learning, and our experimental results show improved performance compared to basic in-context learning.

Copyright by GUANGYUE XU 2024

ACKNOWLEDGEMENTS

First and foremost, I am tremendously grateful to my advisors Dr. Parisa Kordjamshidi and Dr. Joyce Y. Chai for their continuous support and guidance. They shared with me how to think critically, explore new problems, ask good questions, and do good research. All of these experiences will have a great influence on my whole life. Besides, their great insights into the domains of large language models and grounded compositional learning have always shed light on the problems I have been working on. Without their continuous advice, inspiration and guidance during my PhD study, this work would have been impossible.

I would also like to thank my dissertation committee members: Dr. Xiaoming Liu and Dr. Taosheng Liu. I greatly appreciate their valuable feedback at every step of my PhD journey.

I'm very happy to have had the opportunity to collaborate with an amazing group of students and researchers: Dr. Shaohua Yang and Dr. Qiaozi Gao provided great suggestions and directions when I started my research career as a PhD student. Thanks to Dr. Sari Saba-Sadiya for his great efforts and enlightening comments. I also appreciate my co-authors on various papers. I would like to thank all my friends at MSU, who made my time at MSU enjoyable.

Finally, I dedicate this thesis to my family: my parents Pingxian Xu and Jie Zhu, my parents-in-law Junming Gu and Aiju Zhang, my sons Yufeng Xu and Oscar Gu, and my cherished wife Yingjun Gu, for your years of unwavering love and support.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION  1
  1.1 Motivation  1
  1.2 Compositional Learning  1
  1.3 Challenges of Compositional Learning  3
  1.4 Contributions of the Thesis  5
  1.5 Organization of Dissertation  6

CHAPTER 2 BACKGROUND AND RELATED WORK  8
  2.1 Compositional Zero-Shot Learning  8
  2.2 Large Foundation Models  9
  2.3 Parameter-Efficient Paradigm For Applying Large Models  11
  2.4 Meta Learning  13

CHAPTER 3 ZERO-SHOT COMPOSITIONAL CONCEPT LEARNING  14
  3.1 Introduction  14
  3.2 Related Work  16
  3.3 Approach  18
  3.4 Experiments  23
  3.5 Conclusion  28
CHAPTER 4 GIPCOL: GRAPH-INJECTED SOFT PROMPTING FOR COMPOSITIONAL ZERO-SHOT LEARNING  29
  4.1 Introduction  29
  4.2 Related Work  31
  4.3 Problem Formulation  33
  4.4 GIPCOL  34
  4.5 Experiments  40
  4.6 Conclusion  49

CHAPTER 5 METAREVISION: META-LEARNING WITH RETRIEVAL FOR VISUALLY GROUNDED COMPOSITIONAL CONCEPT ACQUISITION  50
  5.1 Introduction  50
  5.2 Related Works  53
  5.3 Grounded Compositional Concept Learning (GCCL)  54
  5.4 Meta-Learning with Retrieval for GCCL (MetaReVision)  57
  5.5 Experiments  63
  5.6 Conclusions and Future Work  67
  5.7 Limitations  68

CHAPTER 6 GENCZSL: GENERATIVE COMPOSITIONAL ZERO-SHOT CONCEPT RECOGNITION  69
  6.1 Introduction  69
  6.2 Preliminaries  70
  6.3 GenCZSL: Generative In-Context Learning for CZSL  72
  6.4 Experiments  76
  6.5 Conclusion and Future Work  77

CHAPTER 7 CONCLUSION AND FUTURE WORK  78
  7.1 Summary of Contributions  78
  7.2 Future Directions  79

BIBLIOGRAPHY  82

CHAPTER 1
INTRODUCTION

1.1 Motivation

Humans acquire language in a compositional and grounded manner. Through their compositional and grounding abilities, they can understand new scenes and combine known words in novel ways to describe their perceptual world, even though these novel compositions may never have been seen before. It would be desirable for intelligent systems to have such compositional generalization ability [Lake et al., 2017]. It is also widely believed that effective semantic representations need to have both compositionality and groundedness as minimum requirements [Carnap, 1988, Baroni and Zamparelli, 2010, Miller and Charles, 1991]. However, recent neural models struggle to generalize outside their training distribution and have difficulties using observed words in a compositional manner, especially in novel situations [Kim and Linzen, 2020].

In recent years, there has been remarkable advancement in large-scale neural network models that can integrate information from both natural language text and visual data. Despite their impressive progress, the extent to which such large-scale neural network models can effectively encode compositional representations of learned element concepts is still an open question.
For instance, correctly identifying a sliced apple when this combination has not been observed, by reasoning over its constituents, sliced and apple, is a challenge for such models [Hupkes et al., 2020, Hermann, 2014, Lake et al., 2015a]. The research conducted in this thesis is an effort to design novel architectures that address some of the challenges of compositional generalization when the models are required to recognize novel compositions of objects and attributes in the visual modality and express them in natural language.

1.2 Compositional Learning

Compositionality is considered one of the key elements of human intelligence and is explained by [Partee et al., 1995] as: the meaning of a whole is a function of the meanings of the parts and of the way they are syntactically combined.

However, in terms of computational modeling, compositional learning can have multiple aspects, including primitiveness, systematicity, productivity, and substitutivity, which are identified in the cognitive science literature. These aspects are explained in Table 1.1. The compositional abilities of computational models have been widely studied through different lenses using a variety of benchmarks [Chang et al., 2016, Gao et al., 2023, Mancini et al., 2021]. Figure 1.1 provides an overview of the aspects of compositional learning, recently proposed compositional benchmarks and the related modalities. These benchmarks are proposed to evaluate the compositional ability of different neural networks within different modalities, including natural language processing (NLP), Computer Vision (CV) and the vision-language field. For example, SCAN is a purely textual compositional learning task that requires models to generate action sequences from compositional navigation commands. In the vision-language field, many benchmarks are proposed to measure models' compositional ability via downstream tasks such as question answering, image-text retrieval, action generation and compositional zero-shot learning (CZSL). In this thesis, we mainly focus on Compositional Zero-Shot Learning benchmarks and study primitiveness and systematicity in compositional learning.

Aspect          Description
Primitiveness   Concepts seen in isolation during training can be applied compositionally at test time.
Systematicity   Generalize to unseen compositions of known elements.
Productivity    Generate longer sequences than those seen in the training data.
Substitutivity  Model robustness when replacing words with synonyms.

Table 1.1 Different aspects of compositional learning identified in the compositional generalization literature. In this thesis, we focus on the primitiveness and systematicity aspects of compositional learning.

Figure 1.1 The aspects of compositional learning are shown at the top of the figure. Examples of benchmarks and datasets from different modalities are shown in the rest of the figure. The compositional aspects and datasets marked in red are the ones we focus on in this thesis.

An example of the compositional zero-shot learning problem we focus on is shown in Figure 1.2. As shown in Figure 1.2(a), suppose the training set has images with the compositional concepts sliced-tomato, sliced-cake, ripe-apple, peeled-apple, etc. Given a new image, our goal is to assign the novel compositional concept sliced-apple to the image by composing the element concepts, sliced and apple, learned from the training data. Although sliced and apple have appeared with other objects or attributes, the combination of this attribute-object pair is not observed in the training set. Representative CZSL datasets include MIT-States [Isola et al., 2015a], UT-Zappos [Yu and Grauman, 2014] and C-GQA [Hudson and Manning, 2019].
Based on these datasets, we further propose two more CZSL benchmarks, CompCOCO and CompFlickr, as shown in Figure 1.2(b). Different from the previous CZSL benchmarks, these two datasets add more textual context to test the compositional learning ability of current deep learning models, especially large vision-language models (VLMs).

Figure 1.2 Compositional learning examples for MIT-States (a) and CompCOCO (b).

1.3 Challenges of Compositional Learning

Challenge 1: Zero-Shot Learning. Despite the success of deep learning (DL) models, traditional DL models require training on a massive amount of labeled data for each class. However, the distribution of compositional concept samples naturally follows a zero-shot setting: novel compositional concepts do not appear in the training phase. In this respect, collecting large-scale labeled samples to address compositional learning is a challenge. Because no training data are available for the novel pairs, the learned models will be biased toward seen pairs.

Challenge 2: Grounded Concept Learning. The second challenge is grounding ability. Grounding means the ability to connect words to the real-world entities, events, and ideas that they refer to, and it is clearly necessary and fundamental for compositional concept learning. Most language models, although trained on huge amounts of data, are not given explicit alignments between the words occurring in natural language expressions and their real-world manifestations. After the introduction of self-attention, more and more works use attention weights as an indicator of grounding. However, due to the complexity of the self-attention mechanism and the large number of layers and heads in self-attention implementations, it is difficult to enforce attention to represent grounding during training.

Challenge 3: Capturing the rules of composition. Capturing compositionality and learning the principles of composition in language has been a long-term challenge for neural networks. Most of the prior work focuses on designing new architectures guided by explicit compositional structures [Fodor and Pylyshyn, 1988a, Andreas, 2019, Huynh and Elhamifar, 2020]. However, such designs are customized to the task setting and have limited generalization ability. Current models rely on large amounts of data to capture the encoded patterns of compositionality. Such a framework has difficulty handling out-of-distribution compositions. In order to address the compositional problem, the models should learn both 1) primitive concepts and 2) the rules for composing them.
Current studies show the limitations of data-driven models in generalizing over composition rules.

1.4 Contributions of the Thesis

Contribution 1: Transductive Episodic Training. To address challenges 1 and 3 in CZSL, we propose an episode-based training scheme. We perform model optimization over batches of tasks instead of batches of data. Within this framework, we treat each composition in the training set as a compositional learning task (a minimal episode-construction sketch is given at the end of this section). Through training over multiple tasks, the model is expected to progressively accumulate knowledge on compositional generalization rules and learn the unseen compositions based on the seen ones within each episode. In addition, we utilize unlabeled data to augment the supervision for episodic learning and compositional generalization in a transductive learning framework. Experiments have shown the importance of the transductive learning setting, which increases pair accuracy by 1.5%.

Contribution 2: Meta-Learning. To further address the challenges of grounding and composition-rule learning, we develop a meta-learning framework, which we call MetaReVision, to train vision-and-language models (VLMs) for compositional concept learning. Specifically, MetaReVision uses DG-MAML (Domain-Generalization Model-Agnostic Meta-Learning) proposed by [Li et al., 2018], a variant of Model-Agnostic Meta-Learning (MAML) [Finn et al., 2017], to learn the primitive concepts and the compositional strategy by training through episodes in a more principled way. In MetaReVision, each episode consists of a support set and a query set. The support set mimics the primitive concept learning scenario, while the query set encourages compositional strategy learning through DG-MAML's bi-level optimization objective.

Contribution 3: Prompting VLMs for Compositional Learning. Given the huge influence of large pre-trained VLMs on various vision and language tasks [Zhu et al., 2023], our third contribution is to effectively utilize them for compositional concept learning using prompting methods. We propose a new prompting approach, called GIPCOL, to inject information about the composition of objects and attributes into the prompt design. Specifically, we use CLIP as the large VLM backbone in our experiments and change its hard prompting strategy by combining learnable prefix vectors and element concept vectors. In particular, we achieved SoTA AUC results on all three benchmarks.

Contribution 4: In-Context Learning (ICL) for CZSL. Although the generative large models represented by the GPT series [Brown et al., 2020, Achiam et al., 2023] have achieved huge success on many downstream tasks within the in-context learning framework, the evaluation and application of such models in a multi-modal problem setting is not straightforward, especially in the zero-shot setting. The main challenges include 1) adapting the current evaluation benchmarks for a sound evaluation of generative large language models for zero-shot compositional learning and 2) improving foundation models for better compositional generalization by introducing in-context example retriever and ranker modules. To address the above challenges, we propose GenCZSL, which introduces retriever and ranker modules. The retriever selects informative examples, and the ranker further orders the retrieved examples to help Flamingo recognize the novel compositions.
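To make the episodic setup in Contribution 1 concrete, the following is a minimal sketch of how one training episode could be constructed for an image and its attribute-object label. It is an illustration under our own naming conventions, not the thesis's exact code; the default of 50 sampled negatives follows the setting reported in Chapter 3.

```python
import random

def build_episode(image, positive_pair, seen_pairs, n_negatives=50):
    """Build one episode: an image, its positive (attribute, object) label,
    and n_negatives other seen pairs acting as negative compositional labels."""
    candidates = [p for p in seen_pairs if p != positive_pair]
    negatives = random.sample(candidates, n_negatives)
    # Index 0 marks the positive pair; the model is trained to rank it
    # above the sampled negatives within the episode.
    return {"image": image, "pairs": [positive_pair] + negatives, "target": 0}

# Example usage:
# episode = build_episode(img, ("sliced", "tomato"), seen_pairs)
```

Optimizing over many such episodes, rather than over individual labeled examples, is what allows the model to accumulate knowledge about how element concepts recombine.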
1.5 Organization of Dissertation

The remainder of this dissertation is organized as follows. In Chapter 2, we introduce the background and previous work on which this dissertation builds. In Chapter 3, we present our work on recognizing compositional attribute-object concepts in the zero-shot setting. We propose an episode-based cross-attention (EpiCA) network, which combines the merits of the cross-attention mechanism and the episode-based training strategy to recognize novel compositional concepts and aims to address the grounding and compositional challenges. In Chapter 4, we present GIPCOL, which explores the compositional zero-shot learning (CZSL) ability of large pre-trained vision-language models (VLMs) within the prompt-based learning framework. GIPCOL provides a general prompting-based framework for compositional learning and makes two design choices: first, it uses soft prompting instead of hard prompting to inject learnable parameters for compositional learning; second, it uses a soft-embedding layer to learn primitive concepts in different combinations. In Chapter 5, we present MetaReVision, a meta-learning framework to train vision-and-language models for compositional concept learning. The episodic training and the bi-level optimization within the meta-learning framework encourage the gradients learned from the support set to be beneficial for compositional concept learning in the query set. In Chapter 6, we explore the possibility of utilizing the in-context learning (ICL) paradigm for compositional learning. ICL provides foundation models, like GPT4 [Achiam et al., 2023] or LLaMa [Touvron et al., 2023], with a few labeled examples as input before asking them to make a prediction on a new example. Different from previous works, we address in-context example selection for ICL. In Chapter 7, we conclude by summarizing our contributions and discussing directions for future work.

CHAPTER 2
BACKGROUND AND RELATED WORK

2.1 Compositional Zero-Shot Learning

Compositional learning is a key component of human intelligence and has been widely studied in the deep learning field in the contexts of human-object interaction (HOI) [Kato et al., 2018, Hou et al., 2020], compositional zero-shot learning [Nagarajan and Grauman, 2018b, Misra et al., 2017a], natural language processing [Lake, 2019, Nye et al., 2020] and language acquisition [Jin et al., 2020, Surís et al., 2020]. In this thesis, we study the Compositional Zero-Shot Learning (CZSL) problem, a topic that falls into the language acquisition category. As a specific zero-shot learning (ZSL) problem, compositional zero-shot learning (CZSL) tries to learn complex concepts by composing element concepts. Previous solutions can mainly be categorized as:

• Classifier-based methods train classifiers for element concepts and combine the element classifiers to recognize compositional concepts [Chen and Grauman, 2014, Misra et al., 2017a, Li et al., 2019a].

• Metric-based methods learn a shared space by minimizing the distance between the projected visual features and concept features [Nagarajan and Grauman, 2018b, Li et al., 2020b].

• Generative-based methods learn to generate samples from the semantic information and transfer CZSL into a traditional supervised classification problem [Wei et al., 2019].
• Prompting-based methods [Nayak et al., 2022] try to elicit compositional knowledge from large vision-language models such as CLIP [Radford et al., 2021] by constructing a textual prompting input.

In our work, we address the key challenges in CZSL, including 1) grounding, 2) compositional rule learning and 3) the zero-shot setting, and, building on different vision-language models (VLMs), we propose novel parameter-efficient methods to solve the CZSL problem. Moreover, we contribute two more realistic datasets, CompFlickr and CompCOCO, which provide more textual information for CZSL. The related problem settings are as follows:

• For the UT-Zappos, MIT-States and C-GQA datasets, the textual input acts as the class label. In this CZSL setting, given an image, we need to retrieve or generate the most relevant pair label. As a zero-shot learning problem, the pair label has never been seen during training, and therefore we can formulate CZSL as an open-vocabulary problem. CLIP [Radford et al., 2021] and Flamingo [Alayrac et al., 2022] can be utilized for their zero-shot or few-shot learning ability.

• For the CompFlickr and CompCOCO datasets, because we have more textual input as contextual information, we formulate CZSL as a masked token prediction problem (a simplified sketch of this formulation is given at the end of Section 2.2.1). We can then utilize VLMs, like VL-BERT [Su et al., 2020], as multi-modal encoders to predict the masked compositional concepts. In this setting, we modify the VLMs in two ways: 1) adding a retrieval module that retrieves related element concepts to construct episodes, and 2) meta-training such VLMs [Finn et al., 2017] to accumulate compositional knowledge from the constructed compositional tasks.

2.2 Large Foundation Models

Large-scale datasets, self-supervised training techniques, and the attention mechanism [Vaswani et al., 2017a] have led to the emergence of powerful uni-modal encoders for images [Dosovitskiy et al., 2020], videos [Arnab et al., 2021], language [Devlin et al., 2019] and other modalities [Girdhar et al., 2022]. These uni-modal encoders form the basis for large vision-language models (VLMs). Popular VLMs such as CLIP [Radford et al., 2021] and ALIGN [Jia et al., 2021] are trained using the above uni-modal encoders and fuse multi-modal information from massive web datasets in the form of images and alt-text. In this section, we introduce the three VLMs most related to this thesis: VL-BERT [Su et al., 2020], CLIP [Radford et al., 2021] and Flamingo [Alayrac et al., 2022].

2.2.1 Generic Vision-Language Encoder: VL-BERT

VL-BERT [Su et al., 2020] is designed to extract generic representations for visual-linguistic tasks through pre-training tasks, including masked language modeling (MLM) and masked RoI classification. Such pre-trained models are expected to have a joint understanding of image features and the language phrases that correspond to them. Specifically, after extracting visual tokens from images using Fast R-CNN [Girshick, 2015] and textual tokens from texts, VL-BERT adopts the Transformer model [Vaswani et al., 2017a] as the backbone to extract multi-modal representations from the massive-scale Conceptual Captions dataset [Sharma et al., 2018], together with a text-only corpus. VL-BERT follows the pre-training and fine-tuning framework: after obtaining the generic representation for vision-language tasks, the model is fine-tuned for each downstream task, such as Visual Question Answering (VQA) [Antol et al., 2015], Visual Commonsense Reasoning (VCR) [Zellers et al., 2019] and the Referring Expression task [Yu et al., 2016].
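To make the masked-token-prediction formulation mentioned in Section 2.1 concrete, the sketch below scores candidate attribute-object fillers for masked positions in a caption. It is a text-only simplification that uses an off-the-shelf BERT MLM head as a stand-in: a multi-modal encoder such as VL-BERT would additionally consume the image's RoI features, and the caption and candidate pairs here are illustrative rather than drawn from CompCOCO.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

# The compositional concept (e.g., "red chairs") is masked out of the caption.
caption = "a man in black sits at a table with two [MASK] [MASK]"
candidates = [("red", "chairs"), ("old", "chairs"), ("sliced", "apples")]

inputs = tokenizer(caption, return_tensors="pt")
mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    log_probs = model(**inputs).logits[0].log_softmax(dim=-1)

def pair_score(attr, obj):
    # Sum of log-probabilities of the attribute and object at the two masked slots.
    # (Single-wordpiece candidates assumed; multi-piece words need extra handling.)
    attr_id, obj_id = tokenizer.convert_tokens_to_ids([attr, obj])
    return (log_probs[mask_pos[0], attr_id] + log_probs[mask_pos[1], obj_id]).item()

best_pair = max(candidates, key=lambda p: pair_score(*p))
```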
In our work, we aim to explore the compositional ability of VL-BERT's vision-language representation. To this end, we propose two new benchmarks for testing compositional learning, and our experiments on these two benchmarks show that the extracted representations have difficulty representing novel compositions. Furthermore, we propose a new framework that combines retrieval and meta-learning to enhance the compositional ability of VL-BERT and similar models, which is detailed in Chapter 5.

2.2.2 Contrastive Image-Text Pretraining: CLIP

The recently released contrastively trained vision-language model CLIP [Radford et al., 2021] has enabled a diverse set of downstream applications at the intersection of the Computer Vision (CV) and Natural Language Processing (NLP) fields in the form of language-guided vision processing [Huang et al., 2023a, Zhang et al., 2023, Huang et al., 2023b]. Pre-trained on 400 million image-text pairs, CLIP-based models have demonstrated remarkable zero-shot capabilities [Ma et al., 2023a]. Moreover, through its pre-trained visual encoder, textual encoder and the latent space that aligns images and texts, CLIP supports many downstream application scenarios. 1) In CV, its pre-trained visual and textual encoders have been used for semantic segmentation, object detection and image captioning [Rao et al., 2022a]. 2) In diffusion models, CLIP has been used as a loss and as an automated evaluation metric [Hessel et al., 2021]. 3) As a feature extractor, CLIP has been incorporated into architectures for various tasks, such as video summarization [Xu et al., 2021]. In our work, we aim to study and improve CLIP's compositional ability using the prompting paradigm. We conduct experiments on three compositional datasets, MIT-States [Isola et al., 2015a], UT-Zappos [Yu and Grauman, 2014] and C-GQA [Hudson and Manning, 2019], and find that improved prompt design can help CLIP's compositional learning, which is detailed in Chapter 4.

2.2.3 Few-Shot Vision-Language Model: Flamingo

In order to utilize the increasing ability of large VLMs, in-context learning (ICL) has become a new paradigm for multi-modal tasks [Brown et al., 2020]. However, most VLMs accept only one image and utilize this single input image for downstream tasks [Bugliarello et al., 2020, Bagad et al., 2023]. Such VLMs cannot be directly used for ICL-based compositional learning, since ICL requires multiple images as demonstration input. The recently proposed Flamingo [Alayrac et al., 2022] can consume sequences of arbitrarily interleaved visual and textual data as input in the few-shot setting. It introduces two components to handle the arbitrarily interleaved input: 1) a perceiver module that uses query vectors to fuse and compress the visual input and produce a small, fixed number of visual tokens per image, and 2) a cross-attention mechanism to fuse the multi-modal information from the query vectors. However, different from Flamingo's few-shot application, we need to retrieve and construct episodes for compositional learning. We discuss the episode construction and optimization in Chapter 6.

2.3 Parameter-Efficient Paradigm For Applying Large Models

Full-model fine-tuning (FT) of large language models (LLMs) is expensive and could degrade the knowledge acquired during the large-scale pre-training phase [Sun et al., 2023]. Therefore, more parameter-efficient techniques have recently been explored to increase the accessibility of large models.
In this section, we discuss parameter-efficient fine-tuning methods and their application to compositional learning.

2.3.1 Prompt-Based Learning

Prompt-based learning is an emerging technique originating from the NLP field. Different from traditional supervised fine-tuning techniques, prompting-based methods freeze most parts of a large pre-trained NLP model, like T5 [Raffel et al., 2020] or GPT [Brown et al., 2020], and concatenate a small number of additional learnable parameters to the test input, which learn to solve downstream tasks [Liu et al., 2021], as shown in Equation 2.1:

$$ \mathrm{Input}_{\mathrm{PT}} = \mathrm{concat}(P;\ X_{\mathrm{test}}) \tag{2.1} $$

where P denotes the learnable prompt embeddings. Because of these learnable embeddings, prompt-based learning requires access to a training set X_train for the target downstream task.

With the prevalence of large pre-trained vision-language (VL) models, prompting-based methods have been introduced to explore the multi-modal knowledge encoded in such VLMs [Tsimpoukelli et al., 2021, Radford et al., 2021, Jin et al., 2021]. Recently, [Zhou et al., 2022a] and [Zhou et al., 2022b] prompted CLIP by prepending learnable parameters to the text input for low-resource image classification and achieved satisfactory results. Meanwhile, [Nayak et al., 2022] conducted compositional learning by modifying CLIP's original vocabulary embeddings and showed the possibility of prompting VL models for compositional learning. Our work proposes a novel prompting strategy to further improve CLIP's compositional learning ability.

2.3.2 In-Context Learning

In-context learning (ICL), first introduced by [Brown et al., 2020], is an important paradigm for adapting LLMs and VLMs to new tasks. Different from prompt-based learning, the ICL paradigm adapts these large models to new tasks by prompting them with instructions (zero-shot) or demonstrations (few-shot) without any additional learnable parameters, as shown in Equation 2.2:

$$ \mathrm{Input}_{\mathrm{ICL}} = \mathrm{concat}\big([X_{icl};\ Y_{icl}]_{1}^{k};\ X_{\mathrm{test}}\big) \tag{2.2} $$

where [X_icl; Y_icl]_1^k are the k demonstration examples.

Compared with traditional learning paradigms, ICL has several advantages. First, data efficiency: the ability to do few-shot learning directly reduces the need for human-labeled data. Second, computational efficiency: in contrast to other popular training paradigms, ICL enables inference without any gradient updates. Lastly, good performance: ICL displays remarkable versatility through different modes of prompting. However, ICL's performance is highly sensitive to the prompting input, and three key components affect its performance: example selection, example order and template design [Nguyen and Wong, 2023]. In this thesis, we explore ICL for compositional learning. Specifically, we focus on example selection and ranking to improve VLMs' compositional ability.
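The two input constructions in Equations 2.1 and 2.2 can be sketched in a few lines of PyTorch and plain string formatting. This is a minimal illustration, not tied to any particular backbone; the class and function names, the prompt length, and the template wording are ours.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Equation 2.1: prepend learnable prompt vectors P to the embedded test input."""
    def __init__(self, prompt_len: int, embed_dim: int):
        super().__init__()
        # P is the only trainable tensor; the backbone model stays frozen.
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, embed_dim) from the frozen embedding layer.
        prompt = self.prompt.unsqueeze(0).expand(token_embeddings.size(0), -1, -1)
        return torch.cat([prompt, token_embeddings], dim=1)  # concat(P; X_test)

def build_icl_prompt(demonstrations, x_test):
    """Equation 2.2: concatenate k (input, label) demonstrations before the test input."""
    parts = [f"Input: {x}\nComposition: {y}" for x, y in demonstrations]
    parts.append(f"Input: {x_test}\nComposition:")  # the model completes the label
    return "\n\n".join(parts)
```

In the prompt-tuning case only the prompt matrix receives gradients, while the ICL construction involves no gradient updates at all; which examples go into the demonstration list, and in what order, is exactly the selection and ranking problem studied in Chapter 6.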
2.4 Meta Learning

Humans learn in a compositional manner from their previous experience [Fodor, 1975]. This process can be formalized within the meta-learning framework. Meta-learning, also known as learning to learn, deals with the problem of efficient learning, so that models can learn new concepts or skills quickly from just a few seen examples (few-shot setting) or even no seen examples (zero-shot setting). It aims to solve low-resource problems by leveraging the experience learned from a set of related tasks. Through learning from compositional tasks, meta-learning can be used to learn the compositional rules. There are mainly three categories of meta-learning methods: 1) metric-based methods learn a metric or distance function over tasks [Sung et al., 2018a, Snell et al., 2017b]; 2) model-based methods aim to design an architecture or a training process for rapid adaptation across tasks [Ravi and Larochelle, 2016, Munkhdalai et al., 2018]; 3) optimization-based methods directly adjust the optimization algorithm to enable quick adaptation with just a few examples [Nichol et al., 2018a, Finn et al., 2017]. Meta-learning has also recently been widely deployed in the NLP field [Gu et al., 2018, Dou et al., 2019, Holla et al., 2020] to address low-resource language processing problems. In this thesis, we use optimization-based meta-learning methods to learn a generalizable initialization for CZSL by training on the constructed episodes. This process tries to mimic humans' compositional learning process, and the compositional knowledge is encoded in the learned parameter initialization.

CHAPTER 3
ZERO-SHOT COMPOSITIONAL CONCEPT LEARNING

In this thesis, we study the problem of recognizing compositional attribute-object concepts within the zero-shot learning (ZSL) framework. We propose an episode-based cross-attention (EpiCA) network that combines the merits of the cross-attention mechanism and an episode-based training strategy to recognize novel compositional concepts. Firstly, EpiCA builds on cross-attention to correlate concept and visual information and utilizes a gated pooling layer to build contextualized representations for both images and concepts. The updated representations are used for a more in-depth multi-modal relevance calculation for concept recognition. Secondly, a two-phase episodic training strategy, especially the transductive phase, is adopted to utilize unlabeled test examples and alleviate the low-resource learning problem. Experiments on two widely used zero-shot compositional learning (ZSCL) benchmarks have demonstrated the effectiveness of the model compared with recent approaches in both the conventional and generalized ZSCL settings.¹

¹This chapter is based on: Zero-Shot Compositional Concept Learning. Guangyue Xu, Parisa Kordjamshidi, Joyce Chai. MetaNLP@ACL, 2021.

3.1 Introduction

Humans can recognize novel concepts by composing previously learned knowledge, an ability known as compositional generalization [Lake et al., 2015b, Lake and Baroni, 2018]. As this is a critical capacity for building modern AI systems, this thesis investigates the problem of zero-shot compositional learning (ZSCL), focusing on recognizing novel compositional attribute-object pairs appearing in images. For example, in Figure 3.1, suppose the training set has images with the compositional concepts sliced-tomato, sliced-cake, ripe-apple, peeled-apple, etc. Given a new image, our goal is to assign the novel compositional concept sliced-apple to the image by composing the element concepts, sliced and apple, learned from the training data. Although sliced and apple have appeared with other objects or attributes, the combination of this attribute-object pair is not observed in the training set.

This is a challenging problem because objects with different attributes often have a significant
diversity in their visual features. While red apple has visual features similar to the apple prototype, sliced apple presents rather different visual features, as shown in Fig. 3.1. Similarly, the same attribute can have different visual effects depending on the modified object. For example, old has a different visual effect on objects of old town compared to objects of old car.

Figure 3.1 Given the concepts of sliced and apple in the training phase, our target is to recognize the novel compositional concept sliced apple, which does not appear in the training set, by decomposing, grounding and composing concept-related visual features.

Despite recent progress [Misra et al., 2017b, Li et al., 2020c], previous works still suffer from several limitations: (1) Most existing methods adopt a metric learning framework by projecting concepts and images into a shared latent space and focus on regularizing the structure of the latent space by adding principled constraints, without considering the relationship between concepts and visual features. Our work brings a new perspective, a relevance-based framework inspired by [Sung et al., 2018b], to conduct compositional concept learning. (2) Previous works represent each concept and image by the same vector regardless of the context in which it occurs. However, cross concept-visual representations often provide more grounded information that helps in recognizing objects and attributes, which consequently helps in learning their compositions.

Motivated by the above discussion, we propose an Episode-based Cross Attention (EpiCA)
Previous solutions can mainly be categorized as: (1) classifier-based methods train classifiers for element concepts and combine the element classifiers to recognize compositional concepts [Chen and Grauman, 2014, Misra et al., 2017b, Li et al., 2019a]. (2) metric-based methods learn a shared space by minimizing the distance between the projected visual features and concept features [Nagarajan and Grauman, 2018a, Li et al., 2020c]. (3) GAN-based methods learn to generate samples from the semantic information and transfer ZSCL into a traditional supervised classification problem [Wei et al., 2019]. Attention Mechanism. The attention mechanism selectively use the salient elements of the data to compose the data representation and is adopted in various visiolinguistic tasks. Cross 16 Figure 3.2 Illustration of the proposed EpiCA framework. It is a two-stage training framework, including inductive learning and transductive learning. Both phases are trained on episodes as illustrated in Alg. 1. attention is employed to locate important image regions for text-image matching [Lee et al., 2018]. Self-attention and cross-attention are combined at different levels to search images with text feedback [Chen et al., 2020b]. More recent works refer to Transformer [Vaswani et al., 2017b] to design various visiolinguistic attention mechanism [Lu et al., 2019]. Episode-based Training. The data sparsity in low-resource learning problems, including few- shot learning and zero-shot learning, makes the typical fine-tuning strategy in deep learning not adaptable, due to not having enough labeled data and the overfitting problem. Most successful approaches in this field rely on an episode-based training scheme: performing model optimization over batches of tasks instead of batches of data. Through training multiple episodes, the model is expected to progressively accumulate knowledge on predicting the mimetic unseen classes within each episode. Representative work includes Matching network [Vinyals et al., 2016], Prototypical network [Snell et al., 2017a] and RelNet [Sung et al., 2018b]. The related works to EpiCA are RelNet [Sung et al., 2018b] and cvcZSL [Li et al., 2019a]. Compared with these methods, we have two improvements including an explicit way to construct episodes which is more consistent with the test scenario and a cross-attention module to fuse and ground more detailed information between the concept space and the visual space. 17 … …Ancient CityCut Pear Sliced PizzaDiced Apple…Broken ToyGloVe+ LSTMAttr. EmbeddingObj. EmbeddingVisual FeatureEmbedding PhaseScoring PhaseRelevance ScoreOne-hot VectorCNNEpisodeRelevance ScoreEpiCA…Sliced AppleVis-Cpt CrossAttnQ K VCpt-Vis CrossAttnQ K VGated PoolingGated PoolingMulti-Modal Relevance NetworkRelevance ScoreAttr. EmbeddingObj. EmbeddingVisual FeatureLoss 3.3 Approach 3.3.1 Task Definition Different from the traditional supervised setting where training concepts and test concepts are from the same domain, our problem focuses on recognizing novel compositional concepts of attributes and objects which are not seen during the training phase. Although we have seen all the attributes and objects in the training set, their compositions are novel 2. 
We model this problem within the ZSL framework, where the dataset is divided into a seen domain S = {(v^s, y^s) | v^s ∈ V^s, y^s ∈ Y^s} for training and an unseen domain U = {(v^u, y^u) | v^u ∈ V^u, y^u ∈ Y^u} for testing, where v is the visual feature of image I, which can be extracted using deep convolutional networks, and y is the corresponding label, consisting of an attribute label a and an object label o as y = (a, o), satisfying a^u ⊆ a^s, o^u ⊆ o^s and Y^s ∩ Y^u = ∅. Moreover, we address the problem in both the conventional ZSCL setting and the generalized ZSCL setting. In conventional ZSCL, we only consider unseen pairs in the test phase, and the target is to learn a mapping function V ↦ Y^u. In generalized ZSCL, images with both seen and unseen concepts can appear in the test set, and the mapping function changes to V ↦ Y^s ∪ Y^u, which is a more general and realistic setting.

3.3.2 Overall Framework

As summarized in Fig. 3.2, EpiCA consists of the cross-attention encoder, the gated pooling layer and the multi-modal relevance network to compute the relevance score between concepts and images. In order to accumulate knowledge relating images and concepts, EpiCA is trained on episodes in the following two phases:

• The inductive training phase constructs episodes from the seen concepts and trains EpiCA based on these constructed episodes.

• The transductive training phase employs the self-taught methodology to collect confident pseudo-labeled test items to further fine-tune EpiCA.

²In the rest of the thesis, we refer to a compositional concept as a concept, and to the attribute and the object as element concepts.

3.3.3 Unimodal Representation

Concept Representation. Given a compositional concept (a, o), we first transform the attribute and the object separately using 300-D GloVe [Pennington et al., 2014a]. Then we use a one-layer BiLSTM [Hochreiter and Schmidhuber, 1997] with d_k hidden units to obtain contextualized representations for concepts. Instead of using the final state, we keep the output features for both the attribute and the object and output a feature matrix C ∈ R^{2×d_k} for each compositional concept.

Image Representation. We extract visual features from a given image using a pretrained ResNet [He et al., 2016]. In order to obtain more detailed visual features for concept recognition, we keep the output of the last convolutional layer of ResNet-18 to represent the image; each image is therefore split into 7 × 7 = 49 visual blocks, with each block a 512-dim vector, denoted as V = (v_1, v_2, ..., v_49). Each element represents a region in the image. We further convert v_i with a linear transformation v_i = W^T v_i, where W ∈ R^{512×d_k} is the weight matrix that transfers the image into the joint concept-image space.

3.3.4 Cross Attention Encoder

Motivation. Previous works usually utilize a single vector representation for both concepts and images and construct a metric space by pushing aligned images and concepts closer to each other. The potential limitation of such frameworks is that the same vector representation, without context information, misses the detailed information needed for grounding and recognizing the objects and attributes appearing in the images. We observe that certain visual blocks in an image can be more related to a certain element concept, and a certain element concept may highlight different visual blocks. Inspired by this observation, our model addresses the previous limitation by introducing a cross-attention encoder and constructs more meaningful cross-modality representations for both images and element concepts for compositional concept recognition.

Cross Attention Layer. To fuse and ground information between the visual space and the concept space, we first design a correlation layer to calculate the correlation map between the two spaces, which is used to guide the generation of the cross-attention map. Given an image and a candidate concept, after extracting unimodal representations, the correlation layer computes the semantic
Inspired by this observation, our model addresses the previous limitation by introducing cross-attention encoder and constructs more meaningful cross-modality representation for both images and element concepts for compositional concept recognition. Cross Attention Layer. To fuse and ground information between visual space and concept space, we first design a correlation layer to calculate the correlation map between the two spaces, which is used to guide the generation of the cross attention map. Given an image and a candidate concept, after extracting unimodal representations, the correlation layer computes the semantic 19 relevance between visual blocks {𝑣𝑖}49 𝑖=1 and element concepts (cid:8)𝑐 𝑗 (cid:9)2 𝑗=1 3 with cosine distance and output the final image-to-concept relevance matrix as 𝑅 ∈ R49×2 with each element 𝑟𝑖 𝑗 calculated using Eq. 3.1. We can easily have another concept-to-image relevance matrix by transposing 𝑅. 𝑟𝑖 𝑗 = (cid:18) 𝑣𝑖 ∥𝑣𝑖 ∥2 (cid:33) (cid:19)𝑇 (cid:32) 𝑐 𝑗 (cid:13) (cid:13) (cid:13)𝑐 𝑗 (cid:13)2 , 𝑖 ∈ [1, 49], 𝑗 ∈ [1, 2] (3.1) In order to obtain attention weights, we need to normalize the relevance score 𝑟𝑖 𝑗 as Eq. 3.2 as [Chen et al., 2020a]. ¯𝑟𝑖 𝑗 = relu (cid:0)𝑟𝑖 𝑗 (cid:1) 𝑗=1 relu (cid:0)𝑟𝑖 𝑗 (cid:1) 2 √︃(cid:205)𝑛 (3.2) After obtaining the normalized attention score, we can calculate the cross-attention represen- tation based on the selected query space 𝑄 and the context space 𝑉, where 𝑉 = 𝐾 in our setting as shown in Fig. 5.4. Taking image-to-concept attention for example, given a visual block feature 𝑣𝑖 as query, cross attention encoding is performed over the element concept space 𝐶 using Eq. 3.3. 𝑣𝑖 = (cid:98) 𝑛 ∑︁ 𝑗=1 𝛼𝑖 𝑗 𝑐 𝑗 , s.t. 𝛼𝑖 𝑗 = exp (cid:0)𝜆 ¯𝑟𝑖 𝑗 (cid:1) 𝑗=1 exp (cid:0)𝜆 ¯𝑟𝑖 𝑗 (cid:1) (cid:205)𝑛 (3.3) where 𝜆 is the inverse temperature parameter of the softmax function [Chorowski et al., 2015] to control the smoothness of the attention distribution. Visually-Attended Concept Representation. The goal of this module is to align and represent concepts with related visual blocks and help further determine the alignment between element concepts and image regions. We use concept embedding as query and collect visual clues using Eq. 3.3 and the final visually-attended features for compositional concept is (cid:98) 𝑐 ∈ 𝑅2×𝑑𝑘 . Concept-Attended Visual Representation. An image representation grounded with element concept would be beneficial for compositional concept learning. Following the similar procedure as visually-attended concept representation, we take visual block features as query and concept embedding as context. We can calculate the concept-attended visual representation using Eq. 3.3. 3Each compositional concept only has two elements, attribute and object. 20 𝑣 ∈ R49×𝑑𝑘 represents the concept-attended block visual features with the latent The final result (cid:98) space dimension 𝑑𝑘 . 3.3.5 Gated Pooling Layer After the cross-attention encoder, the output image features 𝑉 = [𝑣1, . . . , 𝑣49] ∈ R49×𝑑𝑘 and concept features 𝐶 = [𝑐1, 𝑐2] ∈ R2×𝑑𝑘 are expected to contain rich cross-modal information. Our target of gated pooling layer is to combine elements to form the final representation for concepts and images separately. Pooling techniques can be directly deployed to obtain such representation. However, we argue that elements should have different effect on the final concept recognition. For example, background visual blocks shouldn’t be paid much attention during concept recognition. 
To address the assumption, we propose gated pooling layer to learn the relative importance of each element and dynamically control the contribution of each element in the final representation. Specially, We apply one linear layers with parameter 𝑊 ∈ R𝑑𝑘×1 on the element feature 𝑥𝑖 and normalize the output to calculate an attention weight 𝛼𝑖 that indicates the relative importance of each element using Eq. 3.4. 𝑥 = (cid:205)𝑖 𝛼𝑖𝑥𝑖 s.t. 𝛼𝑖 = exp((𝑊𝑥𝑖)) 𝑘=1 exp((𝑊𝑥𝑘)) (cid:205)𝑁 (3.4) 3.3.6 Multi-Modal Relevance Network After obtaining the updated features for both images (cid:98) 𝑎, 𝑣𝑖 and concepts ((cid:98) 𝑜) 𝑗 , we introduce the (cid:98) multimodal relevance network shared the spirit as [Sung et al., 2018b] to calculate the relevance score as shown in Eq. 3.5 𝑠𝑖, 𝑗 = 𝑔𝜙 (cid:0)concat[((cid:98) 𝑎, 𝑣𝑖), ((cid:98) 𝑜) 𝑗 ](cid:1) (cid:98) (3.5) where 𝑔 is the relevance function implemented by two layer feed-forward network with trainable parameters 𝜙. In order to train EpiCA, we add Softmax activation on the relevance score to measure the probability of image 𝑖 belonging to concept 𝑗 within the current episode as Eq. 3.5 and update 21 Algorithm 1: Training EpiCA for ZSCL: Input: D𝑡𝑟𝑎𝑖𝑛 = {(𝑣𝑚, (𝑎𝑚, 𝑜𝑚)}|𝑇𝑟 | Output: Multi-Modal Rel. Function 𝑓 𝑚=1, D𝑡𝑒𝑠𝑡 = {𝑣𝑛}|𝑇 𝑠| 𝑖=𝑛 , task size 𝑆, sample interval 𝑡 1 // Inductive Learning Phase for 𝑒 𝑝𝑜𝑐ℎ ← 1 to 𝐸𝑖𝑛𝑑_𝑚𝑎𝑥 do 2 3 4 5 6 7 for each image and the corresponding pair in the training set do Construct an episode [𝑣 𝑝, (𝑎 𝑝, 𝑜 𝑝), (𝑎𝑛1 Gated Cross-Attention Encoding using Eq. 3.1, 3.2, 3.3 and 3.4 Calculating multi-modal relevance score using Eq 3.5. Updating EpiCA. , 𝑜𝑛1), · · · , (𝑎𝑛𝑠 , 𝑜𝑛𝑠 )]. end 8 9 end 10 // Transductive Learning Phase 11 for 𝑒 𝑝𝑜𝑐ℎ ← 1 to 𝐸𝑡𝑟𝑎𝑛𝑠_𝑚𝑎𝑥 do if 𝑒 𝑝𝑜𝑐ℎ % 𝑡 == 0 then 12 Pick confident samples from unseen set by Eq. 3.7. 13 14 15 16 end end Updating EpiCA by Eq 3.9. EpiCA using cross-entropy loss. 𝑝 𝑗 ((cid:98) 𝑣𝑖) = exp(𝑠𝑖, 𝑗 ) 𝑘=1 exp (cid:0)𝑠𝑖,𝑘 (cid:1) (cid:205)𝐶 (3.6) 3.3.7 Training and Prediction Inductive Training. For each image and the corresponding pair label, we randomly sample negative pairs to form an episode which consists of an image 𝑣 𝑝, a positive pair (𝑎 𝑝, 𝑜 𝑝) and a predefined number 𝑛𝑡 of negative pairs in the form of [𝑣 𝑝, (𝑎 𝑝, 𝑜 𝑝), (𝑎𝑛1 , 𝑜𝑛1), · · · , (𝑎𝑛𝑡 , 𝑜𝑛𝑡 )]. Then within each episode, we calculate the relevance score between image and all candidate pairs using Eq. 3.5. Finally, we calculate the cross entropy loss using Eq. 3.6 and update EpiCA as shown in Alg. 1. Transductive Training. The disjointness of the seen/unseen concept space will result in domain shift problems and cause the predictions biasing towards seen concepts as pointed by [Pan and Yang, 2009]. Transductive training utilizes the unlabeled test set to alleviate the problem [Dhillon et al., 2019]. Specifically, transductive training has a sampling phase to select confident test samples and 22 utilize the generalized cross entropy loss as Eq. 3.8 to update EpiCA. Following previous work [Li et al., 2019b], we use threshold-based method as Eq. 3.7 to pick up confident examples. 𝑣𝑖) 𝑝1((cid:98) 𝑣𝑖) 𝑝2((cid:98) > 𝛾 (3.7) where 𝑝 is calculated by Eq. 3.6 and the threshold is the fraction of the highest label probability 𝑝1((cid:98) current episode. Only confident instances are employed to update EpiCA which is controlled by 𝛾. 
𝑣𝑖) and the second highest label probability 𝑝2((cid:98) 𝑣𝑖) which measures the prediction peakiness in Moreover, the recently proposed generalized cross-entropy loss [Zhang and Sabuncu, 2018] is used to calculate the loss for pseudo-labeled test examples as Eq. 3.8. L𝑢 = ∑︁ 𝑣𝑖))𝑞 1 − ( 𝑝 𝑗 ((cid:98) 𝑞 (𝑣𝑖,(𝑎,𝑜) 𝑗)∈U 𝑎, 𝑣𝑖 belonging to pair ((cid:98) (3.8) 𝑣𝑖) is the probability of (cid:98) where 𝑝 𝑗 ((cid:98) the hyper-parameter related to the noise level of the pseudo labels, with higher noisy pseudo labels 𝑜) 𝑗 calculated using Eq. 3.6. 𝑞 ∈ (0, 1] is (cid:98) requiring larger 𝑞. Finally, the transductive loss is calculated as Eq. 3.9, where L𝑢 corresponds to the generalized cross entropy loss from pseudo-labeled test examples and L𝑠 is the cross entropy loss for the training examples L = L𝑢 + L𝑠. (3.9) Prediction. Given a new image with extracted feature 𝑣𝑖, we iterate over all the candidate pairs and select the pair with the highest relevance score as ( ˆ𝑎, ˆ𝑜) = argmax ˆ𝑎, ˆ𝑜 𝑠𝑖, 𝑗 ( ˆ𝑣𝑖, ( ˆ𝑎, ˆ𝑜) 𝑗 ) as Eq. 3.5 using EpiCA. 3.4 Experiments Dataset. We use similar dataset as in [Nagarajan and Grauman, 2018a, Purushwalkam et al., 2019a] for both conventional and generalized ZSCL settings with the split shown in Tab. 3.1. Notably, generalized ZSCL setting has additional validation set for both benchmarks which allows 23 cross-validation to set the hyperparameters. The generalized ZSCL evaluates the models on both seen/unseen sets. • MIT-States [Isola et al., 2015b] has 245 objects and 115 attributes. In conventional ZSCL, the pairs are split into two disjoint sets with 1200 seen pairs and 700 unseen pairs. In generalized ZSCL, the validation set has 600 pairs with 300 pairs seen in the training set and 300 pairs unseen during training and the test set has 800 pairs with 400 pairs seen and remaining 400 pairs unseen in the training set. • UT-Zappos [Yu and Grauman, 2017] contains images of 12 shoe types as object labels and 16 material types as attribute labels. In conventional ZSCL, the dataset is split into disjoint seen set with 83 pairs and unseen set with 33 pairs. In generalized ZSCL, the 36 pairs in the test set consists 18 seen and 18 unseen pairs. 15 seen pairs and 15 unseen pairs composes the validation set. Implementation Details. We develop our model based on PyTorch. For all experiments, we adopt ResNet-18 pre-trained on ImageNet as the backbone to extract visual features. For attr-obj pairs, we encode attributes and objects with 300-dim GloVe and fix it during the training process. We randomly sample 50 negative pairs to construct episodes. We use Adam with 10−3 as the initial learning rate and multiply the learning rate by 0.5 every 5 epoch and train the network for total 25 epochs. We report the accuracy at the last epoch for conventional ZSCL. For generalized ZSCL, the accuracy is reported based on the validation set. Moreover, the batch size is set to 64, 𝜆 in Eq. 3.3 is set to 9, 𝑞 in Eq. 3.8 is set to 0.5 and the threshold in Eq. 3.7 is set to 10. 4 Baselines. We compare EpiCA with the following SOTA methods: 1) Analog [Chen and Grauman, 2014] trains a linear SVM classifier for the seen pairs and utilizes Bayesian Probabilistic Tensor Factorization to infer the unseen classifier weights. 2) Redwine [Misra et al., 2017b] leverages the compatibility between visual features 𝑣 and concepts semantic representation to do the recognition. 
3.4 Experiments

Dataset. We use the same datasets as [Nagarajan and Grauman, 2018a, Purushwalkam et al., 2019a] for both the conventional and generalized ZSCL settings, with the splits shown in Tab. 3.1. Notably, the generalized ZSCL setting has an additional validation set for both benchmarks, which allows cross-validation for setting the hyper-parameters, and it evaluates the models on both the seen and unseen sets.

• MIT-States [Isola et al., 2015b] has 245 objects and 115 attributes. In conventional ZSCL, the pairs are split into two disjoint sets with 1200 seen pairs and 700 unseen pairs. In generalized ZSCL, the validation set has 600 pairs, 300 of which are seen in the training set and 300 of which are unseen during training; the test set has 800 pairs, with 400 seen and the remaining 400 unseen in the training set.

• UT-Zappos [Yu and Grauman, 2017] contains images of 12 shoe types as object labels and 16 material types as attribute labels. In conventional ZSCL, the dataset is split into a seen set with 83 pairs and a disjoint unseen set with 33 pairs. In generalized ZSCL, the 36 pairs in the test set consist of 18 seen and 18 unseen pairs; 15 seen pairs and 15 unseen pairs compose the validation set.

                   Conventional ZSCL        Generalized ZSCL
                   MIT-States   Zappos      MIT-States   Zappos
# Attr.            115          16          115          16
# Obj.             245          12          245          12
# Train Pair       1262         83          1262         83
# Train Img.       34562        24898       30338        22998
# Test Pair        700          33          800          36
# Test Img.        19191        4228        12995        2914
# Val. Pair        -            -           600          30
# Val. Img.        -            -           10420        3214
Table 3.1 Data statistics for the conventional and generalized data splits of the MIT-States and UT-Zappos datasets.

Implementation Details. We develop our model based on PyTorch. For all experiments, we adopt ResNet-18 pre-trained on ImageNet as the backbone to extract visual features. For attr-obj pairs, we encode attributes and objects with 300-dim GloVe embeddings and fix them during training. We randomly sample 50 negative pairs to construct episodes. We use Adam with an initial learning rate of $10^{-3}$, multiply the learning rate by 0.5 every 5 epochs, and train the network for a total of 25 epochs. We report the accuracy at the last epoch for conventional ZSCL. For generalized ZSCL, the accuracy is reported based on the validation set. Moreover, the batch size is set to 64, $\lambda$ in Eq. 3.3 is set to 9, $q$ in Eq. 3.8 is set to 0.5, and the threshold $\gamma$ in Eq. 3.7 is set to 10.4
4 Our code is publicly available at: https://github.com/HLR/CrossAttnCptLearn

Baselines. We compare EpiCA with the following SOTA methods: 1) Analog [Chen and Grauman, 2014] trains a linear SVM classifier for the seen pairs and utilizes Bayesian Probabilistic Tensor Factorization to infer the unseen classifier weights. 2) Redwine [Misra et al., 2017b] leverages the compatibility between visual features $v$ and the concepts' semantic representations for recognition. 3) AttOperator [Nagarajan and Grauman, 2018a] models composition by treating attributes as matrix operators that modify the object state to score the compatibility. 4) GenModel [Nan et al., 2019] adds a reconstruction loss to boost the metric-learning performance. 5) TAFE-Net [Wang et al., 2019] extracts visual features based on the pair's semantic representation and utilizes a shared classifier to recognize novel concepts. 6) SymNet [Li et al., 2020c] builds a transformation framework and adds group-theory constraints to its latent space to recognize novel concepts. We report the results according to the above baseline papers and the released official code5 6 of the aforementioned baselines.
5 https://github.com/Tushar-N/attributes-as-operators
6 https://github.com/ucbdrive/tafe-net.git

3.4.1 Conventional ZSCL Setting

Quantitative Results. The top-1 accuracy metric is reported in this setting to compare different methods. The top-1 accuracy on the unseen attr-obj pairs for conventional ZSCL is presented in Tab. 3.2.

Methods                MIT-States (%)   UT-Zappos (%)
Random                 0.14             3.0
ANALOG                 1.4              18.3
REDWINE                12.5             40.3
ATTOPERATOR            14.2             46.2
GenModel               17.8             48.3
TAFE-Net               16.4             33.2
SymNet                 19.9             52.1
EpiCA (Inductive)      15.68            52.56
EpiCA (Transductive)   18.13            55.48
Table 3.2 Results in the conventional ZSCL setting.

EpiCA outperforms all baselines on the UT-Zappos benchmark and exceeds the state of the art by 3.3%. It achieves comparable performance on the MIT-States benchmark. We will empirically analyze the model's behavior in later sections.

3.4.2 Generalized ZSCL Setting

In this setting, following the related work [Purushwalkam et al., 2019a], we measure the performance with the AUC metric. AUC introduces the concept of a calibration bias, which is a scalar value added to the prediction scores of the unseen pairs. By changing the value of the calibration bias, we can draw an accuracy curve over the seen/unseen sets. The area below the curve is the AUC metric, which serves as a measurement for the generalized ZSCL system.

                      MIT-States                          UT-Zappos
                      Val AUC            Test AUC         Val AUC               Test AUC
Model (Top k)         1     2     3      1    2     3     1     2     3         1     2     3
AttOperator           2.5   6.2   10.1   1.6  4.7   7.6   21.5  44.2  61.6      25.9  51.3  67.6
RedWine               2.9   7.3   11.8   2.4  5.7   9.3   30.4  52.2  63.5      27.1  54.6  68.8
LabelEmbed+           3.0   7.6   12.2   2.0  5.6   9.4   26.4  49.0  66.1      25.7  52.1  67.8
TMN                   3.5   8.1   12.4   2.9  7.1   11.5  36.8  57.1  69.2      29.3  55.3  69.8
SymNet                4.3   9.8   14.8   3.0  7.6   12.3  -     -     -         -     -     -
Inductive EpiCA       7.73  12.19 22.93  6.55 13.07 20.01 25.13 50.19 61.97     25.59 50.06 63.08
Transductive EpiCA    9.01  17.63 24.01  7.18 14.02 21.31 53.18 68.71 77.89     35.04 54.83 70.02
Table 3.3 AUC in percentage (multiplied by 100) on MIT-States and UT-Zappos. Our EpiCA model outperforms the previous methods by a large margin on MIT-States and on most of the metrics on UT-Zappos.

Quantitative results. Tab. 3.3 provides comparisons between our EpiCA model and the previous methods on both the validation and testing sets. As Tab. 3.3 shows, the EpiCA model outperforms the previous methods by a large margin.
On the challenging MIT-States dataset, which has about 2000 attribute-object pairs, all the baseline methods have a relatively low AUC score, while our model is able to double the performance of the previous methods, indicating its effectiveness.

3.4.3 Ablation Study

We conduct an ablation study on EpiCA and compare its performance under different settings.

Importance of Transductive Learning. The experimental results in Tab. 3.2 and Tab. 3.3 show the importance of transductive learning. There are about 2% and 3% performance gains for MIT-States and UT-Zappos in conventional ZSCL, and a significant improvement is observed for both datasets in generalized ZSCL. This is within our expectation because 1) our inductive model has accumulated knowledge about the elements of the concepts and has the ability to pick confident test examples, and 2) after training the model with the confident pseudo-labeled test data, it acquires knowledge about the unseen concepts.

Importance of Cross-Attention (CA) Encoder. To analyze the effect of the CA encoder, we remove CA (w/o CA) and use unimodal representations for both concepts and images. From Tab. 3.4, it can be seen that EpiCA does depend on multi-modal information for concept recognition, and the results also verify the rationale of fusing multi-modal information through the cross-attention mechanism.

Importance of Gated Pooling (GP) Layer. We replace the GP layer with average pooling (w/o GP). Tab. 3.4 shows the effectiveness of GP in filtering out noisy information. Instead of treating each element equally, GP helps to selectively suppress and highlight salient elements within each modality.

Importance of Episode Training. We also conduct experiments by removing both CA and GP (w/o GP and CA). In this setting, we concatenate the unimodal representations of images and concepts and use a 2-layer MLP to calculate the relevance score. Although simple, it still achieves satisfactory results, showing that episode training is vital for our EpiCA model.

EpiCA variants                MIT-States (%)   UT-Zappos (%)
Full EpiCA                    15.79            52.56
- w/o cross attention (CA)    12.05            42.77
- w/o gated pooling (GP)      13.46            50.47
- w/o GP and CA               14.13            48.76
Table 3.4 Ablation study of EpiCA components. The episode training and cross-attention encoder are important to our model. Adding the gated pooling layer further boosts the accuracy.

3.4.4 Qualitative Analysis

Fig. 3.3 shows some examples and the labels predicted by EpiCA.

Figure 3.3 Prediction examples of EpiCA on the MIT-States dataset. True labels and predicted labels are in red and blue text, respectively.

Although it gives the correct predictions for the two examples in the first row, EpiCA still struggles to distinguish similar, or even opposite, attributes such as New and Old. For example, the second highest prediction for the image with true label new truck is old car. The predicted object is reasonable, but the predicted attribute is the opposite. Meanwhile, for the incorrect predictions, the predicted labels are meaningful and remain relevant to the image. For example, Engraved Clock may be a better label than Ancient Clock for the bottom image. These examples show that EpiCA learns the relevance between images and concepts, but the evaluation of the models is hard, and in some cases additional information and bias are needed to predict the exact labels occurring in the dataset.

3.5 Conclusion

In this chapter, we propose EpiCA, which combines episode-based training and a cross-attention mechanism to exploit the alignment between concepts and images to address the ZSCL problem.
It has led to competitive performance on two benchmark datasets. In the generalized ZSCL setting, EpiCA achieves over a 2× performance gain compared to the SOTA on several evaluation metrics. However, ZSCL remains a challenging problem. Future work that explores cognitively motivated learning models and incorporates information about relations between objects as well as attributes will be an interesting direction to pursue.

CHAPTER 4
GIPCOL: GRAPH-INJECTED SOFT PROMPTING FOR COMPOSITIONAL ZERO-SHOT LEARNING

Pre-trained vision-language models (VLMs) have achieved promising success in many fields, especially with the prompt learning paradigm. In this work, we propose GIPCOL (Graph-Injected Soft Prompting for COmpositional Learning) to better explore the compositional zero-shot learning (CZSL) ability of VLMs within the prompt-based learning framework. The soft prompt in GIPCOL is structured and consists of learnable prefix vectors, the attribute label and the object label. In addition, the attribute and object labels in the soft prompt are designated as nodes in a compositional graph. The compositional graph is constructed based on the compositional structure of the objects and attributes extracted from the training data and consequently feeds the updated concept representations into the soft prompt to capture this compositional structure for better prompting in CZSL. With the new prompting strategy, GIPCOL achieves state-of-the-art AUC results on all three CZSL benchmarks, including the MIT-States, UT-Zappos, and C-GQA datasets, in both closed and open settings compared to previous non-CLIP as well as CLIP-based methods. We analyze when and why GIPCOL operates well given the CLIP backbone and its training data limitations, and our findings shed light on designing more effective prompts for CZSL.1
1 GIPCOL: Graph-Injected Soft Prompting for Compositional Zero-Shot Learning. Guangyue Xu, Joyce Chai, Parisa Kordjamshid. WACV, 2024

4.1 Introduction

Compositional ability is a key component of human intelligence and should be an important building block for current autonomous AI agents. Fig. 4.1 demonstrates a compositional learning example where, after learning the element concepts sliced and apple, the autonomous agent is expected to recognize the novel composition sliced apple, which has not been observed during training, by composing the learned element concepts.2 This example shows the compositional attribute-object learning problem, and this type of compositional ability is essential for language grounding in vision-language tasks such as instruction following [Chai et al., 2018], navigation [Anderson et al., 2018], and image captioning [Vinyals et al., 2015]. In this chapter, we investigate the compositional zero-shot learning (CZSL) problem as shown in the example.
2 Element concepts, also known as primitive concepts, include both attributes and objects in CZSL.

Figure 4.1 CZSL setting: given the element concepts of sliced and apple, our target is to recognize the compositional concept sliced apple.
It requires agents to recognize novel compositions of the attribute-object (attr-obj) pairs appearing in an image by composing previously learned element concepts (e.g., "sliced" and "apple" individually are considered element concepts). The main challenges of CZSL are: 1) the zero-shot setting, in which we have no training data for the novel compositions; 2) the need for the model to learn compositional rules for composing the learned element concepts; and 3) the distribution shift from the training data to the test data caused by the zero-shot setting. Such a shift causes the learned models to overfit the seen compositions and makes it difficult to generalize to novel compositions. Previous solutions usually construct a shared embedding space to calculate matching scores between images and seen pairs, and add different generalizing constraints to regularize the space, expecting the learned embeddings to encode compositional properties [Nagarajan and Grauman, 2018b, Naeem et al., 2021, Mancini et al., 2021]. Given the impressive performance of large VLMs on downstream tasks, in this work we attempt to solve CZSL through the lens of prompting large VLMs, specifically using CLIP [Radford et al., 2021] as in [Nayak et al., 2022]. Different from traditional zero-shot learning (ZSL) settings, where each class is represented by a single text label [Zhou et al., 2022a, Zhou et al., 2022b], CZSL needs to consider the compositional information among the concepts. Therefore, designing a prompt that can efficiently encode the compositional information is the main challenge of our work. We expect the designed prompt to re-program CLIP for compositional learning [Tsai et al., 2020], and the compositional labels in the prompt should reflect the compositional information. Motivated by the above expectations, we propose GIPCOL (Graph-Injected Soft Prompting for COmpositional Learning) to design a better prompt for applying VLMs to CZSL. The core idea of GIPCOL is to re-program CLIP for CZSL by setting the prefix vectors in the soft prompt as learnable parameters, which is different from CSP [Nayak et al., 2022]. Moreover, GIPCOL captures the compositional structure between concepts by constructing a compositional graph from the seen pairs in the training dataset. The concepts, both element concepts and compositional concepts, act as nodes in the graph, and the compositional graph models the feasible topological combinations between these concepts. GIPCOL uses a GNN module to update the element labels' representations based on their neighbor information in the constructed compositional graph, and the updated element embeddings are used as class labels in the soft prompt. Concretely, the learnable prefix vectors and the GNN-updated element concepts constitute the soft prompt of GIPCOL and work together to exploit CLIP's knowledge for CZSL. The contributions of this work can be summarized as follows:

• Novel prompting design. Our technique introduces a novel way of utilizing the compositional structure of concepts for constructing the soft prompts. Although we use a GNN for capturing this structure, any other differentiable architecture could be used here to enrich the prompt's compositional representation.

• GIPCOL achieves SoTA AUC results on all three CZSL benchmarks, including MIT-States, UT-Zappos, and the more challenging C-GQA dataset.
Moreover, it shows consistent improvements compared to other CLIP-based methods on all benchmarks.

4.2 Related Work

Compositional Zero-Shot Learning (CZSL) is a special field of Zero-Shot Learning (ZSL). CZSL is a challenging problem, as it requires generalization from seen compositions to novel compositions by learning the compositional rules between element concepts. There are mainly four lines of research addressing this problem. 1) Classifier-based methods train classifiers for attributes and objects separately and combine the element predictions into compositional predictions [Misra et al., 2017a]. 2) Embedding-based methods construct a shared embedding space for both textual pairs and images; different methods add different constraints on the space to enhance compositionality [Nagarajan and Grauman, 2018b]. 3) Generation-based methods learn to generate visual features for the novel compositions and train classifiers on the generated images [Xian et al., 2018a]. 4) Recently proposed prompt-based methods utilize CLIP and introduce learnable element concept embeddings or soft prefix vectors in the soft prompt to solve CZSL problems [Nayak et al., 2022, Xu et al., 2022].

Prompt-based Learning. Parallel to fine-tuning, prompt learning provides an efficient mechanism to adapt large pre-trained language models (PLMs) or vision-language models (VLMs) to downstream tasks by treating the input prompt as learnable parameters while freezing the rest of the foundation model. Prompt learning is a parameter-efficient framework originating from the NLP field that aims at utilizing the knowledge encoded in PLMs for downstream tasks [Liu et al., 2021, Brown et al., 2020, ?]. Recently, with the prevalence of large vision-language models (VLMs), prompt learning has been introduced into multimodal settings to solve VL-related problems [Tsimpoukelli et al., 2021, Yang et al., 2022, Jin et al., 2021], including the CZSL problem [Nayak et al., 2022, Xu et al., 2022]. In both linguistic and multi-modal settings, prompt engineering plays an important role. How to design a suitable prompt template for downstream tasks is a challenge, and GIPCOL proposes a novel approach to address this challenge.

Vision-Language Models. Large VLMs are pre-trained to learn the semantic alignment between the vision and language modalities at different levels [Jia et al., 2021, Radford et al., 2021]. Attention-based encoders, large mini-batch contrastive losses, and web-scale training data are the main factors boosting the performance of such vision-language models. Recent advances in these pre-trained VLMs have presented a promising direction to promote open-world visual understanding with the help of language. Besides open-world image classification, VLMs are used in other visual fields, such as dense prediction [Rao et al., 2022b] and caption generation [Mokady et al., 2021]. Among existing methods, the most relevant to ours are CSP [Nayak et al., 2022] and CGE [Naeem et al., 2021]. CSP treats the element concept labels as learnable parameters to prompt CLIP for CZSL and can be considered a baseline for GIPCOL. CGE encodes compositional concepts using a GNN and constructs a shared embedding space to align images and compositional concepts. It is a task-specific architecture and needs to fine-tune the visual encoder to achieve satisfactory performance. Compared with such task-specific models, GIPCOL is a general prompting method and uses a GNN to capture interactions among the concepts for its soft prompt design.
GIPCOL fixes CLIP's pre-trained visual and textual encoders and achieves better performance in a more general and parameter-efficient manner. It is worth noting that the GNNs used in CGE and GIPCOL have different natures: CGE uses its GNN for compositional encoding, while GIPCOL uses its GNN for soft prompt construction.

Figure 4.2 Illustration of different CZSL settings based on the target compositional set. GIPCOL is evaluated under the closed-world and open-world settings.

4.3 Problem Formulation

In this section, we formally define the CZSL task. Let $\mathcal{A} = \{a_0, a_1, \ldots, a_n\}$ be the attribute set and $\mathcal{O} = \{o_0, o_1, \ldots, o_m\}$ be the object set. The full compositional label space $\mathcal{C}$ is the Cartesian product of these two element concept sets, $\mathcal{C} = \mathcal{A} \times \mathcal{O}$, with size $n \times m$. At training time, we are given a set of seen3 examples $\mathcal{C}_{seen} = \{(x_1, c_1), \ldots, (x_k, c_k)\}$, where $x_i$ is an image and $c_i = (a_i, o_i)$4 is its compositional label from the seen set $\mathcal{C}_{seen} \subset \mathcal{C}$. The goal of CZSL is to learn a function $f$ that assigns a compositional label from the target set $\mathcal{C}_{target} \subseteq \mathcal{C}$ to a given image. Based on different target set settings, as shown in Fig. 4.2, CZSL can be categorized into: 1) Closed-world CZSL, where $\mathcal{C}_{target} = \mathcal{C}_{seen} \cup \mathcal{C}_{unseen}$; the target set consists of both seen and unseen pairs, as introduced in [Purushwalkam et al., 2019b]. In this setting, both the seen and unseen pairs are feasible, and the setting is called closed-world because the test pairs are given in advance. 2) Open-world CZSL, where $\mathcal{C}_{target} = \mathcal{C}$; the target set contains all attr-obj combinations, including both feasible and infeasible pairs. This is the most challenging case, introduced in [Mancini et al., 2021]. We evaluate our models under both the closed-world and open-world settings.
3 Seen examples also mean training examples; we use the terms interchangeably in this work.
4 We use the pair index to denote the object and attribute indexes for the sake of simple notation. The object and attribute indexes do not refer to their original sets in this case.
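As a concrete reading of this formulation, the short sketch below builds the compositional label space and the closed-world versus open-world target sets from toy attribute and object lists. The toy lists and variable names are illustrative only, not part of the benchmarks.

```python
from itertools import product

# Toy element concepts (illustrative only).
attributes = ["sliced", "ripe", "peeled"]
objects = ["apple", "tomato", "cake"]

# Full compositional label space C = A x O.
C = set(product(attributes, objects))

# Seen pairs observed at training time (a subset of C) and novel-but-feasible unseen pairs.
C_seen = {("sliced", "tomato"), ("ripe", "apple"), ("peeled", "apple")}
C_unseen = {("sliced", "apple")}

# Closed-world target set: seen pairs plus a known list of unseen pairs.
C_target_closed = C_seen | C_unseen
# Open-world target set: every attribute-object combination, feasible or not.
C_target_open = C

print(len(C), len(C_target_closed), len(C_target_open))
```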
4.4 GIPCOL

By pre-training on 400 million image-text association pairs, CLIP has already learned general knowledge for image recognition. In order to fully utilize CLIP's capability in compositional learning, GIPCOL freezes CLIP's textual and visual encoders and focuses on structuring its textual prompt to address compositional concept learning. GIPCOL's architecture is shown in Fig. 4.3. In particular, GIPCOL adds two learnable components to construct the soft prompt for CZSL: the learnable prefix vectors and the GNN module. The prefix vectors add more learnable parameters to represent the compositional concepts and re-program CLIP for compositional learning. Notably, in the whole architecture in Fig. 4.3, the soft prompt and the soft embeddings are the only modules that need to be learned. The GNN module captures the compositional structure of the objects and attributes for a better compositional concept representation in the constructed soft prompt. We describe the details of GIPCOL, including the learnable prefix vectors, the GNN, and CLIP's visual/textual encoders, in the following section.

Figure 4.3 GIPCOL architecture. Besides CLIP's frozen text and visual encoders, GIPCOL consists of two learnable components: a soft-prompting module and a GNN. GIPCOL calculates the cosine similarity between the given image and all candidate pairs, and the cross-entropy loss is back-propagated through the frozen LM in order to update the soft prompt and the GNN.

4.4.1 GIPCOL Architecture

Learnable Prefix Vectors. We designate $m$ learnable prefix vectors $\Theta = \{\theta_1, \theta_2, ..., \theta_m\}$, where $\theta_i \in \mathbb{R}^d$, in the soft prompt for compositional concept encoding. $d$ is set to 768 to be consistent with the CLIP embedding size. A larger $m$ means more learnable parameters and more capacity for compositional concept representation. These vectors are prepended to the attr-obj embeddings and act as part of the compositional representation. The prefix vectors are fine-tuned by gradients flowing back through CLIP during training.

GNN as Concept Encoder. Different from traditional zero-shot learning (ZSL) problems, where output labels are treated independently, CZSL requires modeling the interactions between element concepts. For example, given the compositional concept red apple, we need to learn both the concept apple and how red changes apple's state, instead of treating red and apple as two independent concepts. Graph Neural Networks (GNNs) have been shown to capture such dependencies [Naeem et al., 2021, Mancini et al., 2022]. We introduce a GNN in GIPCOL to enrich the concept representations by fusing information from their compositional neighbors as follows:

$(\hat{a}_i, \hat{o}_i) = GNN_{\Phi}(a_i, o_i)$   (4.1)

where $\Phi$ is the GNN's parameter, and $(a_i, o_i)$ and $(\hat{a}_i, \hat{o}_i)$ are the original and updated compositional concept representations. The updated node representations from the GNN serve as class labels in the soft prompt. The whole soft prompt represents the compositional concept and is fed into CLIP's textual encoder for compositional learning.
Frozen CLIP Text Encoder. After obtaining the updated compositional representations $(\hat{a}_i, \hat{o}_i)$, GIPCOL prepends the learnable prefix vectors $\Theta = [\theta_1, \theta_2, ..., \theta_m]$ to $(\hat{a}_i, \hat{o}_i)$ to represent the compositional concept as follows:

$[\,SOS,\ \underbrace{\theta_1, \theta_2, \ldots, \theta_m}_{\text{prefix vectors}},\ \underbrace{\hat{a}_i, \hat{o}_i}_{\text{GNN-updated concept}},\ EOS\,]$   (4.2)

where the whole sequence serves as the soft prompt representing the compositional concept. We then use CLIP's frozen text encoder, a BERT encoder [Devlin et al., 2019], to extract the normalized EOS vector as the compositional concept's representation for further multi-modal alignment:

$\boldsymbol{c}_i = \frac{\mathrm{TxtEnc}\big(\Theta, (\hat{a}_i, \hat{o}_i)\big)}{\lVert \mathrm{TxtEnc}\big(\Theta, (\hat{a}_i, \hat{o}_i)\big) \rVert}$   (4.3)

where $(\hat{a}_i, \hat{o}_i)$ are the GNN-updated attribute and object vectors and $c_i$ is the $i$-th compositional concept vector encoded by CLIP.

Frozen Visual Encoder. Following CLIP's pre-processing routine, we first rescale the image to 224 × 224. Then we use ViT-L/14 as the visual encoder to encode the image and extract the [CLASS] token as the image's representation. The extracted image vector $x_i$ is normalized for the subsequent similarity calculation:

$\boldsymbol{x}_i = \frac{\mathrm{VisEnc}(v_i)}{\lVert \mathrm{VisEnc}(v_i) \rVert}$   (4.4)

where $v_i$ is the given image and $x_i$ is its vector representation.

Aligning Image and Compositional Concept. After obtaining the vectors for the compositional concept $c_i$ and the image $x$, GIPCOL calculates the probability of $x$ belonging to class $c_i$ as follows:

$p(c_i \mid x) = \frac{\exp\big((x \cdot c_i)/\tau\big)}{\sum_{k=1}^{K} \exp\big((x \cdot c_k)/\tau\big)}$   (4.5)

where $\tau$ is a temperature parameter from CLIP, $\cdot$ denotes the inner product of the concept vector and the image vector, and $K$ is the number of attr-obj pairs in the training set.
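The following is a minimal PyTorch-style sketch of how the pieces of Sec. 4.4.1 fit together: prefix vectors and GNN-updated concept embeddings form the soft prompt (Eq. 4.2), the frozen CLIP text and image encoders produce normalized vectors (Eqs. 4.3–4.4), and a temperature-scaled softmax over cosine scores gives the class probabilities (Eq. 4.5). The encoder interfaces `clip_text_encode` and `clip_image_encode` are stand-ins for the frozen CLIP modules, not the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GipcolScorer(nn.Module):
    """Sketch of GIPCOL's forward pass; the two encoder callables are assumed frozen."""
    def __init__(self, num_prefix: int, dim: int, clip_text_encode, clip_image_encode, tau: float = 0.01):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(num_prefix, dim) * 0.02)  # learnable theta_1..theta_m
        self.clip_text_encode = clip_text_encode    # frozen: prompt token embeddings -> EOS vector
        self.clip_image_encode = clip_image_encode  # frozen: image -> [CLASS] vector
        self.tau = tau

    def forward(self, image, pair_embeddings: torch.Tensor) -> torch.Tensor:
        # pair_embeddings: [K, 2, dim] GNN-updated (attr, obj) embeddings for K candidate pairs.
        K = pair_embeddings.size(0)
        prefix = self.prefix.unsqueeze(0).expand(K, -1, -1)       # [K, m, dim]
        prompts = torch.cat([prefix, pair_embeddings], dim=1)     # Eq. 4.2 (SOS/EOS handled by the encoder)
        c = F.normalize(self.clip_text_encode(prompts), dim=-1)   # Eq. 4.3, [K, dim]
        x = F.normalize(self.clip_image_encode(image), dim=-1)    # Eq. 4.4, [dim]
        logits = x @ c.t() / self.tau                             # cosine similarity scaled by temperature
        return logits.softmax(dim=-1)                             # Eq. 4.5
```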
4.4.2 GNN in Soft Prompting

As discussed previously, a key idea for addressing the CZSL problem is to learn concept representations that internalize the compositional information. A graph is a natural tool to model such compositional dependencies, and this idea has been used in previous work [Naeem et al., 2021, Mancini et al., 2022] by applying Graph Neural Networks (GNNs) as encoders to represent the compositional concepts. Although we adopt a similar graph-based method for compositional encoding, our novelty is to use the graph's compositional structure to facilitate automated prompt engineering in compositional learning, as shown in Fig. 4.4. We model the element concepts and their compositions explicitly in the GNN for the soft prompt construction. In principle, the GNN module can be replaced by other differentiable architectures that are able to capture the concepts' compositional information. We describe the detailed GNN application in GIPCOL next.

Figure 4.4 Comparison between CGE and GIPCOL. The main difference is that GIPCOL uses the GNN to help prompt construction instead of as a compositional concept encoder.

Node Embedding V. There are two types of nodes in GIPCOL's compositional graph: element concept nodes and compositional concept nodes. The node embedding matrix has size $\mathbb{R}^{(|a|+|o|+|c|) \times d}$, where $|a|$ is the number of attributes, $|o|$ is the number of objects, $|c|$ is the number of training pairs, and $d$ is the feature dimension. We initialize the element nodes using CLIP's embedding vectors, and the compositional nodes using the average embedding of their element nodes, that is, $\frac{att\_vec + obj\_vec}{2}$. GIPCOL relies on the GNN to fuse information from the constructed compositional graph and update the concepts' representations.

Compositional Graph Construction E. We use a graph to capture the compositional dependencies and learn richer concept representations. The design of the connections among concepts is the key challenge for such a graph. In order to utilize the feasible compositional information, GIPCOL considers the training pairs and constructs one single compositional graph for both closed-world CZSL and open-world CZSL to conserve computing and storage resources. Specifically, given a pair $c = (a, o)$, besides the self-connected edge, GIPCOL adds three undirected edges $(c \leftrightarrow a)$, $(c \leftrightarrow o)$ and $(a \leftrightarrow o)$ to the graph, where the adjacency matrix $A \in \mathbb{R}^{K \times K}$ is symmetric with $K = |a| + |o| + |c|$. The compositional concepts play a bridging role that connects the element concepts, and only the element concepts are used to construct the compositional prompt due to the zero-shot setting.

GNN Module. Once we have the compositional graph and the initialized concept features, we can update the concept embeddings by fusing the compositional information from their neighbors. Any GNN model could be applied here; in GIPCOL, we use a Graph Convolutional Network (GCN) [Kipf and Welling, 2016], as in Eq. 4.6, for compositional encoding:

$H^{(l+1)} = \sigma\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} \Phi^{(l)}\big)$   (4.6)

where $H^{(l)}$ denotes the node representations in the $l$-th layer, $\sigma$ is the ReLU non-linearity, $\tilde{A}$ is the adjacency matrix with added self-connections, $\tilde{D}$ is the diagonal node degree matrix, and $\Phi^{(l)}$ is the learnable weight matrix in layer $l$. Notably, other graph construction methods, such as using external knowledge [Karthik et al., 2022], and other GNN models, such as GAT [Velickovic et al., 2017], could be further explored to improve CZSL performance based on GIPCOL's architecture. However, these are not the target of this work; here, GIPCOL shows the effectiveness of utilizing compositional knowledge in prompt construction for CZSL.
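As a rough sketch of this construction, the code below builds the symmetric adjacency matrix with the three undirected edges per training pair plus self-connections, and applies one GCN layer as in Eq. 4.6. The index convention (attributes first, then objects, then pairs) is an assumption for illustration.

```python
import torch

def build_adjacency(num_attr: int, num_obj: int, train_pairs):
    """Nodes ordered as [attributes | objects | training pairs]; edges per Sec. 4.4.2."""
    K = num_attr + num_obj + len(train_pairs)
    A = torch.eye(K)                                   # self-connections
    for p, (a, o) in enumerate(train_pairs):           # a in [0, num_attr), o in [0, num_obj)
        na, no, nc = a, num_attr + o, num_attr + num_obj + p
        for i, j in [(nc, na), (nc, no), (na, no)]:    # c<->a, c<->o, a<->o
            A[i, j] = A[j, i] = 1.0
    return A

def gcn_layer(H: torch.Tensor, A: torch.Tensor, Phi: torch.Tensor) -> torch.Tensor:
    """Eq. 4.6: H^{l+1} = ReLU(D^{-1/2} A D^{-1/2} H^l Phi^l), with A already self-connected."""
    d_inv_sqrt = A.sum(dim=1).pow(-0.5)
    A_norm = d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)
    return torch.relu(A_norm @ H @ Phi)
```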
4.4.3 Training

After obtaining the concept and image representations, we calculate the class probability using Eq. 4.5. The regularized cross-entropy loss is then used to update GIPCOL's prefix vectors $\Theta$ and GNN parameters $\Phi$ as follows:

$-\frac{1}{|\mathcal{N}|}\sum_{i \in \mathcal{N}} \log p_{\boldsymbol{\theta}}(c_i \mid x) + \lambda_1 \lVert \boldsymbol{\Theta} \rVert^2 + \lambda_2 \lVert \boldsymbol{\Phi} \rVert^2$   (4.7)

where $\lambda_1$ and $\lambda_2$ are hyper-parameters controlling the weight decay for the prefix vectors and the GCN separately. GIPCOL keeps CLIP's pre-trained textual and visual encoders fixed during training. More details about the training process can be found in Alg. 2.

Algorithm 2: GIPCOL
1: Initialize GIPCOL using CLIP's pre-trained textual and visual encoders.
2: Update the element concepts' representations using the GNN as in Equation 4.1 and Equation 4.6.
3: Construct the textual prompt for the compositional labels using the updated element concepts and the learnable prefix vectors as in Equation 4.2.
4: Extract and normalize the image/text vectors using CLIP's image/text encoders based on Equation 4.3 and Equation 4.4, respectively.
5: Calculate the class probability as in Equation 4.5 using the cosine similarity, and update GIPCOL's soft-prompting layer $\Theta$ and GNN layer $\Phi$ using the cross-entropy loss.

4.4.4 Inference

During inference, given an image, we first construct the soft prompts for all target concepts using the fine-tuned prefix vectors and GNN. Then we use CLIP's frozen textual and visual encoders to obtain the image vector $x$ and the target concept vector set $\mathcal{C}_{target}$. Finally, we use the cosine similarity to select the most similar attr-obj pair from $\mathcal{C}_{target}$ as the compositional label:

$\hat{c} = \arg\max_{c_i \in \mathcal{C}_{target}} \cos(c_i, x)$   (4.8)

where $c_i$ is the $i$-th compositional vector from the target set.

4.4.5 CLIP-Prompting Method Comparison

In this section, we clarify the differences among the CLIP-prompting methods used in CZSL, as shown in Fig. 4.5. Generally, all current CLIP-prompting methods keep the image representation fixed and learn to construct CLIP's textual prompt to represent the compositional concept, as in Eq. 4.2. The main difference is that CSP [Nayak et al., 2022] learns the element embeddings, COOP [Zhou et al., 2022a] learns the prefix vectors, and PromptCompVL learns both the element embeddings and the prefix vectors. None of these three methods explicitly considers the compositional structure between concepts. In order to inject more semantic information into the soft prompt, CoCoOP [Zhou et al., 2022b] introduces a Meta-Net and modifies the prefix vectors based on each image input. It uses instance-level information rather than the global compositional information for CZSL. Such instance-level prompting also makes training inefficient and consumes a significant amount of computing resources, as discussed in that work. Different from all previous methods, GIPCOL proposes a novel prompting strategy by combining the learnable prefix vectors and the GNN module; a detailed comparison is given in Append. ??.

Figure 4.5 Different prompting strategies. GIPCOL combines both the soft prefix vectors and a GNN for prompt construction.

Although both CGE [Naeem et al., 2021] and GIPCOL use a GNN to encode compositional concepts, the GNN module functions in a fundamentally different manner in the two models: the GNN in GIPCOL helps construct the soft prompt for CZSL, whereas the GNN in CGE plays the role of the text encoder that projects the concept into the embedding space. GIPCOL freezes CLIP's textual and visual encoders to utilize CLIP's multi-modal aligning ability for CZSL, which is more efficient. In contrast, CGE needs to train both the GNN and the visual encoder to obtain competitive performance, as compared in Fig. 4.4.
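A minimal sketch of the training objective in Eq. 4.7 (Sec. 4.4.3): cross-entropy over the candidate-pair scores plus separate weight-decay terms for the prefix vectors Θ and the GNN parameters Φ. The function signature and hyper-parameter values are illustrative assumptions, not GIPCOL's exact code.

```python
import torch
import torch.nn.functional as F

def gipcol_loss(logits: torch.Tensor, targets: torch.Tensor,
                prefix: torch.Tensor, gnn_params, lam1: float = 1e-5, lam2: float = 1e-5):
    """Eq. 4.7: CE over seen pairs + lambda_1 ||Theta||^2 + lambda_2 ||Phi||^2."""
    # logits: [batch, K] pre-softmax cosine/temperature scores; targets: [batch] gold pair indices.
    ce = F.cross_entropy(logits, targets)              # -1/|N| sum log p(c_i | x)
    reg_prefix = prefix.pow(2).sum()                   # ||Theta||^2
    reg_gnn = sum(p.pow(2).sum() for p in gnn_params)  # ||Phi||^2
    return ce + lam1 * reg_prefix + lam2 * reg_gnn
```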
4.5 Experiments

4.5.1 Experimental Setting

Datasets. We conduct experiments on three datasets: MIT-States [Isola et al., 2015a], UT-Zappos [Yu and Grauman, 2014] and C-GQA [Naeem et al., 2021]. MIT-States and C-GQA consist of images with objects and their attributes in the general domain. In contrast, UT-Zappos contains images of shoes paired with their material attributes, which makes it a more domain-specific dataset. Our experiments follow previous works [Purushwalkam et al., 2019b, Naeem et al., 2021] on the data splits for training and testing. More details about the data splits and statistics can be found in Tab. 4.1.

                      MIT-States   UT-Zappos   C-GQA
# Attr.               115          16          413
# Obj.                245          12          674
# Attr. × Obj.        28175        192         278362
# Train Pair          1262         83          5592
# Train Img.          30338        22998       26920
# Val. Seen Pair      300          15          1252
# Val. Unseen Pair    300          15          1040
# Val. Img.           10420        3214        7280
# Test Seen Pair      400          18          888
# Test Unseen Pair    400          18          923
# Test Img.           19191        2914        5098
Table 4.1 Dataset statistics for MIT-States, UT-Zappos and C-GQA.

Implementation details. We build on the codebases of [Nayak et al., 2022]5 and [Naeem et al., 2021]6 for GIPCOL's implementation. Moreover, for a fair comparison, the length of the prefix, $m$, is set to 3, which is the same length as CLIP's hard prompt 'a photo of'. The soft-prompt dimension $d$ is set to 768, consistent with CLIP's model setting. We use a two-layer GCN to encode the concepts, and the corresponding learnable GNN parameters are $\Phi = \{\Phi_1, \Phi_2\}$. Our code will be made publicly available on GitHub7.
5 https://github.com/BatsResearch/csp
6 https://github.com/ExplainableML/czsl
7 https://github.com/HLR/GIPCOL

Evaluation Metrics. Zero-shot models are biased towards the seen classes, as shown in previous works [Chao et al., 2016, Mancini et al., 2021]. Following standard practice in zero-shot learning, we add a scalar value to the scores of the unseen classes to adjust the bias towards the seen classes, as used in [Purushwalkam et al., 2019b, Nayak et al., 2022]. By varying the added bias from $-\infty$ to $+\infty$, we report GIPCOL's performance using the following four metrics in both the closed-world and the open-world settings discussed in Sec. 4.3: 1) Best seen accuracy (S), testing only on seen compositions when the bias is $-\infty$; 2) Best unseen accuracy (U), testing only on unseen compositions when the bias is $+\infty$; 3) Best harmonic mean (HM), which balances the performance between seen and unseen accuracies; 4) Area Under the Curve (AUC), the area below the seen-unseen accuracy curve obtained by varying the scalar added to the unseen compositional concepts.

Baselines. We compare GIPCOL with two types of baselines: 1) non-CLIP methods (the top seven models in the closed setting and top six in the open setting), namely Attributes as Operators (AoP) [Nagarajan and Grauman, 2018b], Label Embed+ (LE+) [Misra et al., 2017a], Task Modular Networks (TMN) [Purushwalkam et al., 2019b], SymNet [Li et al., 2020b], Compositional Graph Embeddings (CGE) [Naeem et al., 2021], Compositional Cosine Logits (CompCos) [Mancini et al., 2021] and Siamese Contrastive Embedding Network (SCEN) [Li et al., 2022]; 2) CLIP-based methods (the bottom three models), namely CLIP [Radford et al., 2021], Context Optimization (COOP) [Zhou et al., 2022a] and Compositional Soft Prompting (CSP) [Nayak et al., 2022].
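To make the bias-sweep metrics described under Evaluation Metrics above concrete, here is a rough sketch of how best-seen, best-unseen, best harmonic mean, and AUC can be computed by sweeping the calibration scalar added to unseen-pair scores. The grid of bias values and the array layout are illustrative assumptions rather than the evaluation code used in the experiments.

```python
import numpy as np

def sweep_metrics(scores, labels, unseen_mask, num_steps: int = 200):
    """scores: [N, P] pair scores; labels: [N] gold pair ids; unseen_mask: [P] bool for unseen pairs."""
    is_unseen_sample = unseen_mask[labels]                 # test images whose gold pair is unseen
    span = np.abs(scores).max() + 1.0
    seen_acc, unseen_acc = [], []
    for b in np.linspace(-span, span, num_steps):
        shifted = scores + b * unseen_mask.astype(float)   # add calibration bias to unseen pairs
        correct = shifted.argmax(axis=1) == labels
        seen_acc.append(correct[~is_unseen_sample].mean())
        unseen_acc.append(correct[is_unseen_sample].mean())
    seen_acc, unseen_acc = np.array(seen_acc), np.array(unseen_acc)
    hm = 2 * seen_acc * unseen_acc / np.clip(seen_acc + unseen_acc, 1e-8, None)
    order = np.argsort(seen_acc)
    auc = np.trapz(unseen_acc[order], seen_acc[order])     # area under the seen-unseen curve
    return seen_acc.max(), unseen_acc.max(), hm.max(), auc
```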
Feasibility Calibration in the Open-World Setting. Open-world CZSL is more challenging than the closed-world setting, as the class space contains all possible combinations of attributes and objects, including both feasible and infeasible compositions. In order to filter out the infeasible compositions, we apply the feasibility calibration used in [Mancini et al., 2021, Nayak et al., 2022]. For each unseen pair $(a, o)$, we first collect two sets from the training data: the applicable attribute set $A = \{a_1, a_2, \ldots, a_M\}$ for the target object $o$, and the applicable object set $O = \{o_1, o_2, \ldots, o_N\}$ for the target attribute $a$, where $(a_i, o)$ and $(a, o_j)$ have been observed at training time. Then we calculate the similarity between $a$ and each element in $A$ and use the maximum similarity score as this pair's attribute feasibility score:

$f_a(a, o) = \max_{(a_i, o) \in \mathcal{C}_{seen}} \frac{e(a) \cdot e(a_i)}{\lVert e(a) \rVert \lVert e(a_i) \rVert}$   (4.9)

where $e$ is the GloVe embedding [Pennington et al., 2014b]. The pair's object feasibility score $f_o$ is calculated in a similar way based on the applicable object set. Finally, the unseen pair's feasibility score is the average of the two scores, $\frac{f_a + f_o}{2}$. After obtaining the feasibility scores for all unseen pairs, we can filter out infeasible compositions by setting a threshold $T$, which is tuned on the validation set to remove less feasible compositions from the output space; the thresholds adopted by GIPCOL are shown in Tab. 4.2. Different from the closed-world setting, we require the feasibility score of the predicted label to be larger than the threshold, and the final prediction for image $x$ in the open-world setting is computed as:

$\hat{c} = \arg\max_{c_i \in \mathcal{C}_{target},\ f(c_i) \geq T} \cos(c_i, x)$   (4.10)

Dataset       Feasibility Score
MIT-States    0.40691
UT-Zappos     0.51878
C-GQA         0.49941
Table 4.2 GIPCOL's feasibility threshold scores.
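A rough sketch of the feasibility calibration in Eqs. 4.9–4.10, assuming pre-computed GloVe vectors stored in a dictionary and a list of seen training pairs; the names and data structures are illustrative.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def feasibility_score(attr, obj, seen_pairs, glove):
    """Eq. 4.9 and its object-side counterpart, averaged: f = (f_a + f_o) / 2."""
    seen_attrs_for_obj = [a for (a, o) in seen_pairs if o == obj]   # attributes observed with this object
    seen_objs_for_attr = [o for (a, o) in seen_pairs if a == attr]  # objects observed with this attribute
    f_a = max((cosine(glove[attr], glove[a]) for a in seen_attrs_for_obj), default=0.0)
    f_o = max((cosine(glove[obj], glove[o]) for o in seen_objs_for_attr), default=0.0)
    return 0.5 * (f_a + f_o)

# Open-world prediction (Eq. 4.10): take the argmax of cos(c_i, x) over pairs whose score >= T.
```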
4.5.2 Results

Results on MIT-States. As shown in Tab. 4.3 and Tab. 4.4, GIPCOL achieves new SoTA results on MIT-States in both the closed-world and open-world settings compared with the CLIP and non-CLIP baselines (except for the best-unseen metric (U)). The CLIP-based models have consistently better performance than the non-CLIP methods.8 CLIP-prompting methods, including COOP, CSP and ours, further boost the performance compared to the vanilla CLIP model.
8 In principle, CLIP-based and non-CLIP-based methods cannot be directly compared, as we have no information about the training data used for CLIP training. Here we follow previous work and include these baselines for the sake of comparison and consistency with the previous work.

Results on UT-Zappos. On UT-Zappos, previous CLIP-based approaches under-perform the SoTA performance achieved by CGE, which is a non-CLIP model. However, GIPCOL successfully surpasses the CGE model. Note that UT-Zappos is a domain-specific dataset that consists of shoe types and materials. There may be two reasons for the accuracy drop of CLIP-based methods: 1) CLIP does not see many images from this domain during its training; 2) as fashion data, there is an appearance shift between CLIP's training data and UT-Zappos' test data. We suspect that CLIP may not have seen sufficiently similar samples from this specific domain, and therefore purely tuning the prompt is not enough to solve the problem. In contrast, GIPCOL adds additional compositional information to learn the element concept embeddings, which appears to boost the compositional learning ability within this specific domain.

Results on C-GQA. On the more challenging C-GQA dataset, GIPCOL also achieves new SoTA results in both the closed- and open-world settings, with an exception for the seen accuracy in the open world. However, the key metric is AUC, which is consistently higher for GIPCOL in all settings.

Method                                   MIT-States              UT-Zappos               C-GQA
                                         S    U    H    AUC      S    U    H    AUC      S     U    H    AUC
AoP [Nagarajan and Grauman, 2018b]       14.3 17.4 9.9  1.6      59.8 54.2 40.8 25.9     17.0  5.6  5.9  0.7
LE+ [Misra et al., 2017a]                15.0 20.1 10.7 2.0      53.0 61.9 41.0 25.7     18.1  5.6  6.1  0.8
TMN [Purushwalkam et al., 2019b]         20.2 20.1 13.0 2.9      58.7 60.0 45.0 29.3     23.1  6.5  7.5  1.1
SymNet [Li et al., 2020b]                24.2 25.2 16.1 3.0      49.8 57.4 40.4 23.4     26.8  10.3 11.0 2.1
CompCos [Mancini et al., 2021]           25.3 24.6 16.4 4.5      59.8 62.5 43.1 28.7     28.1  11.2 12.4 2.6
CGE [Naeem et al., 2021]                 32.8 28.0 21.4 6.5      64.5 71.5 60.5 33.5     33.5  15.5 16.0 4.2
SCEN [Li et al., 2022]                   29.9 25.2 18.4 5.3      63.5 63.1 47.8 32.0     28.9  25.4 17.5 5.5
CLIP [Radford et al., 2021]              30.2 40.0 26.1 11.0     15.8 49.1 15.6 5.0      7.5   25.0 8.6  1.4
COOP [Zhou et al., 2022a]                34.4 47.6 29.8 13.5     52.1 49.3 34.6 18.8     20.5  26.8 17.1 4.4
CSP [Nayak et al., 2022]                 46.6 49.9 36.3 19.4     64.2 66.2 46.6 33.0     28.8  26.8 20.5 6.2
GIPCOL (Ours)                            48.5 49.6 36.6 19.9     65.0 68.5 48.8 36.2     31.92 28.4 22.5 7.14
Table 4.3 Closed-world CZSL results on the MIT-States, UT-Zappos and C-GQA datasets.

Comparing GIPCOL with other CLIP-based methods. Besides the absolute SOTA improvement on MIT-States, another interesting observation is that GIPCOL achieves a consistent improvement over other CLIP-based methods on MIT-States in both settings and on UT-Zappos in the closed setting. This empirically shows the effectiveness of introducing both soft embeddings and soft prompting in CZSL. Compared with CSP [Nayak et al., 2022], we only introduce 3 additional learnable prompt vectors and obtain satisfactory improvements on MIT-States, which shows the importance of soft prompting: it re-programs CLIP for CZSL. For the soft embeddings, we learn the element concept embeddings instead of using CLIP's fixed embeddings, which is better for compositional learning compared with COOP [Zhou et al., 2022a]. We give some qualitative analysis in the next section. A CZSL dataset is usually a multi-label dataset, which means we can describe an object along different dimensions; for example, for giraffe in C-GQA, we can describe its color or its size, and both compositions should be correct. Therefore, we need to develop more reasonable metrics to evaluate compositional learning performance. Moreover, there exist wrongly labeled items, such as black paint in the last row, meaning we also need a cleaner benchmark for CZSL.

4.5.3 Qualitative Analysis

Predicted Examples. We looked into a number of randomly selected predictions from GIPCOL, shown in Fig. 4.6. The red-colored texts are the ground-truth labels, the blue-colored texts are GIPCOL's correctly predicted labels, and the black-colored texts are GIPCOL's wrongly predicted labels. The first two columns present examples with correctly predicted compositional labels, and the last two columns show the wrongly predicted labels, either wrong in attributes or wrong in objects.
Method                                   MIT-States              UT-Zappos               C-GQA
                                         S    U    H    AUC      S    U    H    AUC      S    U    H    AUC
AoP [Nagarajan and Grauman, 2018b]       16.6 5.7  4.7  0.7      50.9 34.2 29.4 13.7     -    -    -    -
LE+ [Misra et al., 2017a]                14.2 2.5  2.7  0.3      60.4 36.5 30.5 16.3     -    -    -    -
TMN [Purushwalkam et al., 2019b]         12.6 0.9  1.2  0.1      55.9 18.1 21.7 8.4      -    -    -    -
SymNet [Li et al., 2020b]                21.4 7.0  5.8  0.8      53.3 44.6 34.5 18.5     19.2 0.7  1.0  0.08
CompCos [Mancini et al., 2021]           25.4 10.0 8.9  1.6      59.3 46.8 36.9 21.3     26.7 2.2  3.3  0.43
CGE [Naeem et al., 2021]                 32.4 5.1  6.0  1.0      61.7 47.7 39.0 23.1     32.1 1.8  2.9  0.47
CLIP [Radford et al., 2021]              30.1 14.3 12.8 3.0      15.7 20.6 11.2 2.2      7.5  4.6  4.0  0.27
COOP [Zhou et al., 2022a]                34.6 9.3  12.3 2.8      52.1 31.5 28.9 13.2     21.0 4.6  5.5  0.70
CSP [Nayak et al., 2022]                 46.3 15.7 17.4 5.7      64.1 44.1 38.9 22.7     28.7 5.2  6.9  1.20
GIPCOL (Ours)                            48.5 16.0 17.9 6.3      65.0 45.0 40.1 23.5     31.6 5.5  7.3  1.30
Table 4.4 Open-world CZSL results on the MIT-States, UT-Zappos and C-GQA datasets.

From this figure, we can see that GIPCOL can recognize the objects in most of the compositions in the MIT-States and C-GQA datasets. However, it has difficulty precisely predicting the attributes for these two datasets; for example, it predicts modern clock instead of ancient clock, which is the antonym of the actual attribute. For UT-Zappos, the more domain-specific dataset, GIPCOL even has difficulty recognizing the objects.

Figure 4.6 We show the top-3 predictions of our proposed model for some images. Red colors are ground-truth labels, blue colors are correctly predicted labels and black colors are wrongly predicted labels.

Differences in Domains. In this section, we try to explain why GIPCOL works in CZSL by examining CLIP's training data. From Tables 4.3 and 4.4, we observe that CLIP without any prompt tuning can achieve better performance than the non-CLIP models on the MIT-States dataset, but not on the UT-Zappos dataset. We hypothesize that this issue is related to the distribution difference between the pre-training data used by CLIP and the data domain of the downstream task. To validate this hypothesis, we further look into some concrete examples from MIT-States and UT-Zappos. We take burnt boat from MIT-States and Faux Fur-Shoes Clogs and Mules from UT-Zappos for comparison, as shown in Fig. 4.7. From this figure, we can see that MIT-States images have a similar visual appearance to CLIP's pre-training data. However, for UT-Zappos, because of changes in fashion style over time, the shoes have a significantly different visual appearance between the pre-training dataset and the target dataset. The results in Tab. 4.3 and Tab. 4.4 show that domain similarity plays an important role in prompting-based methods: prompting CLIP without any training achieves better performance on MIT-States than on UT-Zappos.
GIPCOL partially addresses this challenge through its prompt design, as the results indicate. CLIP's training data are not publicly available. However, LAION-400M [Schuhmann et al., 2021] used the released CLIP model and obtained the closest 400M image-text pairs9 from their web-crawled dataset by reverse engineering, and we base our analysis on this constructed LAION-400M subset. By querying LAION-400M with burnt boat, we can retrieve about 600 relevant images; by querying with Faux Fur_Shoes Clogs and Mules, we can retrieve about 200 relevant images. The first interesting difference is in the quantity of retrieved relevant images, which is significantly lower for the shoe dataset. The second difference is in data quality: as can be seen from Fig. 4.7, the retrieved shoes are less similar to UT-Zappos' shoes when compared to the similarity of the retrieved boats to MIT-States boats. We note that UT-Zappos is about shoe fashion and was constructed in 2014, while CLIP is pre-trained on more recent images from around 2020. The change in fashion trends has made the images look different for the same compositional concept. Based on these observations, it is evident that the quantity and quality of CLIP's pre-training data play an important role in its performance.
9 https://rom1504.github.io/clip-retrieval

Figure 4.7 Comparison between retrieved images from LAION-400M and UT-Zappos/MIT-States.

Covering the Performance Gap. Despite the above-mentioned issues, GIPCOL improves performance on the UT-Zappos dataset. While we found that CLIP's pre-training data is important for its performance in the zero-shot setting, introducing the additional compositional knowledge in GIPCOL positively impacts CLIP's ability to recognize the novel compositional concepts. GIPCOL uses a GNN to inject compositional information into the concept representations, which turned out to be helpful. The improvement is important, especially for UT-Zappos, which is a special domain without many examples similar to CLIP's training data.

t-SNE Comparison between CLIP and GIPCOL. The compositional concepts learned by GIPCOL and CLIP are visualized separately in Fig. 4.8. Each figure randomly samples 5 compositional concepts with related images and draws their representations using t-SNE [Van der Maaten and Hinton, 2008]. All prompting-based methods share the same image representation because they freeze CLIP's visual encoder during training; the compositional encoding is therefore the difference among these prompting models. From the figures, we can see that GIPCOL's compositional vectors (+) are closer to the related image clusters than CLIP's vectors (-), which empirically shows that GIPCOL has better compositional encoding ability.

Figure 4.8 t-SNE comparison between CLIP and GIPCOL.

4.5.4 Ablation Study

To better understand the influence of each component in GIPCOL, Tab. 4.5 shows the performance of its variations in UT-Zappos' closed-world setting. From Table 4.5, both the GNN and soft prompting are important for GIPCOL.

Effect of GNN. We remove the GNN module and directly set the attribute and object embeddings as learnable parameters, as in [Xu et al., 2022]. The performance decreases; in particular, the AUC drops from 36.2% to 32.2%.

Effect of Learnable Prefix Vectors. Another variant of GIPCOL is to fix the prefix vectors and only tune the GNN module to update the class embeddings.
Model                   S     U     H     AUC
GIPCOL                  65.0  68.5  48.8  36.2
- without GNN           64.4  64.0  46.12 32.2
- without prefix        64.7  62.3  45.9  31.0
- without both (CLIP)   15.8  49.1  15.6  5.0
Table 4.5 Performance of GIPCOL's variations.

From Tab. 4.5, we can see that the learnable prefix vectors play a more important role than the GNN. In fact, adding the prefix vectors changes CLIP's textual input and biases it towards compositional learning, which is a key component of GIPCOL.

Comparison to vanilla CLIP. Although CLIP has seen many of the compositional concepts during training, applying CLIP directly does not achieve satisfactory results in CZSL. This result shows the importance of prompt learning in CZSL.

4.5.5 Higher-Order Compositional Learning

Previous work (CSP) [Nayak et al., 2022] introduced another challenging dataset, AAO-MIT-States, a subset derived from MIT-States, to evaluate higher-order compositional learning ability in the form of attribute-attribute-object (AAO) compositions. After learning the prefix vectors and the GNN-encoded element concepts, GIPCOL can easily be adapted to solve AAO by modifying the compositional prompt to $(\theta_1, \theta_2, ..., \theta_m, \hat{a}_i, \hat{a}_j, \hat{o}_k)$ to represent the higher-order compositions. We report the AAO results in Tab. 4.6. We can see that GIPCOL has better higher-order compositional learning ability, with a 3% absolute improvement over CSP.

Model           Accuracy
CLIP            62.7
CSP             72.6
GIPCOL (Ours)   75.9
Table 4.6 AAO performance of different CLIP-based models.
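A small sketch of the higher-order adaptation described above: the soft prompt is simply extended with a second GNN-updated attribute embedding before the object embedding. The tensor names follow the hypothetical scorer sketched earlier and are not GIPCOL's exact interface.

```python
import torch

def build_aao_prompt(prefix: torch.Tensor, attr_i: torch.Tensor,
                     attr_j: torch.Tensor, obj_k: torch.Tensor) -> torch.Tensor:
    """(theta_1..theta_m, a_i, a_j, o_k): attribute-attribute-object soft prompt."""
    # prefix: [m, d]; attr_i, attr_j, obj_k: [d] GNN-updated concept embeddings.
    return torch.cat([prefix, attr_i.unsqueeze(0), attr_j.unsqueeze(0), obj_k.unsqueeze(0)], dim=0)
```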
4.6 Conclusion

In this chapter, we propose GIPCOL, a new CLIP-based prompting framework, to address the compositional zero-shot learning (CZSL) problem. The goal is to recognize compositional concepts of objects with their states and attributes as depicted in images. The objects and attributes have been observed during training in some compositions; however, the test-time compositions can be novel and unseen. We introduce a novel prompting strategy for soft prompt construction by treating element concepts as part of a global GNN that encodes feasible compositional information, including objects, attributes and their compositions. In this way, the soft-prompt representation is influenced not only by the pre-trained VLM but also by all the compositional representations in its neighborhood captured by the compositional graph. Our results show that GIPCOL performs better and achieves SoTA AUC results on all three benchmarks, including MIT-States, UT-Zappos, and C-GQA. These results demonstrate the advantages and limitations of prompting large vision-language models (such as CLIP) for compositional concept learning.

CHAPTER 5
METAREVISION: META-LEARNING WITH RETRIEVAL FOR VISUALLY GROUNDED COMPOSITIONAL CONCEPT ACQUISITION

Humans have the ability to learn novel compositional concepts by recalling and generalizing primitive concepts acquired from past experiences. Inspired by this observation, in this chapter, we propose MetaReVision, a retrieval-enhanced meta-learning model to address the visually grounded compositional concept learning problem. The proposed MetaReVision consists of a retrieval module and a meta-learning module, which are designed to incorporate retrieved primitive concepts as a supporting set to meta-train vision-language models for grounded compositional concept recognition. Through meta-learning from episodes constructed by the retriever, MetaReVision learns a generic compositional representation that can be rapidly updated to recognize novel compositional concepts. We create CompCOCO and CompFlickr to benchmark grounded compositional concept learning. Our experimental results show that MetaReVision outperforms other competitive baselines and that the retrieval module plays an important role in this compositional learning process.1
1 MetaReVision: Meta-Learning with Retrieval for Visually Grounded Compositional Concept Acquisition. Guangyue Xu, Parisa Kordjamshid, Joyce Chai. EMNLP-Findings, 2023

5.1 Introduction

Learning to compose from previous experience is an integral part of human intelligence [Fodor and Pylyshyn, 1988b, Biederman and Vessel, 2006]. Generally, compositional learning refers to the ability to learn a set of basic primitives and generalize these primitives to novel scenarios different from those seen at training time [Kemp and Tenenbaum, 2009, Ontanón et al., 2021]. It includes various learning aspects, such as systematic generalization, productivity and substitutivity [Hupkes et al., 2020]. In this work, we focus on systematic generalization within the multi-modal setting and propose a multi-modal compositional problem: Grounded Compositional Concept Learning (GCCL). As shown in Figure 5.1, in the GCCL setting, the models are trained on primitive concepts, such as red and chair, from the training data. The trained models are then applied to predict novel compositional concepts, e.g., red chair, in the testing phase, although these compositions were never seen during training.

Figure 5.1 An illustration of Grounded Compositional Concept Learning (GCCL). For example, given the concepts (red, bus) and (old, chair) in the training data, the goal is to learn to predict the novel compositional concept (red, chair) as masked token prediction at test time.

The ideal vision-language system should have the compositional ability to solve the GCCL problem. Recently, significant efforts have been made in developing pre-trained vision-language models (VLMs) [Tan and Bansal, 2019, Su et al., 2020, Radford et al., 2021]. These VLMs have demonstrated impressive performance on various downstream tasks, including Visual Question Answering (VQA) [Li et al., 2020a], Vision-Language Navigation (VLN) [Hao et al., 2020] and image captioning [Zhou et al., 2020]. Despite their success in related fields, it remains unclear whether these models can truly perceive the world in a compositional manner or generate language compositionally to cooperate with humans in a shared physical world. Such composition-related questions are important from both the theory and the application perspectives. From the theory perspective, compositional learning allows the model to process and understand objects by breaking them down into smaller, interpretable units; therefore, compositional learning helps improve large models' efficiency and generalization [Andreas et al., 2016]. From the application perspective, it is not realistic to give the model all possible compositions in the training data. For example, in Vision-Language Navigation (VLN), it is not feasible to observe a sofa in all possible colors, e.g., red sofa and blue sofa.
The vision-language models applied in VLN are expected to recognize these compositions after learning the element concepts 2. Compositional learning can be viewed as a special case of zero-shot learning problems. Moreover, the domain-shift problem is commonplace in zero-shot learning because the statistical distribution of the data in the training set (seen compositions) and the testing set (novel compositions) could be significantly different. While compositionality can be reliably interpreted by humans, State-of-the-art VLMs, which are trained on vast amounts of image-text pairs and employ diverse loss functions, still encounter challenges in compositional learning [Ma et al., 2023a, Thrush et al., 2022]. To address these limitations, this thesis takes a closer look at the compositionality in VLM with an attempt to improve its ability. More specifically, we create two grounded compositional con- cept learning datasets, CompFlickr and CompCOCO curated from MSCOCO [Chen et al., 2015] and Flickr30K [Plummer et al., 2015], for VLMs’ token-level compositional analysis. Moreover, we present MetaReVision, Meta-Learning with Retrieval for Visually Grounded Compositional Concept Acquisition, a retrieval-enhanced meta-learning framework for compositional concept acquisition, which introduces retriever into GCCL. The retrieval mechanism plays a crucial role in human learning. It facilitates long-term retention, understanding enhancement, and knowl- edge transfer during the learning process, which have been discussed by a large body of studies in cognitive science [Karpicke and Blunt, 2011, Karpicke, 2012]. To mimic such human’s re- trieving behavior[Roediger and Butler, 2011, Karpicke and Roediger III, 2008], MetaReVision retrieves relevant primitive concepts from a pre-constructed concept database and provides them as support evidence to do meta-learning for compositional concept learning. MetaReVision fol- lows a Learn-Retrieval-Compose framework. It shares the compositional learning burden between VLMs and the retriever. Through meta-learning from the episodes constructed by the retriever, MetaReVision learns a generalized compositional representation that can be fast updated for novel compositional recognition. We evaluate MetaReVision on the proposed CompFlickr and Comp- COCO datasets. The empirical results show that coupling retrieval and meta-learning performs 2Element concepts are also called primitive concepts in our setting. We use them interchangeably in this work. 52 better in GCCL compared with previous baselines. Contributions of this work can be summarized as follows: • This work explores a novel angle of retrieval-enhanced compositional concept learning. The model relies on retrieval to construct episodes for meta-learning. It addresses the domain-shift problem in compositional learning by learning from the retrieved instances. • Two datasets are created to serve as benchmarks for grounded compositional concept learning. These datasets enrich existing zero-shot vision-language tasks, from the end-task level to the token-level. • Our experiments show that MetaReVision demonstrates stronger performance in GCCL, especially in the novel setting. This empirically shows the effectiveness of combining retrieval and meta-learning techniques in the context of grounded compositional learning. 5.2 Related Works Meta-Learning also known as learning to learn, aims to solve a low-resource problem by leveraging the learned experience from a set of related tasks. 
Meta-learning algorithms deal with the problem of efficient learning so that they can learn new concepts or skills fast with just a few seen examples (few-shot setting) or even without seen examples (zero-shot setting). Different from the typical meta-learning scenario where the training and test episodes are given in advance in few-shot learning [Sung et al., 2018a, Snell et al., 2017b, Nichol et al., 2018a, Finn et al., 2017], in GCCL, we need to construct episodes to employ meta-learning methods for compositional concept learning. In MetaReVision, we introduce a retriever to actively construct episodes to help compositional concept learning. During the test time, with additional retrieved support items, MetaReVision can further fast-update VLMs for current compositional concept recognition in the query set. This test-time fine-tuning is different from previous works which apply meta-learning in the zero-shot setting[Conklin et al., 2021]. Retrieval-Enhanced Learning. Retrieving related instances from a database, either the training set or external knowledge base, has been widely applied in tasks such as language modeling [Khandelwal et al., 2019], reinforcement learning [Goyal et al., 2022] and language tasks such as 53 NER [Wang et al., 2021]. Instead of distilling all training information into the model’s parameters through gradient updates, retrieval-enhanced learning introduces a retriever to find related instances and based on these instances conduct further learning. For example, kNN-LM [Khandelwal et al., 2019] extends the pre-trained language model by linearly interpolating its next word distribution with a retrieval module. This design shows effective domain adaptation ability. [Wang et al., 2021] finds external contexts for the target instance by retrieving a set of semantically relevant texts to fine-tune the CRF module to address the NER problem. These studies highlight the significance of actively recalling information from a database to enhance learning outcomes. The general scheme of such methods is to combine a parametric model with a non-parametric retrieval system [Long et al., 2022]. Different from these settings, in GCCL, we train our own concept retriever and show retrieval’s importance in compositional learning. Compositional Learning. Recent research suggests that compositionality remains a challenge for state-of-the-art (SoTA) neural models such as Transformers and Graph Neural Networks [Nikolaus et al., 2019, Hupkes et al., 2020, SHAO et al., 2023]. To tackle this challenge, inspired by symbolic AI, some works try to add structural constraints into neural models [Bergen et al., 2021]. There are also some attempts to generate new data for the compositions [Naeem et al., 2023, Xian et al., 2018b]. Also, there have been noteworthy advancements in vision-language benchmarks that focus on probing and enhancing VLM’s compositional abilities recently [Eisenschlos et al., 2023, Thrush et al., 2022, Ruis et al., 2020, Ma et al., 2023a]. Nevertheless, these works build end tasks in a compositional manner. They emphasize the performance of these compositional end tasks without giving consideration to the token-level compositional ability. However, GCCL targets VLM’s token-level compositional ability. Moreover, different from symbolic and data-augment solutions, MetaReVision explores the retrieval method to solve the compositional problem. 
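As a concrete illustration of the retrieval-enhanced idea discussed above, the kNN-LM style interpolation of a language model's next-word distribution with a retrieval-based distribution can be sketched as follows. This is a minimal sketch only; the neighbor weighting, temperature, and interpolation weight are illustrative assumptions, not the exact settings of [Khandelwal et al., 2019].

```python
import numpy as np

def knn_lm_interpolate(p_lm, neighbor_tokens, neighbor_dists, vocab_size,
                       lam=0.25, temperature=1.0):
    """Interpolate an LM's next-token distribution with a retrieval-based
    distribution built from k nearest neighbors (kNN-LM style)."""
    # Turn retrieval distances into weights over the retrieved neighbors.
    weights = np.exp(-np.asarray(neighbor_dists, dtype=float) / temperature)
    weights /= weights.sum()

    # Aggregate neighbor weights onto the next tokens they predict.
    p_knn = np.zeros(vocab_size)
    for tok, w in zip(neighbor_tokens, weights):
        p_knn[tok] += w

    # Linear interpolation of the two distributions.
    return lam * p_knn + (1.0 - lam) * np.asarray(p_lm, dtype=float)

# Toy usage with a 5-word vocabulary.
p_lm = np.array([0.10, 0.20, 0.30, 0.25, 0.15])   # base LM distribution
neighbors = [2, 2, 4]                              # next tokens of retrieved items
dists = [0.1, 0.4, 0.9]                            # their retrieval distances
p = knn_lm_interpolate(p_lm, neighbors, dists, vocab_size=5)
print(p, p.sum())                                  # still sums to 1.0
```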
5.3 Grounded Compositional Concept Learning (GCCL)
In this section, we start by introducing the setting of Grounded Compositional Concept Learning (GCCL) and then introduce the benchmarks we curated for this problem.
5.3.1 Problem Definition
Existing VLMs try to learn a generic representation for multi-modal tokens in different contexts. These VLMs are expected to obtain generic token representations that have strong transfer ability for downstream tasks. We consider a setting that directly examines whether VLMs have the ability to acquire compositional meanings of tokens through the lens of language modeling. Different from task-level compositional studies, GCCL approaches the compositional problem from the token level and investigates whether VLMs possess the capability to acquire the compositional meanings of tokens.
Figure 5.2 GCCL task definition. Red highlights seen compositional concepts and blue highlights novel compositional concepts.
Figure 5.2 shows an example of the GCCL task. Given a set of image-caption pairs with the compositional concepts masked out from the caption, the model is tasked to learn the concept representations and predict the masked compositional concept conditioned on the contextual information. The learned model is then applied in the testing phase on both novel compositions as well as seen compositions. The model is evaluated based on its ability to learn novel compositions while maintaining (i.e., not forgetting) seen compositions. Formally, we are given a set of text-image pairs $\{(x_{cap}, x_{img})\}_{i=1}^{n}$, where $x_{img} \in \mathcal{I}$ is an image with annotated bounding boxes and $x_{cap} \in \mathcal{T}$ is the caption with the compositional concepts replaced by MASK. The objective of GCCL is to predict the masked tokens based on the contextual information [Ma et al., 2023b, Jin et al., 2020]. Therefore, for the bounding boxes, only the locations are considered as input, not their label information. A model capable of solving GCCL can be described as a function $f: \mathcal{I} \times \mathcal{T} \rightarrow \mathcal{V}_{attr} \times \mathcal{V}_{obj}$, where $\mathcal{V}_{attr} \times \mathcal{V}_{obj}$ is the space of target compositional concepts, which could be either adjective + noun pairs or noun + verb pairs. Based on whether the compositions have been seen during training, GCCL can be categorized into seen compositional testing and novel compositional testing. The desired compositional VLMs should achieve improved novel performance without sacrificing the seen performance.
5.3.2 GCCL Dataset Creation
We build GCCL's benchmarks, CompCOCO and CompFlickr, from MSCOCO [Chen et al., 2015] and Flickr30K [Plummer et al., 2015]. We use the same data split introduced by [Nikolaus et al., 2019]. Their work studies the composition ability of image captioning systems by selecting 24 pairs as novel compositions and removing all images related to these 24 pairs from the training dataset. This ensures that novel compositions have never been seen during training. Other works adopt the same data split for compositional learning studies. For example, [Jin et al., 2020] utilized this split to check current VL models' compositional ability on phrases under the continual learning setting. However, in [Jin et al., 2020]'s work, most of the extracted phrases are in the form of article + noun, like the car and a man. They are single objects instead of compositional concepts. Such phrase evaluation is not a good setting for compositional learning.
In order to evaluate the token-level compositional ability, we develop two benchmarks, CompCOCO and CompFlickr, to address the above limitation. Concretely, after parsing the captions using Stanza [Qi et al., 2020], we use a number of rules to collect and mask the compositional concepts; the details are shown in Figure 5.3. Based on the Stanza parses, we can extract compositional pairs using these rules. Compared with [Jin et al., 2020]'s phrase extraction rule, MetaReVision extracts more reasonable compositional pairs. Finally, the dataset is divided into 4 parts: a training set without novel compositions, a validation set with both seen and novel compositions for hyper-parameter tuning and model selection, a seen test set, and a novel test set. The detailed statistics of novel compositions for these two datasets are shown in Table 5.1. This table shows the statistics of the extracted novel compositional concepts. From the table, we can see that CompCOCO has more novel pairs than CompFlickr, and CompCOCO is therefore a more reliable evaluation for novel compositional learning than CompFlickr.
MSCOCO: Train Img. Train Caps. Test Img. Test Caps. | Flickr30K: Train Img. Train Caps. Val Img. Val Caps. Test Img. Test Caps.
black bird, small dog, white boat, big truck, eat horse, stand child, white horse, big cat, blue bus, small table, hold child, stand bird, brown dog, small cat, white truck, big plane, ride woman, fly bird, black cat, big bird, red bus, small plane, eat man, lie woman
205 681 373 417 212 1288 264 184 276 261 1328 532 613 252 262 967 595 245 840 215 566 481 555 301 323 1067 261 601 378 1556 500 216 506 296 1860 831 878 325 420 1345 674 526 1760 291 1212 833 698 388 122 316 196 191 106 577 151 103 143 134 664 260 291 149 121 357 300 132 448 123 232 158 250 144 190 481 134 288 187 741 300 108 243 154 992 406 430 183 175 494 330 283 940 169 474 279 314 194 17 360 69 28 2 1048 51 0 11 48 835 13 934 2 35 5 266 29 15 24 11 13 153 145 24 612 85 38 2 1475 100 0 16 54 1289 24 1838 3 42 5 537 53 27 34 20 20 272 278 0 11 0 0 0 38 3 0 0 1 27 0 31 0 2 0 8 0 0 0 0 0 4 1 0 12 0 0 0 57 4 0 0 1 37 0 61 0 2 0 17 0 0 0 0 0 5 2 2 17 3 1 0 26 4 1 0 1 35 0 29 0 2 0 9 0 1 0 1 0 5 4 3 33 8 1 0 36 8 1 0 1 60 0 58 0 2 0 23 0 1 0 1 0 10 8
Table 5.1 Novel pair statistics for both CompCOCO and CompFlickr. We use the same 24 pairs to verify the compositional generalization.
(a) Rules to extract adj-noun pairs. (b) Rules to extract verb-noun pairs.
Figure 5.3 Extracting rules to construct CompFlickr and CompCOCO.
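To make the extraction rules of Figure 5.3 concrete, the following is a minimal sketch of the basic adjective-noun rule using Stanza's dependency parses. It only covers the simple amod case; the full rule set in Figure 5.3 (conjunctions, copular constructions, and the verb-noun relations) is not reproduced here, and the helper name is ours.

```python
import stanza

# The English models must be downloaded once beforehand:
# stanza.download("en")
nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")

def extract_adj_noun_pairs(caption: str):
    """Return (adjective, noun) pairs found via the basic 'amod' rule."""
    pairs = []
    doc = nlp(caption)
    for sentence in doc.sentences:
        for word in sentence.words:
            # An adjective attached to its head noun by the 'amod' relation.
            if word.deprel == "amod" and word.upos == "ADJ" and word.head > 0:
                head = sentence.words[word.head - 1]  # Stanza heads are 1-indexed
                if head.upos == "NOUN":
                    pairs.append((word.text.lower(), head.text.lower()))
    return pairs

print(extract_adj_noun_pairs("A black cat is inside a white toilet."))
# Expected under this simple rule: [('black', 'cat'), ('white', 'toilet')]
```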
5.4 Meta-Learning with Retrieval for GCCL (MetaReVision)
Traditional word acquisition models typically learn a one-size-fits-all model from the entire training dataset and make predictions for each test example in the inference phase. However, GCCL is a domain-shift problem and it is desirable to learn a customized model for each novel composition. In this work, we study how to combine retrieval and meta-learning to address such customization and propose Meta-Learning with Retrieval for GCCL (MetaReVision). MetaReVision mainly consists of two modules: the retrieval module and the meta-learner, as shown in Figure 5.4. The retrieval module learns to find similar element concepts from the training data. The meta-learner organizes the retrieved items as a pseudo task to meta-tune VLMs for compositional learning. In this part, we will discuss the base VLMs, the retrieval module, and the meta-learning module in detail and answer two key questions in MetaReVision's design: 1) how to retrieve related items, and 2) how to utilize the retrieved items in the context of meta-learning.
Figure 5.4 MetaReVision architecture. The whole system includes two modules: the retriever and the meta-trained VLM. During testing, MetaReVision retrieves related instances to fast-update the VLM for novel compositional learning.
5.4.1 Vision-Language Models (VLMs)
VLBERT [Su et al., 2020] and LXMERT [Tan and Bansal, 2019] are two representative VLMs that are suitable for our GCCL setting. They represent one-stream and two-stream VLMs, respectively. The difference is that two-stream VLMs have additional self-attention layers before the cross-attention layers. We conduct experiments using these two types of VLMs to show the general effectiveness of the proposed framework. Moreover, all VLMs are trained from scratch to make sure that they do not see novel compositions during their training time.
5.4.2 Retriever and Element Concept Database
Given the compositional concepts, the ideal retriever is expected to retrieve the training examples that are the most beneficial for learning the target compositional concept. It is usually assumed that the examples that are the nearest neighbors of the query examples are the ones more likely to be beneficial for generalization [Long et al., 2022]. The GCCL retriever needs an encoder to encode the element concept, a database to organize the element concepts' information, and a mechanism to retrieve relevant concepts.
Element Concept Encoder. Given the linguistic and visual clues for the compositional concepts, the encoder acts as a function $f(x_{cap}, x_{img})$ that maps a MASK concept to a fixed-length vector in $\mathbb{R}^d$. Then, for each primitive concept in the target compositions, $f(\cdot)$ can help retrieve related primitive concepts. MetaReVision relies on these retrieved concepts to conduct further compositional learning. In this way, MetaReVision enhances its own compositional capability by augmenting the input through the retrieval procedure. The encoding function $f(\cdot)$ is the key component of the retriever. In traditional vision-language tasks, like VQA and Visual Entailment [Song et al., 2022], CLIP [Radford et al., 2021] is usually used as the encoder to encode the whole visual or textual input and help build the retriever.
However, in GCCL's token-level compositional setting, we focus on the token's representation and therefore use the VLMs as an encoder to extract the MASK concept's representation for further compositional learning. These vectors are used as keys to construct the Element Concept Database and to perform an approximate nearest neighbor search to augment compositional learning. We add a two-layer MLP and adopt Masked Language Modeling (MLM) to train the vision-language retriever. For the encoder's training, since we focus on concept acquisition, words in compositional concepts are masked with a probability of 1.0, and other words are not masked during training.
Element Concept Database. The element concept datastore DB = $\{(k_i, v_i)\}$, which is constructed offline using the above-trained vision-language encoder, uses the dense representations of masked element concepts $k = Enc(x_{cap}, x_{img}) \in \mathbb{R}^d$ as keys and the corresponding $(x_{cap}, x_{img})$ as values. To efficiently access this database, we implement the dense retriever for GCCL with an off-the-shelf retrieval engine, FAISS [Johnson et al., 2019], using a flat index (IndexFlatIP) without any training. Then, given a masked concept, we can retrieve the top-K DB items in nearly real time by calculating the cosine similarity scores between the [MASK] concept and all DB items, as follows:
$$Ret(k) = \{(k_1, \mathrm{Val}_1), \ldots, (k_M, \mathrm{Val}_M)\} \quad (5.1)$$
where $k$ is the masked concept's embedding vector, $k_i$ is the DB item's key, $\mathrm{Val}_i = (x_{cap_i}, x_{img_i})$ is the retrieved DB item's value, and $Ret$ is the retrieved DB item set. After adding the retrieval module into GCCL, the problem can be reformulated as:
$$p(v \mid x) = \underbrace{p(v \mid x, Ret(x))}_{Learner} \; \underbrace{p(Ret(x) \mid x)}_{Retrieval} \quad (5.2)$$
where $v$ is the MASK compositional concept's prediction, $x \in \mathbb{R}^d$ is the masked concept's encoded vector, and $Ret(x)$ is the set of DB items retrieved based on the vector $x$ as in Equation 5.1. The compositional learning happens at two levels: 1) retrieve related items from the DB based on the encoding vector, and 2) learn conditioned on the contextual information and the retrieved items.
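A minimal sketch of this retrieval step (Equation 5.1) with FAISS is shown below. Random vectors stand in for the encoder outputs $Enc(x_{cap}, x_{img})$, and cosine similarity is obtained by L2-normalizing the keys before adding them to an inner-product flat index; the variable names and metadata fields are illustrative, not the actual implementation.

```python
import faiss
import numpy as np

d, n_db, top_k = 768, 10_000, 4

# Stand-ins for Enc(x_cap, x_img) of every masked element concept in the DB.
keys = np.random.rand(n_db, d).astype("float32")
values = [{"caption": f"caption {i}", "image_id": i} for i in range(n_db)]  # (x_cap, x_img) metadata

# IndexFlatIP + L2 normalization => inner product equals cosine similarity.
faiss.normalize_L2(keys)
index = faiss.IndexFlatIP(d)
index.add(keys)

# Query with a masked concept's encoded vector.
query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, top_k)

# Ret(k) of Equation 5.1: the top-K (key score, value) pairs.
retrieved = [(float(s), values[i]) for s, i in zip(scores[0], ids[0])]
for score, val in retrieved:
    print(round(score, 3), val["image_id"])
```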
5.4.3 Meta-Learning for GCCL
Given the retrieved items, there are several ways to exploit these examples to facilitate compositional learning. The most direct method is fine-tuning (FT). However, because the retrieved items are noisy and FT often faces over-fitting issues when learning from a few labeled examples, FT does not help GCCL. Another choice is in-context learning [Wei et al., 2022]. However, since GCCL is a multi-modal problem with multiple image-caption pairs in the contextual input, current large multi-modal models, like LLaVA [Liu et al., 2023] and GPT-4 [Achiam et al., 2023], cannot be applied directly here. In MetaReVision, we choose the meta-learning framework to utilize the retrieved items for GCCL. Meta-learning here is used to train the base VLM with the ability to accumulate knowledge across episodes3 and build internal generic representations for tokens that are suitable for compositional learning.
3 Episodes are also called tasks in meta-learning.
Moreover, we introduce the verbalizer module to enforce that the predicted concept for the query set comes from the retrieved support items. The verbalizer helps mitigate the memorization problem in meta-learning [Yin et al., 2019]. In the following part, we will discuss episode construction, the details of MAML, and the verbalizer module used in MetaReVision.
Episode Construction. We construct GCCL tasks $\tau_i$ for meta-learning as follows:
$$\tau_i = \left( \mathcal{D}^{support}_{\tau_i}, \mathcal{D}^{query}_{\tau_i} \right), \quad (5.3)$$
where $\mathcal{D}^{support}_{\tau_i}$ indicates the support set and $\mathcal{D}^{query}_{\tau_i}$ indicates the query set. Specifically, for one task, we randomly select one compositional concept as the query set. Then we retrieve a small number of examples that are similar to the query concepts. These retrieved items make up the support set. Meta-learning's objective in GCCL is to predict the compositional concepts in the query set after learning the element concepts in the support set. Here, episodes help VLMs to accumulate compositional knowledge and learn a generic compositional representation for masked concepts at the task level instead of the instance level.
Meta-Learner. We use MAML [Finn et al., 2017] as our meta-learning algorithm. As an optimization-based method, MAML has two optimizing steps within each episode: the meta-train step and the meta-test step. In the meta-train step, MAML learns a task-specific learner $\theta'$ based on the current parameters $\theta$ and the retrieved support items $S$. In the meta-test step, MAML updates the parameters $\theta$ based on the fast-updated parameters $\theta'$ and the compositional query items $Q$, as shown in Figure 5.5. Moreover, MAML can be solved by formulating it as a bi-level optimization problem. Equation 5.2 can be extended to Equation 5.4:
$$\min_{\theta} \mathcal{L}\big(\mathrm{Alg}(\theta, \mathrm{Retriever}(S)), Q\big), \quad \text{where } \mathrm{Alg}(\theta, S) = \theta - \alpha \nabla_{\theta} \mathcal{L}(\theta, S), \quad (5.4)$$
where $\theta$ denotes the learned parameters, $\mathrm{Retriever}(S)$ stands for the retrieved DB items, $Q$ is the target compositional concept, and $\mathrm{Alg}$ represents the optimization algorithm adapting to the support instances. There are different versions of $\mathrm{Alg}$ [Nichol et al., 2018b, Finn et al., 2017]. We use MAML, which unrolls the optimization process and tries to find a good initial parameter configuration for all compositions.
Figure 5.5 MAML's computing procedure.
Verbalizer. MAML's classical application is in few-shot learning, where class-to-label assignment needs to be conducted within each episode, that is, the same class has different labels in different episodes. Without such re-assignment, the models can memorize the class information and conduct prediction directly without considering the items in the support set. This is known as the memorization problem in MAML, discussed in [Yin et al., 2019]. To help MetaReVision learn from the retrieved instances, we introduce the verbalizer module into MetaReVision. It enforces prediction for the query set by selecting concepts from the support set, as shown in Figure 5.6. In this way, MetaReVision relies on the retrieved element concepts rather than memorizing the labels to do compositional learning. This helps alleviate MAML's memorization problem.
Figure 5.6 The verbalizer helps the VLM consider the retrieved instances when learning.
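A minimal sketch of the episode-level update in Equation 5.4 is given below. For readability it uses a first-order approximation (a plain inner SGD step followed by an outer update on the query loss) rather than the second-order gradients computed with HIGHER in the actual implementation, and the tiny linear model and random tensors are placeholders for the VLM and a retrieved episode.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(16, 5)           # placeholder for the VLM's prediction head
inner_lr, outer_lr = 5e-2, 1e-2
outer_opt = torch.optim.Adam(model.parameters(), lr=outer_lr)

def episode(n):                           # placeholder support/query items
    return torch.randn(n, 16), torch.randint(0, 5, (n,))

for step in range(3):                                # a few meta-training episodes
    (xs, ys), (xq, yq) = episode(4), episode(2)      # support S and query Q

    # Meta-train: one inner step theta' = theta - alpha * grad L(theta, S).
    support_loss = F.cross_entropy(model(xs), ys)
    grads = torch.autograd.grad(support_loss, list(model.parameters()))
    fast_weights = [p - inner_lr * g for p, g in zip(model.parameters(), grads)]

    # Meta-test: evaluate the fast-updated parameters on the query set Q and
    # update theta (first-order approximation of Equation 5.4).
    query_logits = F.linear(xq, fast_weights[0], fast_weights[1])
    query_loss = F.cross_entropy(query_logits, yq)
    outer_opt.zero_grad()
    query_loss.backward()
    outer_opt.step()
    print(f"episode {step}: query loss {query_loss.item():.3f}")
```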
5.4.4 Inference
During inference time, we consider each test compositional concept as a query item and retrieve relevant instances from the concept DB as support instances. Therefore, we construct a specific task for the current compositional concept. Instead of applying the general model $\theta$ directly, MetaReVision retrieves support instances to fast-update the model to adapt to the current compositions and makes predictions as $v_i = \arg\max_{v \in Sup} P(v)$, where the prediction comes from the retrieved concepts. In MAML's testing, it is observed that a larger number of updates can give a considerable performance boost. Thus, we set the number of inner-loop updates to 20 before testing.
5.5 Experiments
In this section, we introduce GCCL's datasets, present the implementation details of MetaReVision, and compare its results with other baselines. Finally, we empirically analyze the retriever's importance in MetaReVision.
VL-Model Metric Train-Scratch MAML w/o Ret. Ours(Top 4) Ours(Div 4) Train-Scratch MAML w/o Ret. Ours(Top 4) Ours(Div 4) COCO Flickr VLBERT LXMERT Pair Accu.↑ Attr. Accu.↑ Obj. Accu.↑ Pair Accu.↑ Attr. Accu.↑ Obj. Accu.↑ 7.73% 9.03% 11.15% 13.50% 6.04% 8.60% 10.7% 11.50% 25.88% 27.08% 29.84% 31.85% 17.53% 22.06% 24.58% 25.49% 50.74% 50.04% 50.17% 50.92% 65.21% 64.38% 65.54% 66.58% 8.14% 9.04% 12.01% 13.79% 5.12% 7.52% 9.38% 10.58% 26.36% 27.01% 29.36% 33.76% 18.10% 18.45% 20.45% 22.45% 55.06% 56.19% 58.81% 59.87% 61.68% 64.55% 65.10% 65.15%
Table 5.2 MetaReVision's results on novel compositional concepts.
VL-Model Metric Train-Scratch MAML w/o Ret. Ours(Top 4) Ours(Div 4) Train-Scratch MAML w/o Ret. Ours(Top 4) Ours(Div 4) COCO Flickr VLBERT LXMERT Pair Accu.↑ Attr. Accu.↑ Obj. Accu.↑ Pair Accu.↑ Attr. Accu.↑ Obj. Accu.↑ 32.45% 32.23% 32.27% 32.46% 24.34% 23.73% 23.75% 26.52% 49.06% 49.05% 49.15% 50.01% 42.72% 41.92% 41.95% 46.11% 60.03% 59.20% 59.98% 60.05% 52.53% 49.01% 49.04% 53.23% 34.12% 34.09% 34.02% 34.15% 22.68% 22.15% 22.75% 23.41% 50.33% 49.97% 49.90% 50.32% 40.86% 41.21% 41.19% 42.02% 61.96% 61.93% 61.90% 62.00% 50.11% 49.97% 50.01% 51.61%
Table 5.3 MetaReVision's results on seen compositional concepts.
5.5.1 Dataset
CompCOCO is constructed from MSCOCO [Chen et al., 2015] using its 2014 split. In this split, COCO-Captions has 103,175 training images and 15,112 validation images [Chen et al., 2015]. Because MSCOCO does not provide test data, we use the validation data as the testing data in CompCOCO. Moreover, in order to extract more compositional concepts, we modify [Lu et al., 2018]'s categories and change the drier synonym list to: hair drier, hairdryer, hair dryer, blow dryer, blow drier, which helps to extract cleaner concepts. CompFlickr is constructed from Flickr30k Entities [Plummer et al., 2015]. Flickr30k contains 276k manually annotated bounding boxes for 31,783 images and a total of 158,915 English captions (five per image). We use the given train/val/test split to construct CompFlickr.
5.5.2 Evaluation Metrics
We use accuracy as our primary metric to measure GCCL performance and report object, attribute, and compositional accuracy separately. [Jin et al., 2020] uses perplexity as the forgetting metric in continual learning, which is not appropriate in our work due to MetaReVision's offline setting.
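A minimal sketch of these metrics is given below: attribute, object, and pair (compositional) accuracy are computed from predicted and gold (attribute, object) pairs. The function name and toy data are illustrative, not the exact evaluation script.

```python
def gccl_accuracy(predictions, golds):
    """predictions, golds: lists of (attribute, object) string pairs."""
    n = len(golds)
    attr_hits = sum(p[0] == g[0] for p, g in zip(predictions, golds))
    obj_hits = sum(p[1] == g[1] for p, g in zip(predictions, golds))
    pair_hits = sum(p == g for p, g in zip(predictions, golds))
    return {
        "attr_acc": attr_hits / n,   # attribute accuracy
        "obj_acc": obj_hits / n,     # object accuracy
        "pair_acc": pair_hits / n,   # compositional (pair) accuracy
    }

preds = [("red", "bus"), ("small", "dog"), ("white", "boat")]
gold = [("red", "bus"), ("big", "dog"), ("white", "truck")]
print(gccl_accuracy(preds, gold))
# attr_acc = 2/3, obj_acc = 2/3, pair_acc = 1/3
```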
5.5.3 Implementation Details
The implementation of MetaReVision uses the HuggingFace Transformers library [Wolf et al., 2020]. For MAML, we use the Adam optimizer [Kingma and Ba, 2014] as both the inner and outer optimizer. We set the inner learning rate to 5e-5 and the outer learning rate to 1e-5, and we rely on HIGHER4 to calculate the higher-order gradients. The code for this chapter will be released publicly5.
5.5.4 Episode Examples
Table 5.4 shows episode examples constructed in MetaReVision. From the table, we can see that MetaReVision can retrieve true element concepts for target compositional concepts, such as white truck, bird fly, and boy eat. But there also exist cases where we cannot find the true element concepts in the retrieved support set, such as blue bus. In this example, MetaReVision can retrieve many similar objects, but it struggles to retrieve the true color blue. Also, from these randomly sampled episodes, we can see that in GCCL, objects are easier to retrieve than attributes.
Target Context / Target Concepts: A white truck parked in front of a house that is being built. White Truck A couple of birds flying through a cloudy sky. bird fly a small boy is eating from a green plate boy eat A brown dog is on the deck of a boat on water. Brown Dog a blue bus with a large sign on the side of it. Blue Bus blue bus parked in front of an azure building. A Blue Bus
Retrieved Context: Several bikes parked next to a white van. A man in a suit poses by an colored truck. A woman smiling in front of a big bus. People waiting on the side of the road for the yellow bus. Two geese are flying in the air near trees. Two hawks flying near a snow covered mountain. Two birds sit in the grass next to each other. Two black birds are sitting on top of a mountain. A young boy is enjoying his pizza at the dinner table. The little girl is eating lunch and having milk. The woman is eating her meal at the table by herself. An elderly couple is having a small snack in their kitchen. A white and black dog laying on top of a yellow boat. a brown and black horse some green grass and some houses The black and white puppy is playing with a small toy. A white and black animal lays on a bench that is on grass outdoors. A red bus driving down a street in front of a red double decker bus. a red car driving down a city road on a cloudy day A red bus driving next to an orange and green bus. a red double decker bus a regular bus and a tow truck outdoors. Two men in suits stand in front of a blue and white semi truck. a white and black bus with a rainbow colored flag on the front Four friends stand in front of an orange van. A large blue RV parked outside a large brick building.
Retrieved Concepts: White Van Colored Truck Big Bus Yellow Bus. Geese Fly Hawk Fly Bird Sit Black Bird. Boy Enjoy Girl Eat Woman Eat Couple Have. Black Dog Brown Horse. Dog Play White Animal Red Bus Red Car. Green Bus Regular Bus Blue Truck Black Bus. Orange Van Blue RV
Table 5.4 Episode examples constructed by MetaReVision's retrieval module.
5.5.5 Baselines
We use two types of baselines in this evaluation. The first is the train-from-scratch baseline, which trains VLMs from randomly initialized parameters. The other baseline is MAML without the retriever. In this setting, the VLMs are meta-trained using the same retrieved tasks, but they cannot access the support set and predict directly during test time. This baseline is used to show the importance of the retriever during test time for GCCL.
Moreover, we also compare two variants of MetaReVision, Top 4 and Div 4. Top 4 retrieves the top 4 most similar concepts, which may contain duplicated concepts. The same concept could have different vector representations, which are affected by different visual and textual contexts. For example, car could have different vector values when modified by red or blue. Div 4 retrieves the top 4 distinct similar concepts, expecting that the true primitive concept will be in the retrieved set.
4 https://github.com/facebookresearch/higher
5 https://github.com/HLR/MetaReVision
5.5.6 Main Results
We report the performance under both the novel and seen settings, as shown in Table 5.2 and Table 5.3. From the two tables, we can see that MetaReVision does help compositional learning, especially in the novel setting.
Novel Compositions. As shown in Table 5.2, MetaReVision improves the performance in the novel setting compared to the pre-trained model and the MAML models. This suggests that MetaReVision captures a generic representation which is beneficial for compositional learning through meta-learning on the retrieved tasks. However, compared with seen compositions (i.e., Table 5.3), the performance on novel pairs drops significantly across the board. MetaReVision's accuracy drops by about 20% on the CompCOCO dataset in the novel setting compared with the seen setting. This indicates that such compositional generalization is still a very difficult and open task for current VL models.
Seen Compositions. Table 5.3 shows the performance in the seen setting. From the table, we can see that all models have similar accuracy in the seen setting. One possible reason is that all the models have been fully trained using the seen compositional concepts. MAML-based methods do not hurt the in-domain performance during this meta-learning phase.
5.5.7 Empirical Analysis of the Retriever
Retrieval Accuracy. Figure 5.7 shows the retriever's top-4 accuracy for attributes, objects, and pairs under both the seen and novel settings. Attribute recognition is the key challenge compared with object recognition in GCCL, even in the retrieval phase. In GCCL, the learned VLMs are biased toward the seen attributes and need to be adjusted for effective compositional learning.
Figure 5.7 Comparison of the retriever accuracy between seen pairs and novel pairs on the CompCOCO dataset.
Importance of diverse sampling. Retrieving true concepts into the support set is important for GCCL. In this part, we assume an oracle situation where we can always select the true element concepts into the support set during test time. We study the potential advantages that can be derived under this configuration. From Figure 5.8, we can see that having the true concept in the support set does help compositional learning. It also explains the importance of diverse sampling, which increases the probability of selecting the correct element concepts.
5.6 Conclusions and Future Work
In this work, we propose MetaReVision, which combines retrieval and meta-learning to train VLMs for grounded compositional concept learning. Our work highlights the significance of retrieval in compositional learning. Our empirical results on two proposed datasets, CompCOCO and CompFlickr, have shown that MetaReVision consistently outperforms conventional VLMs and meta-learning methods without the retriever, especially in the novel setting. However, GCCL is still a challenging open problem and many problems remain.
Our future work will explore more cognitively plausible models and explicitly address the grounding ability in compositional concept learning.
Figure 5.8 MetaReVision's accuracy on CompCOCO using different retrievers (Top-4: 12.01%, Div-4: 13.79%, Oracle: 24.41%).
5.7 Limitations
The limitations of the proposed MetaReVision include: 1) Grounding limitation. Currently, we rely on the VLM's attention mechanism to do grounding. We do not have an explicit grounding design to align the textual concepts and visual regions. This could be an interesting direction for future GCCL works. 2) SoTA generative model comparisons. Currently, we cannot directly apply SoTA generative models, such as BLIP-2 and MiniGPT, to GCCL for the following reasons. One reason is the GCCL problem setting: in GCCL, it is not easy to transform the supporting items, which include multiple images and captions, into contextual input for these generative models. Another reason is controlled evaluation: these huge generative models may have already seen the novel compositions during training, so the comparison with other models would not be fair. 3) Updating the retriever. We construct our element concept DB in advance and do not update this DB during meta-learning. Training both the learner and the retriever in an end-to-end manner could improve the performance for GCCL and other retrieval-enhanced models.
CHAPTER 6
GENCZSL: GENERATIVE COMPOSITIONAL ZERO-SHOT CONCEPT RECOGNITION
6.1 Introduction
The large generative language models, represented by the GPT series [Brown et al., 2020, Achiam et al., 2023], have achieved huge success in many natural language processing tasks. Moreover, with the scaling of model size and corpus size, these large language models demonstrate an in-context learning (ICL) ability. In this chapter, we aim to solve the compositional zero-shot learning problem through the application of the in-context learning paradigm. We propose leveraging foundation vision-language models to generate compositional concepts, thereby deviating from the conventional discriminative approaches which aim at aligning the compositional concepts with images in a constructed latent space. While large language models have demonstrated remarkable in-context learning capabilities across various natural language processing tasks, deploying such paradigms in vision-language settings presents a considerable challenge. In particular, directly applying Flamingo to generate compositional concepts for CZSL is challenging due to the following reasons: 1) Informative In-Context Example Selection: Different from the few-shot setting, CZSL is essentially a zero-shot learning problem and in-context examples are not available in the CZSL problem. The ICL models need to find related examples to conduct compositional learning. Moreover, since both the selected examples and their order are important to the final ICL performance [Zhang et al., 2022, Nguyen and Wong, 2023, Chang and Jia, 2023], selecting and ranking informative in-context examples becomes a crucial element when applying Flamingo in the context of CZSL.
2) Mapping Between Predicted Tokens and Compositional Concept Labels: Generative models can generate any token from their large vocabulary based on the current contextual input. Mapping the generated tokens to the compositional labels is also challenging when applying ICL in CZSL. 3) Foundation Model Selection: Handling sequences of arbitrarily interleaved visual and textual data is a requirement for the foundation model. Effectively processing such mixed sequences demands the ability to seamlessly transition between visual and linguistic information. The recently proposed Flamingo aims to tackle the aforementioned challenges and provides a solution for vision-language tasks in a few-shot setting where the input comprises interleaved textual and visual information. To enable using generative models for CZSL, we propose a new approach called GenCZSL which is based on Flamingo to generate the compositional concepts. In our proposed technique, we use a retriever to select informative examples and a ranker to further sort the retrieved examples to help Flamingo recognize novel compositions. Concretely, given an image that corresponds to a novel compositional concept, GenCZSL applies CLIP's visual encoder to select related (img, concept) pairs to construct the candidate example pool for in-context learning. Then GenCZSL introduces a ranker to sort the selected examples for in-context learning. For label mapping, instead of taking the argmax over the whole vocabulary, we restrict the model's output to a set of special tokens that correspond to the set of compositional labels, e.g., with the token "red car" corresponding to a compositional concept. Overall, the contributions of this work can be summarized as follows: • To the best of our knowledge, we are the first to apply the generative method to solve the compositional zero-shot learning problem. In contrast to previous discriminative models that train an alignment between images and compositional concepts, our work directly generates the corresponding compositional concept given an input image. • We propose to use retrieval and ranking techniques for more effective in-context learning. Our experimental results show improved performance compared to basic in-context learning.
6.2 Preliminaries
6.2.1 In-Context Learning (ICL)
In this section, we present the background of in-context learning. We focus on in-context learning for CZSL using the vision-language model Flamingo [Alayrac et al., 2022]. Given the vision-language model Flamingo, $n$ relevant in-context examples for the specific task at hand, denoted as $\{x_i, y_i\}_{i=1}^{n}$ where $x_i$ is an image and $y_i$ is the related compositional concept $(a_i, o_i)$, and a test image input $x_{test}$, the compositional concept prediction for $x_{test}$ is generated as follows:
$$y^* = \arg\max_{y \in \mathcal{Y}} \; p_G \left( y \mid x_1 \oplus y_1 \cdots x_n \oplus y_n \oplus x_{test} \right), \quad (6.1)$$
where $\mathcal{Y}$ is the compositional concept label space, $\oplus$ is the concatenation operation, and $n$ is the number of in-context examples. To deal with CZSL tasks, the original label is often mapped to a word or words in Flamingo's vocabulary. As Equation 6.1 shows, Flamingo receives CZSL's supervision only from the concatenated $\{x_i, y_i\}_{i=1}^{n}$ and directly outputs the compositional concept prediction for the test image $x_{test}$. Typically, the number of in-context examples $n$ is limited by the maximum input length of Flamingo. In previous works, the in-context examples are randomly sampled from the whole training dataset $\mathcal{D}$ [Brown et al., 2020].
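To make Equation 6.1 concrete, the following sketch assembles an interleaved prompt in the "An image of {attribute} {object}." style shown in Figure 6.3 and scores each candidate compositional label with a caller-supplied log-likelihood function. The `score_fn` callable is a stand-in for the frozen generative VLM; its signature, the image placeholder format, and the prompt template are our assumptions for illustration.

```python
from typing import Callable, List, Tuple

Image = str  # stand-in type: an image reference / path

def build_prompt(examples: List[Tuple[Image, str, str]]) -> str:
    """Concatenate (image, attribute, object) demonstrations, interleaving
    image placeholders with their textual labels."""
    chunks = [f"<image:{img}> An image of {attr} {obj}.<|endofchunk|>"
              for img, attr, obj in examples]
    return "".join(chunks)

def predict_composition(score_fn: Callable[[str, Image, str], float],
                        examples: List[Tuple[Image, str, str]],
                        test_image: Image,
                        label_space: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Return the best y under p_G(y | x_1 (+) y_1 ... x_n (+) y_n (+) x_test),
    restricting y to the predefined compositional label space (Equation 6.1)."""
    context = build_prompt(examples) + f"<image:{test_image}> An image of"
    scores = {(a, o): score_fn(context, test_image, f" {a} {o}.")
              for a, o in label_space}
    return max(scores, key=scores.get)

# Toy usage with a dummy scorer that prefers labels mentioned in the context.
def dummy_score(context: str, image: Image, continuation: str) -> float:
    return float(continuation.strip(" .") in context)

demos = [("img_001.jpg", "red", "bus"), ("img_002.jpg", "white", "boat")]
labels = [("red", "bus"), ("red", "car"), ("sliced", "apple")]
print(predict_composition(dummy_score, demos, "img_042.jpg", labels))  # ('red', 'bus')
```

Here the demonstrations are whatever the caller supplies; how they are chosen and ordered is exactly what the retriever and ranker introduced later in this chapter address.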
However, recent research has shown that ICL is sensitive to the provided examples; random in-context examples show significant instability and can cause inferior performance [Lu et al., 2022, Chen et al., 2022]. In our work, we focus on selecting, from the entire dataset $\mathcal{D}$, a small number of supporting in-context examples that are informative for the CZSL task and effective for in-context learning.
6.2.2 Foundation Model: Flamingo
Flamingo is a visual-language model that sets a new state of the art for few-shot learning on a wide range of open-ended multi-modal tasks. Flamingo can tackle a diverse spectrum of open-ended multi-modal tasks with just a handful of task-specific examples in a few-shot setting, without any additional training required. Following the ICL paradigm, Flamingo takes input consisting of interleaved images and text and then outputs the associated language, as shown in Figure 6.1. Given a few example pairs of visual inputs and expected text responses composed in Flamingo's prompt, the model can be asked a question with a new image, and then generate an answer.
Figure 6.1 Flamingo architecture overview. Flamingo is a visual-language model that takes visual data interleaved with text as input and produces free-form text as output. It was originally proposed to address the few-shot learning problem. Our work explores its in-context learning ability in CZSL.
To address the challenge of fusing the information of interleaved images and texts, Flamingo introduces the following two key components in addition to the standard auto-regressive architecture:
• Perceiver. Flamingo uses a Perceiver to transform image features from the vision encoder into a fixed number of visual outputs through an attention-based fusing mechanism. In particular, Flamingo learns a predefined number of latent input queries, which are fed to a Transformer and attend to the extracted visual features using the attention mechanism.
• Multi-modal Fuser. Flamingo freezes the pre-trained language model (LM) blocks and inserts dense blocks of cross-attention layers between the original LM layers to fuse information from the visual input into the textual input. These inserted cross-attention layers are trained from scratch.
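The two components can be sketched in a few lines of PyTorch. This is a simplified illustration of the ideas (learned latent queries attending to image features, and a gated cross-attention block wrapped around frozen text states), not Flamingo's actual implementation; all module names and sizes below are ours.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Map a variable number of visual features to a fixed number of outputs
    by letting learned latent queries attend to them."""
    def __init__(self, dim: int, num_latents: int = 8, heads: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:   # (B, N, D)
        queries = self.latents.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        out, _ = self.attn(queries, visual_feats, visual_feats)
        return out                                                    # (B, num_latents, D)

class GatedCrossAttentionBlock(nn.Module):
    """Inject visual information into (frozen) text states through a
    cross-attention layer whose contribution starts at zero (tanh gate)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))    # tanh(0) = 0 at initialization

    def forward(self, text_states: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(text_states, visual_tokens, visual_tokens)
        return text_states + torch.tanh(self.gate) * attended

# Toy shapes: patch features from 2 images fused into a 10-token text sequence.
resampler, fuser = PerceiverResampler(dim=64), GatedCrossAttentionBlock(dim=64)
visual = torch.randn(2, 49, 64)    # e.g., 7x7 patch features per image
text = torch.randn(2, 10, 64)      # hidden states from a frozen LM layer
print(fuser(text, resampler(visual)).shape)   # torch.Size([2, 10, 64])
```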
6.3 GenCZSL: Generative In-Context Learning for CZSL
In this section, we highlight the challenges of in-context learning for CZSL and explain our proposed solution based on Flamingo.
6.3.1 Challenges of Applying ICL in CZSL
Retrieving informative in-context examples is the critical challenge in CZSL [Li et al., 2023b]. Different from the standard few-shot learning in ICL, CZSL requires selecting the examples for ICL first, and the difference is illustrated in Figure 6.2. As demonstrated in the figure, a few image-text examples are provided in advance in few-shot learning, and the pre-trained Flamingo conducts prediction based on the concatenation of these demonstrated text-image pairs and the query image in a generative manner. However, since CZSL is a zero-shot problem, the in-context examples are not provided. For a more accurate prediction, GenCZSL should retrieve related examples from the training set based on the query image and conduct prediction as in Equation 6.1.
Figure 6.2 The ICL difference between few-shot and zero-shot learning.
Figure 6.3 GenCZSL architecture. GenCZSL uses the frozen CLIP visual encoder to retrieve examples and uses a ranker to sort the retrieved items. Flamingo is frozen in GenCZSL.
6.3.2 GenCZSL Architecture
In this section, we describe the architecture of GenCZSL, especially focusing on how GenCZSL retrieves and ranks the in-context examples to help compositional learning. Overall, GenCZSL freezes the foundation vision-language model Flamingo and introduces two components, a retriever and a ranker, to select and order the few-shot examples that guide Flamingo's compositional learning, as shown in Figure 6.3.
First Stage: Retriever. Selecting support examples within a context is a challenging task. The difficulty arises from the impracticality of considering all possible combinations and evaluating them, given the overwhelming complexity caused by combinatorial explosion. Therefore, in the first stage, we aim to find informative individual examples from the training dataset $\mathcal{D}_{train} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the image, $y_i$ is the related compositional label, and $N$ is the training set size. In this phase, we assume that images similar to the query image will be more informative in ICL for CZSL. Based on this assumption, we use CLIP's pre-trained visual encoder [Radford et al., 2021] to select similar images from the training set based on the current query image. In this way, we first retrieve a set of relevant examples of size $n$ ($n \ll N$).
Second Stage: Ranker. Previous works on example selection for ICL [Chang and Jia, 2023, Ye et al., 2023, Lu et al., 2021] show that the order of the examples can impact the accuracy of ICL's generation. Given these results, we introduce a ranking module for GenCZSL to reorder the retrieved examples, as shown in Figure 6.4. We approximate the ranking function using an MLP layer. Previous works mostly apply reinforcement-learning methods for example selection and ranking [Zhang et al., 2022] when using black-box language models as the backbone generative model. However, the model architecture and parameters of Flamingo are open-sourced and available, which makes it possible to back-propagate gradients to the ranker. Therefore, we adopt an easier and more efficient method to update the ranker and obtain a better ordering for the retrieved in-context examples.
Figure 6.4 Ranker architecture.
Figure 6.5 GenCZSL's scoring function. In CZSL, compositional labels include the element attribute and object labels. GenCZSL calculates the average of these element concept probabilities as the compositional prediction.
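A minimal sketch of the averaging step illustrated in Figure 6.5: given the model's token probabilities at the prediction step, each predefined compositional label is scored by averaging the probabilities of its attribute token and its object token, and the argmax is taken over the label set. The token-to-probability mapping and the single-token-per-concept assumption are simplifications for illustration; the next section describes the scoring function in more detail.

```python
def score_compositions(token_probs: dict, label_space: list):
    """token_probs: probability of each vocabulary token at the prediction step.
    label_space: predefined (attribute, object) compositional labels.
    Returns the best composition under average-pooled element probabilities."""
    scores = {}
    for attr, obj in label_space:
        p_attr = token_probs.get(attr, 0.0)
        p_obj = token_probs.get(obj, 0.0)
        scores[(attr, obj)] = (p_attr + p_obj) / 2.0   # AvgPool over element concepts
    best = max(scores, key=scores.get)
    return best, scores

# Toy next-token distribution (only a few entries shown).
token_probs = {"red": 0.30, "blue": 0.05, "bus": 0.25,
               "car": 0.10, "sliced": 0.02, "apple": 0.04}
labels = [("red", "bus"), ("blue", "car"), ("sliced", "apple")]
best, scores = score_compositions(token_probs, labels)
print(best, scores[best])   # ('red', 'bus') 0.275
```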
6.3.3 Scoring Functions for Composition Labels As with other ICL methods, GenCZSL utilizes the scoring function to decide how the predictions of the generative model are mapped into an estimation of the likelihood of a specific label. GenCZSL uses the direct estimation method which uses the probability of candidate answers conditioned on the in-context inputs. The compositional labels are selected from the generated probability distribution for the tokens in Flamingo vocabulary and the most probable composition is selected afterwards. 75 Retrieved ItemsScore 1Score 2Score 3Score 3ArgMax'An image of red bus.<|endofchunk|>An image of blue car.<|endofchunk|>An image of'Score/Rank FunctionFlamingoPredefinedComp. Label……Vocabulary SizeTokens Prob.Red BusSliced AppleGreen CarPrimitiveLabel Prob.Prob. Pooling asComp. PredictionAvgPool( ) = 0.14AvgPool( ) = 0.21AvgPool( ) = 0.15 MIT-States UT-Zappos C-GQA Method AoP [Nagarajan and Grauman, 2018b] LE+ [Misra et al., 2017a] TMN [Purushwalkam et al., 2019b] SymNet [Li et al., 2020b] CompCos [Mancini et al., 2021] CGE [Naeem et al., 2021] SCEN [Li et al., 2022] CLIP [Radford et al., 2021] COOP [Zhou et al., 2022a] CSP [Nayak et al., 2022] GIPCOL [Xu et al., 2024] GenCZSL (random) GenCZSL (with Retriever) GenCZSL (with Ranker) H U S 17.4 9.9 14.3 15.0 20.1 10.7 20.2 20.1 13.0 25.2 16.1 24.2 25.3 16.4 24.6 32.8 28.0 21.4 29.9 25.2 18.4 26.1 40.0 30.2 34.4 47.6 29.8 46.6 49.9 36.3 48.5 49.6 36.6 30.2 37.3 25.3 34.6 42.4 30.6 37.2 43.9 32.0 AUC 1.6 2.0 2.9 3.0 4.5 6.5 5.3 11.0 13.5 19.4 19.9 9.1 12.5 13.2 S H U 59.8 54.2 40.8 53.0 61.9 41.0 58.7 60.0 45.0 40.4 49.8 57.4 59.8 62.5 43.1 64.5 71.5 60.5 63.5 63.1 47.8 49.1 15.6 15.8 52.1 49.3 34.6 64.2 66.2 46.6 65.0 68.5 48.8 39.9 24.8 28.3 48.7 30.2 32.5 50.5 32.1 35.7 S 17.0 18.1 23.1 26.8 28.1 33.5 28.9 7.5 20.5 28.8 31.92 U 5.6 5.6 6.5 10.3 11.2 15.5 25.4 25.0 26.8 26.8 28.4 H 5.9 6.1 7.5 11.0 12.4 16.0 17.5 8.6 17.1 20.5 22.5 AUC 0.7 0.8 1.1 2.1 2.6 4.2 5.5 1.4 4.4 6.2 7.14 AUC 25.9 25.7 29.3 23.4 28.7 33.5 32.0 5.0 18.8 33.0 36.2 10.5 16.3 18.1 Table 6.1 GenCZSL on Closed-World CZSL results on UT-Zappos, Mit-States and C-GQA datasets. MIT-States UT-Zappos C-GQA S Method H U AoP [Nagarajan and Grauman, 2018b] 16.6 4.7 5.7 2.7 2.5 14.2 1.2 0.9 12.6 5.8 7.0 21.4 8.9 10.0 25.4 6.0 5.1 32.4 30.1 14.3 12.8 34.6 12.3 9.3 46.3 15.7 17.4 17.9 48.5 16.0 10.3 9.7 37.6 11.6 40.1 10.5 12.0 41.2 10.4 LE+ [Misra et al., 2017a] TMN [Purushwalkam et al., 2019b] SymNet [Li et al., 2020b] CompCos [Mancini et al., 2021] CGE [Naeem et al., 2021] CLIP [Radford et al., 2021] COOP [Zhou et al., 2022a] CSP [Nayak et al., 2022] GIPCOL [Xu et al., 2024] GenCZSL (random) GenCZSL (with Retriever) GenCZSL (with Ranker) AUC 0.7 0.3 0.1 0.8 1.6 1.0 3.0 2.8 5.7 6.3 2.6 3.2 3.6 S U H 29.4 50.9 34.2 36.5 30.5 60.4 55.9 18.1 21.7 53.3 44.6 34.5 59.3 46.8 36.9 61.7 47.7 39.0 11.2 15.7 20.6 31.5 52.1 28.9 64.1 44.1 38.9 65.0 45.0 40.1 46.8 28.5 18.4 58.2 35.7 24.6 60.1 38.5 23.7 AUC 13.7 16.3 8.4 18.5 21.3 23.1 2.2 13.2 22.7 23.5 9.1 10.8 11.2 - S - H AUC U - - 19.2 0.7 1.0 - - 26.7 2.2 3.3 - - 32.1 1.8 2.9 4.6 4.0 7.5 21.0 4.6 5.5 28.7 5.2 6.9 31.6 5.5 7.3 - 0.08 - 0.43 - 0.47 0.27 0.70 1.20 1.30 - Table 6.2 Open-World CZSL results on UT-Zappos, MIT-States and C-GQA datasets. This practice is similar to the way GPT is adapted to classification tasks [Brown et al., 2020]. In CZSL, we have a pre-defined set of compositional labels. 
We use AvgPool operation to average the element concept’s probability and calculate each compositional label’s probability accordingly, as shown in Figure 6.5. 6.4 Experiments 6.4.1 Dataset We conduct experiments on three compositional zero-shot learning benchmarks, MIT-States [Isola et al., 2015a], UT-Zappos [Yu and Grauman, 2014] and C-GQA [Naeem et al., 2021]. MIT-States and C-GQA include images with the object and their attribute labels. The domain of these datasets 76 is very general. In contrast, UT-Zappos contains images of shoes paired with their material at- tributes which is a more domain-specific dataset. Our experiments follow the previous works [Purushwalkam et al., 2019b, Naeem et al., 2021] for the selection of train and test splits of the datasets. More details about the data splits and statistics can be found in Chapter 4. 6.4.2 Results We compare our results with two types of baselines: 1) Task-specific architectures designed for CZSL and 2) CLIP-based methods in closed and open-world settings. The difference of these two settings is explained in Chapter 4. Both Table 6.1 and Table 6.2 show that although GenCZSL can not achieve SoTA results compared with CLIP-based methods, it obtains competitive results compared to task-specific architectures. We conduct experiments for GenCZSL in three settings regarding different in-context example selecting methods: random example selection, retrieval-based example selection, and using an additional ranker to sort the retrieved examples. From the results, we can observe that compared with random sampled in-context examples, using CLIP’s visual encoder help retrieve more informative examples for GenCZSL to solve the CZSL problem. Moreover, the introduction of the ranker can further improve the CZSL’s performance using in-context learning methods. 6.5 Conclusion and Future Work In this chapter, we provide an evaluation of in-context learning in solving the compositional zero-shot learning problem using the foundation vision-language model Flamingo. We propose an approach called GenCZSL to improve in-context learning for this multi-modal setting. To improve the efficacy of compositional zero-shot learning in GenCZSL, we focused on the selection and ranking of informative in-context examples. Especially, GenCZSL introduces a retriever to select more informative examples, and a ranker to reorder the selected examples, to help Flamingo conduct compositional learning. For future work, more analysis should be conducted to show what examples help Flamingo to do compositional learning, and whether these examples are as effective for human prediction. 77 CHAPTER 7 CONCLUSION AND FUTURE WORK In this chapter, we summarize our work presented in this thesis, highlight the contributions and point to potential future directions. 7.1 Summary of Contributions Compositional learning is a fundamental characteristic of human intelligence. This ability requires the computational models to understand that “the meaning of the whole is a function of the meanings of its parts”. Although deep learning models have achieved huge success in many fields, they owe their success to training on large-scale datasets and have difficulties in adapting to new compositions. In this thesis, we focus on grounded compositional zero-shot learning (CZSL) and conduct experiments based on a variety of models. We demonstrate that large models struggle in compositional learning. 
Consequently, we provide various novel techniques and develop parameter-efficient methods to improve these models' compositional ability. Compared to previous CZSL methods, our proposed methods have achieved better performance on multiple benchmarks, which demonstrates our contribution to advancing compositional learning. Our contributions, which are explained in the chapters of this thesis, can be summarized as follows.
• In Chapter 3, we study the problem of recognizing compositional attribute-object concepts within the zero-shot learning (ZSL) framework. We propose an episode-based cross-attention (EpiCA) network that combines the merits of the cross-attention mechanism and an episode-based training strategy to recognize novel compositional concepts. Firstly, EpiCA is based on cross-attention to associate linguistic concepts with visual information and utilizes a gated pooling layer to build contextualized representations for both images and concepts. The updated representations are used for a more in-depth multi-modal relevance calculation for concept recognition. Secondly, a two-phase episode training strategy, especially the transductive phase, is adopted to utilize unlabeled test examples to alleviate the low-resource learning problem.
• In Chapter 4, we propose GIPCOL, a new CLIP-based prompting framework, to solve the CZSL problem. GIPCOL models the interactions between element concepts via a graph neural network and learns rich compositional representations that are used to provide an effective soft prompt to the CLIP model. Our experiments show that GIPCOL achieves better results compared with other prompting-based methods. Moreover, we analyze the importance of training data for compositional learning. Specifically, our initial results have shown that GIPCOL performs better for wider domains such as MIT-States and C-GQA, but is less effective for a more specific domain such as UT-Zappos. These results demonstrate potential advantages and limitations in applying CLIP-based prompting approaches to compositional concept learning in the future.
• In Chapter 5, we propose MetaReVision, a novel meta-learning framework to train vision-language models for compositional learning. The episodic training and the bi-level optimization of meta-learning encourage gradients learned from the support set to be beneficial for compositional concept learning in the query set. Moreover, we created two datasets based on MSCOCO and Flickr30K to specifically target the evaluation of novel compositional concept learning with rich textual input.
• In Chapter 6, we first evaluate large generative vision-language models in solving the grounded CZSL problem and highlight their shortcomings. Moreover, we propose an effective in-context learning method to be used by such models. Our proposed approach is to select the most informative examples for in-context learning using a retriever module and to use a ranker to reorder the selected examples and find their most effective order. This approach helps the VLM (Flamingo here) to better generalize over novel compositions. Our experiments show the effectiveness of the retriever and ranker modules in the context of CZSL.
7.2 Future Directions
This dissertation explores different approaches to compositional learning, especially in the grounded compositional zero-shot learning field. To extend these approaches to a variety of real-world applications, possible future directions are suggested as follows.
79

7.2.1 Explicit Grounded Compositional Learning

Vision-language models based on transformer architectures [Vaswani et al., 2017a] have achieved great success in many downstream tasks [Su et al., 2020, Tan and Bansal, 2019, Zhuge et al., 2021]. However, current pre-training strategies for these vision-language models usually depend on the attention mechanism to conduct implicit alignment between modalities, which mainly focuses on learning a coarse alignment between the textual and visual input. Such methods lack the fine-grained alignment information between visual regions and textual tokens, which could be a key component in compositional zero-shot learning [Thornberg et al., 2014]. Therefore, one possible research direction for compositional learning is to build explicitly grounded vision-language models that make this fine-grained alignment between visual regions and textual tokens an explicit part of the model design.

7.2.2 Exploring Diffusion Models for Compositional Learning

Recent research shows that generative modeling is a crucial strategy for training artificial neural networks for discriminative tasks like image recognition [Hinton, 2007]. Recent large-scale text-to-image diffusion models have dramatically improved text-based image generation [Ho et al., 2020]. These generative models are trained to maximize the evidence lower bound [Blei et al., 2017] of the data's log-likelihood and learn to model the data distribution via an iterative noising and denoising procedure [Sohl-Dickstein et al., 2015]. These models can generate realistic images from textual prompts and exhibit impressive compositional generalization abilities. Moreover, such diffusion models can be converted into classifiers, which makes them useful for tasks beyond image generation, especially in the zero-shot setting. Diffusion Classifier [Li et al., 2023a] is among the first works to apply diffusion models to discriminative tasks. However, directly utilizing such models in CZSL is still challenging; for example, filtering out infeasible compositional labels in the open-world CZSL setting remains an open problem.

7.2.3 Retrieval-Based Compositional Learning

Humans recognize novel compositional concepts by recalling previously acquired primitive concepts and generalizing them to novel compositions, even if they have never seen these compositions before. However, deep learning models have difficulty with such compositional learning. In this thesis, we explored the importance of example selection when applying in-context learning to CZSL. In our current work, we freeze the foundation model Flamingo and train a ranker to sort the retrieved in-context examples. This design adapts the ranker to the foundation model so that the latter can conduct compositional learning. One possible direction is to design a better ranker to further exploit the foundation model's compositional learning ability, for example by retrieving more diverse in-context examples. Another direction is to learn the ranker and fine-tune the foundation model simultaneously; in the current design, the compositional learning burden rests mostly on the ranker. Using bi-level optimization methods to train the ranker and the foundation model jointly could be an interesting framework for compositional learning. As more multi-modal applications enter our daily lives, it will become increasingly important to equip intelligent agents with compositional abilities that enable them to perceive complex environments they have never seen before.
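To make the retrieve-then-rank idea discussed above more concrete, the following minimal sketch shows one possible organization of such a pipeline. It is only an illustration under simplifying assumptions: the random embeddings, the label list, the fixed ranker scores and the prompt template are hypothetical placeholders rather than the actual GenCZSL or Flamingo components; in practice the image embeddings would come from a frozen visual encoder such as CLIP's, and the ranker would be a trained scoring model.

import numpy as np

def cosine_similarity(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    # Cosine similarity between one query vector and a matrix of candidate vectors.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query

def retrieve(query_emb, support_embs, k=4):
    # Retriever: indices of the k training examples most similar to the query image.
    sims = cosine_similarity(query_emb, support_embs)
    return np.argsort(-sims)[:k].tolist()

def rank(candidate_ids, ranker_scores):
    # Ranker: reorder the retrieved examples by a usefulness score
    # (fixed here as a placeholder; learned by a trained ranker in practice).
    return sorted(candidate_ids, key=lambda i: -ranker_scores[i])

def build_prompt(ordered_ids, labels, query_hint):
    # Assemble the in-context demonstrations followed by the query for the frozen generative VLM.
    demos = [f"<image {i}> The composition is '{labels[i]}'." for i in ordered_ids]
    return "\n".join(demos + [f"<query image> ({query_hint}) The composition is"])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    support_embs = rng.normal(size=(10, 8))   # stand-ins for frozen visual-encoder embeddings
    labels = [f"attr{i} obj{i}" for i in range(10)]
    ranker_scores = rng.random(10)            # placeholder for a trained ranker's scores
    query_emb = rng.normal(size=8)            # embedding of the test image

    ordered = rank(retrieve(query_emb, support_embs, k=4), ranker_scores)
    print(build_prompt(ordered, labels, "unseen attribute-object pair"))

In the design described in Chapter 6, the retriever relies on a frozen visual encoder and the generative VLM itself also stays frozen, so the ranker carries the learnable parameters; the bi-level alternative suggested above would additionally update the foundation model.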
Despite the efforts we have made in this dissertation, a lot of important and interesting problems remain open. We believe that future research on this topic is of great value to make fundamental advances in AI. 81 BIBLIOGRAPHY [Achiam et al., 2023] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774. [Alayrac et al., 2022] Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. (2022). Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736. [Anderson et al., 2018] Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., and Van Den Hengel, A. (2018). Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683. [Andreas, 2019] Andreas, J. (2019). Measuring compositionality in representation learning. arXiv preprint arXiv:1902.07181. [Andreas et al., 2016] Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. (2016). Neural module networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 39–48. [Antol et al., 2015] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., and Parikh, D. (2015). Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433. [Arnab et al., 2021] Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., and Schmid, C. In Proceedings of the IEEE/CVF international (2021). Vivit: A video vision transformer. conference on computer vision, pages 6836–6846. [Bagad et al., 2023] Bagad, P., Tapaswi, M., and Snoek, C. G. (2023). Test of time: Instilling video-language models with a sense of time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2503–2516. [Baroni and Zamparelli, 2010] Baroni, M. and Zamparelli, R. (2010). Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 conference on empirical methods in natural language processing, pages 1183–1193. [Bergen et al., 2021] Bergen, L., O’Donnell, T., and Bahdanau, D. (2021). Systematic generaliza- tion with edge transformers. Advances in Neural Information Processing Systems, 34:1390–1402. [Biederman and Vessel, 2006] Biederman, I. and Vessel, E. A. (2006). Perceptual pleasure and the brain: A novel theory explains why the brain craves information and seeks it through the senses. American scientist, 94(3):247–253. [Blei et al., 2017] Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877. 82 [Brown et al., 2020] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901. [Bugliarello et al., 2020] Bugliarello, E., Cotterell, R., Okazaki, N., and Elliott, D. (2020). Multimodal pretraining unmasked: Unifying the vision and language berts. arXiv preprint arXiv:2011.15124. [Carnap, 1988] Carnap, R. (1988). 
Meaning and necessity: a study in semantics and modal logic, volume 30. University of Chicago Press. [Chai et al., 2018] Chai, J. Y., Gao, Q., She, L., Yang, S., Saba-Sadiya, S., and Xu, G. (2018). Language to action: Towards interactive task learning with physical agents. In IJCAI, pages 2–9. [Chang et al., 2016] Chang, M. B., Ullman, T., Torralba, A., and Tenenbaum, J. B. (2016). arXiv preprint A compositional object-based approach to learning physical dynamics. arXiv:1612.00341. [Chang and Jia, 2023] Chang, T.-Y. and Jia, R. (2023). Data curation alone can stabilize in-context In Proceedings of the 61st Annual Meeting of the Association for Computational learning. Linguistics (Volume 1: Long Papers), pages 8123–8144. [Chao et al., 2016] Chao, W.-L., Changpinyo, S., Gong, B., and Sha, F. (2016). An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 52–68. Springer. [Chen and Grauman, 2014] Chen, C.-Y. and Grauman, K. (2014). Inferring analogous attributes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 200– 207. [Chen et al., 2020a] Chen, H., Ding, G., Liu, X., Lin, Z., Liu, J., and Han, J. (2020a). Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12655–12663. [Chen et al., 2015] Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., and Zitnick, C. L. (2015). Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325. [Chen et al., 2020b] Chen, Y., Gong, S., and Bazzani, L. (2020b). Image search with text feedback by visiolinguistic attention learning. pages 3001–3011. [Chorowski et al., 2015] Chorowski, J. K., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015). Attention-based models for speech recognition. Advances in neural information processing systems, 28:577–585. [Conklin et al., 2021] Conklin, H., Wang, B., Smith, K., and Titov, I. (2021). Meta-learning to compositionally generalize. arXiv preprint arXiv:2106.04252. 83 [Devlin et al., 2019] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre- training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. [Dhillon et al., 2019] Dhillon, G. S., Chaudhari, P., Ravichandran, A., and Soatto, S. (2019). A baseline for few-shot image classification. arXiv preprint arXiv:1909.02729. [Dosovitskiy et al., 2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. [Dou et al., 2019] Dou, Z.-Y., Yu, K., and Anastasopoulos, A. (2019). Investigating meta- learning algorithms for low-resource natural language understanding tasks. arXiv preprint arXiv:1908.10423. [Eisenschlos et al., 2023] Eisenschlos, J. M., Cole, J. R., Liu, F., and Cohen, W. W. (2023). WinoDict: Probing language models for in-context word acquisition. 
In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. [Finn et al., 2017] Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. Proceedings of the 34th International Conference on Machine Learning, 70:1126–1135. [Fodor, 1975] Fodor, J. A. (1975). The language of thought, volume 5. Harvard university press. [Fodor and Pylyshyn, 1988a] Fodor, J. A. and Pylyshyn, Z. W. (1988a). Connectionism and cog- nitive architecture: A critical analysis. Cognition, 28(1-2):3–71. [Fodor and Pylyshyn, 1988b] Fodor, J. A. and Pylyshyn, Z. W. (1988b). Connectionism and cog- nitive architecture: A critical analysis. Cognition, 28(1-2):3–71. [Gao et al., 2023] Gao, K., Chen, L., Zhang, H., Xiao, J., and Sun, Q. (2023). Compositional prompt tuning with motion cues for open-vocabulary video relation detection. arXiv preprint arXiv:2302.00268. [Girdhar et al., 2022] Girdhar, R., Singh, M., Ravi, N., van der Maaten, L., Joulin, A., and Misra, I. (2022). Omnivore: A single model for many visual modalities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16102–16112. [Girshick, 2015] Girshick, R. (2015). Fast r-cnn. In Proceedings of the IEEE international con- ference on computer vision, pages 1440–1448. [Goyal et al., 2022] Goyal, A., Friesen, A., Banino, A., Weber, T., Ke, N. R., Badia, A. P., Guez, A., Mirza, M., Humphreys, P. C., Konyushova, K., et al. (2022). Retrieval-augmented reinforcement learning. In International Conference on Machine Learning, pages 7740–7765. PMLR. [Gu et al., 2018] Gu, J., Wang, Y., Chen, Y., Cho, K., and Li, V. O. (2018). Meta-learning for low-resource neural machine translation. arXiv preprint arXiv:1808.08437. 84 [Hao et al., 2020] Hao, W., Li, C., Li, X., Carin, L., and Gao, J. (2020). Towards learning a generic In Proceedings of the IEEE/CVF agent for vision-and-language navigation via pre-training. Conference on Computer Vision and Pattern Recognition, pages 13137–13146. [He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778. [Hermann, 2014] Hermann, K. M. (2014). Distributed Representations for Compositional Seman- tics. PhD thesis. [Hessel et al., 2021] Hessel, J., Holtzman, A., Forbes, M., Bras, R. L., and Choi, Y. (2021). arXiv preprint image captioning. Clipscore: A reference-free evaluation metric for arXiv:2104.08718. [Hinton, 2007] Hinton, G. E. (2007). To recognize shapes, first learn to generate images. Progress in brain research, 165:535–547. [Ho et al., 2020] Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851. [Hochreiter and Schmidhuber, 1997] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780. [Holla et al., 2020] Holla, N., Mishra, P., Yannakoudakis, H., and Shutova, E. (2020). Learning to learn to disambiguate: Meta-learning for few-shot word sense disambiguation. arXiv preprint arXiv:2004.14355. [Hou et al., 2020] Hou, Z., Peng, X., Qiao, Y., and Tao, D. (2020). Visual compositional learning for human-object interaction detection. In European Conference on Computer Vision, pages 584–600. Springer. [Huang et al., 2023a] Huang, X., Huang, Y.-J., Zhang, Y., Tian, W., Feng, R., Zhang, Y., Xie, Y., Li, Y., and Zhang, L. 
(2023a). Open-set image tagging with multi-grained text supervision. arXiv e-prints, pages arXiv–2310. [Huang et al., 2023b] Huang, X., Zhang, Y., Ma, J., Tian, W., Feng, R., Zhang, Y., Li, Y., Guo, Y., and Zhang, L. (2023b). Tag2text: Guiding vision-language model via image tagging. arXiv preprint arXiv:2303.05657. [Hudson and Manning, 2019] Hudson, D. A. and Manning, C. D. (2019). Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709. [Hupkes et al., 2020] Hupkes, D., Dankers, V., Mul, M., and Bruni, E. (2020). Compositionality decomposed: how do neural networks generalise? Journal of Artificial Intelligence Research, 67:757–795. 85 [Huynh and Elhamifar, 2020] Huynh, D. and Elhamifar, E. (2020). Compositional zero-shot learn- ing via fine-grained dense feature composition. Advances in Neural Information Processing Systems, 33:19849–19860. [Isola et al., 2015a] Isola, P., Lim, J. J., and Adelson, E. H. (2015a). Discovering states and transformations in image collections. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1383–1391. [Isola et al., 2015b] Isola, P., Lim, J. J., and Adelson, E. H. (2015b). Discovering states and transformations in image collections. [Jia et al., 2021] Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., and Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR. [Jin et al., 2021] Jin, W., Cheng, Y., Shen, Y., Chen, W., and Ren, X. (2021). A good prompt is worth millions of parameters? low-resource prompt-based learning for vision-language models. arXiv preprint arXiv:2110.08484. [Jin et al., 2020] Jin, X., Du, J., Sadhu, A., Nevatia, R., and Ren, X. (2020). Visually grounded continual learning of compositional phrases. In EMNLP. [Johnson et al., 2019] Johnson, J., Douze, M., and Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547. [Karpicke, 2012] Karpicke, J. D. (2012). Retrieval-based learning: Active retrieval promotes meaningful learning. Current Directions in Psychological Science, 21(3):157–163. [Karpicke and Blunt, 2011] Karpicke, J. D. and Blunt, J. R. (2011). Retrieval practice produces more learning than elaborative studying with concept mapping. Science, 331(6018):772–775. [Karpicke and Roediger III, 2008] Karpicke, J. D. and Roediger III, H. L. (2008). The critical importance of retrieval for learning. science, 319(5865):966–968. [Karthik et al., 2022] Karthik, S., Mancini, M., and Akata, Z. (2022). Kg-sp: Knowledge guided simple primitives for open world compositional zero-shot learning. [Kato et al., 2018] Kato, K., Li, Y., and Gupta, A. (2018). Compositional learning for human object interaction. In Proceedings of the European Conference on Computer Vision (ECCV), pages 234–251. [Kemp and Tenenbaum, 2009] Kemp, C. and Tenenbaum, J. B. (2009). Structured statistical mod- els of inductive reasoning. Psychological review, 116(1):20. [Khandelwal et al., 2019] Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., and Lewis, M. (2019). Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172. 86 [Kim and Linzen, 2020] Kim, N. and Linzen, T. (2020). 
Cogs: A compositional generalization challenge based on semantic interpretation. arXiv preprint arXiv:2010.05465. [Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic opti- mization. arXiv preprint arXiv:1412.6980. [Kipf and Welling, 2016] Kipf, T. N. and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. [Lake and Baroni, 2018] Lake, B. and Baroni, M. (2018). Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. International Confer- ence on Machine Learning, pages 2873–2882. [Lake, 2019] Lake, B. M. (2019). Compositional generalization through meta sequence-to- sequence learning. arXiv preprint arXiv:1906.05381. [Lake et al., 2015a] Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015a). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338. [Lake et al., 2015b] Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015b). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338. [Lake et al., 2017] Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and brain sciences, 40. [Lee et al., 2018] Lee, K.-H., Chen, X., Hua, G., Hu, H., and He, X. (2018). Stacked cross attention for image-text matching. pages 201–216. [Li et al., 2023a] Li, A. C., Prabhudesai, M., Duggal, S., Brown, E., and Pathak, D. (2023a). Your diffusion model is secretly a zero-shot classifier. arXiv preprint arXiv:2303.16203. [Li et al., 2018] Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. (2018). Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI conference on artificial intelligence, volume 32. [Li et al., 2019a] Li, K., Min, M. R., and Fu, Y. (2019a). Rethinking zero-shot learning: A conditional visual classification perspective. The IEEE International Conference on Computer Vision (ICCV). [Li et al., 2023b] Li, X., Lv, K., Yan, H., Lin, T., Zhu, W., Ni, Y., Xie, G., Wang, X., and Qiu, X. (2023b). Unified demonstration retriever for in-context learning. arXiv preprint arXiv:2305.04320. [Li et al., 2019b] Li, X., Sun, Q., Liu, Y., Zhou, Q., Zheng, S., Chua, T.-S., and Schiele, B. (2019b). Learning to self-train for semi-supervised few-shot classification. Advances in Neural Information Processing Systems, pages 10276–10286. [Li et al., 2022] Li, X., Yang, X., Wei, K., Deng, C., and Yang, M. (2022). Siamese contrastive In Proceedings of the IEEE/CVF embedding network for compositional zero-shot learning. Conference on Computer Vision and Pattern Recognition, pages 9326–9335. 87 [Li et al., 2020a] Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., et al. (2020a). Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 121–137. Springer. [Li et al., 2020b] Li, Y.-L., Xu, Y., Mao, X., and Lu, C. (2020b). Symmetry and group in attribute- object compositions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11316–11325. [Li et al., 2020c] Li, Y.-L., Xu, Y., Mao, X., and Lu, C. (2020c). Symmetry and group in attribute- object compositions. 
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11316–11325. [Liu et al., 2023] Liu, H., Li, C., Wu, Q., and Lee, Y. J. (2023). Visual instruction tuning. arXiv preprint arXiv:2304.08485. [Liu et al., 2021] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. (2021). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586. [Long et al., 2022] Long, A., Yin, W., Ajanthan, T., Nguyen, V., Purkait, P., Garg, R., Blair, A., Shen, C., and van den Hengel, A. (2022). Retrieval augmented classification for long-tail visual In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern recognition. Recognition, pages 6959–6969. [Lu et al., 2019] Lu, J., Batra, D., Parikh, D., and Lee, S. (2019). Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. pages 13–23. [Lu et al., 2018] Lu, J., Yang, J., Batra, D., and Parikh, D. (2018). Neural baby talk. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7219–7228. [Lu et al., 2021] Lu, Y., Bartolo, M., Moore, A., Riedel, S., and Stenetorp, P. (2021). Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786. [Ma et al., 2023a] Ma, Z., Hong, J., Gul, M. O., Gandhi, M., Gao, I., and Krishna, R. (2023a). Crepe: Can vision-language foundation models reason compositionally? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10910–10921. [Ma et al., 2023b] Ma, Z., Pan, J., and Chai, J. (2023b). World-to-words: Grounded open arXiv preprint vocabulary acquisition through fast mapping in vision-language models. arXiv:2306.08685. [Mancini et al., 2021] Mancini, M., Naeem, M. F., Xian, Y., and Akata, Z. (2021). Open world compositional zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5222–5230. [Mancini et al., 2022] Mancini, M., Naeem, M. F., Xian, Y., and Akata, Z. (2022). Learning graph embeddings for open world compositional zero-shot learning. IEEE Transactions on pattern analysis and machine intelligence. 88 [Miller and Charles, 1991] Miller, G. A. and Charles, W. G. (1991). Contextual correlates of semantic similarity. Language and cognitive processes, 6(1):1–28. [Misra et al., 2017a] Misra, I., Gupta, A., and Hebert, M. (2017a). From red wine to red tomato: Composition with context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1792–1801. [Misra et al., 2017b] Misra, I., Gupta, A., and Hebert, M. (2017b). From red wine to red tomato: Composition with context. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1792–1801. [Mokady et al., 2021] Mokady, R., Hertz, A., and Bermano, A. H. (2021). Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734. [Munkhdalai et al., 2018] Munkhdalai, T., Yuan, X., Mehri, S., and Trischler, A. (2018). Rapid adaptation with conditionally shifted neurons. In International Conference on Machine Learning, pages 3664–3673. PMLR. [Naeem et al., 2023] Naeem, M. F., Khan, M. G. Z. A., Xian, Y., Afzal, M. Z., Stricker, D., Van Gool, L., and Tombari, F. (2023). I2mvformer: Large language model generated multi- view document supervision for zero-shot image classification. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15169–15179. [Naeem et al., 2021] Naeem, M. F., Xian, Y., Tombari, F., and Akata, Z. (2021). Learning graph embeddings for compositional zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 953–962. [Nagarajan and Grauman, 2018a] Nagarajan, T. and Grauman, K. (2018a). Attributes as operators. ECCV. [Nagarajan and Grauman, 2018b] Nagarajan, T. and Grauman, K. (2018b). Attributes as operators: factorizing unseen attribute-object compositions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 169–185. [Nan et al., 2019] Nan, Z., Liu, Y., Zheng, N., and Zhu, S.-C. (2019). Recognizing unseen attribute- object pair with generative model. Proceedings of the AAAI Conference on Artificial Intelligence, 33:8811–8818. [Nayak et al., 2022] Nayak, N. V., Yu, P., and Bach, S. H. (2022). Learning to compose soft prompts for compositional zero-shot learning. arXiv preprint arXiv:2204.03574. [Nguyen and Wong, 2023] Nguyen, T. and Wong, E. (2023). In-context example selection with influences. arXiv preprint arXiv:2302.11042. [Nichol et al., 2018a] Nichol, A., Achiam, J., and Schulman, J. (2018a). On first-order meta- learning algorithms. arXiv preprint arXiv:1803.02999. [Nichol et al., 2018b] Nichol, A., Achiam, J., and Schulman, J. (2018b). On first-order meta- learning algorithms. arXiv preprint arXiv:1803.02999. 89 [Nikolaus et al., 2019] Nikolaus, M., Abdou, M., Lamm, M., Aralikatte, R., and Elliott, D. (2019). In Proceedings of the 23rd Conference Compositional generalization in image captioning. on Computational Natural Language Learning (CoNLL), pages 87–98, Hong Kong, China. Association for Computational Linguistics. [Nye et al., 2020] Nye, M. I., Solar-Lezama, A., Tenenbaum, J. B., and Lake, B. M. (2020). Learning compositional rules via neural program synthesis. arXiv preprint arXiv:2003.05562. [Ontanón et al., 2021] Ontanón, S., Ainslie, J., Cvicek, V., and Fisher, Z. (2021). Making trans- formers solve compositional tasks. arXiv preprint arXiv:2108.04378. [Pan and Yang, 2009] Pan, S. J. and Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359. [Partee et al., 1995] Partee, B. et al. (1995). Lexical semantics and compositionality. An invitation to cognitive science: Language, 1:311–360. [Pennington et al., 2014a] Pennington, J., Socher, R., and Manning, C. (2014a). Glove: Global vectors for word representation. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543. [Pennington et al., 2014b] Pennington, J., Socher, R., and Manning, C. D. (2014b). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543. [Plummer et al., 2015] Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., and Lazebnik, S. (2015). Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649. [Purushwalkam et al., 2019a] Purushwalkam, S., Nickel, M., Gupta, A., and Ranzato, M. (2019a). Task-driven modular networks for zero-shot compositional learning. arXiv preprint arXiv:1905.05908. [Purushwalkam et al., 2019b] Purushwalkam, S., Nickel, M., Gupta, A., and Ranzato, M. (2019b). 
In Proceedings of the Task-driven modular networks for zero-shot compositional learning. IEEE/CVF International Conference on Computer Vision, pages 3593–3602. [Qi et al., 2020] Qi, P., Zhang, Y., Zhang, Y., Bolton, J., and Manning, C. D. (2020). Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. [Radford et al., 2021] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR. [Raffel et al., 2020] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P. J., et al. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67. 90 [Rao et al., 2022a] Rao, Y., Zhao, W., Chen, G., Tang, Y., Zhu, Z., Huang, G., Zhou, J., and Lu, J. (2022a). Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). [Rao et al., 2022b] Rao, Y., Zhao, W., Chen, G., Tang, Y., Zhu, Z., Huang, G., Zhou, J., and Lu, J. (2022b). Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18082–18091. [Ravi and Larochelle, 2016] Ravi, S. and Larochelle, H. (2016). Optimization as a model for few-shot learning. [Roediger and Butler, 2011] Roediger, H. L. and Butler, A. C. (2011). The critical role of retrieval practice in long-term retention. Trends in cognitive sciences, 15(1):20–27. [Ruis et al., 2020] Ruis, L., Andreas, J., Baroni, M., Bouchacourt, D., and Lake, B. M. (2020). A benchmark for systematic generalization in grounded language understanding. Advances in neural information processing systems, 33:19861–19872. [Schuhmann et al., 2021] Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., and Komatsuzaki, A. (2021). Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114. [SHAO et al., 2023] SHAO, N., Cai, Z., xu, H., Liao, C., Zheng, Y., and Yang, Z. (2023). Composi- tional task representations for large language models. In The Eleventh International Conference on Learning Representations. [Sharma et al., 2018] Sharma, P., Ding, N., Goodman, S., and Soricut, R. (2018). Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565. [Snell et al., 2017a] Snell, J., Swersky, K., and Zemel, R. (2017a). Prototypical networks for few-shot learning. Advances in neural information processing systems, pages 4077–4087. [Snell et al., 2017b] Snell, J., Swersky, K., and Zemel, R. S. (2017b). Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175. [Sohl-Dickstein et al., 2015] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR. [Song et al., 2022] Song, H., Dong, L., Zhang, W.-N., Liu, T., and Wei, F. (2022). 
Clip mod- els are few-shot learners: Empirical studies on vqa and visual entailment. arXiv preprint arXiv:2203.07190. [Su et al., 2020] Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., and Dai, J. (2020). Vl-bert: Pre- training of generic visual-linguistic representations. In International Conference on Learning Representations. 91 [Sun et al., 2023] Sun, S., Liu, Y., Iter, D., Zhu, C., and Iyyer, M. (2023). How does in-context learning help prompt tuning? arXiv preprint arXiv:2302.11521. [Sung et al., 2018a] Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., and Hospedales, T. M. (2018a). Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1199–1208. [Sung et al., 2018b] Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., and Hospedales, T. M. (2018b). Learning to compare: Relation network for few-shot learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208. [Surís et al., 2020] Surís, D., Epstein, D., Ji, H., Chang, S., and Vondrick, C. (2020). Learning to learn words from visual scenes. European Conference on Computer Vision (ECCV). [Tan and Bansal, 2019] Tan, H. and Bansal, M. (2019). Lxmert: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. [Thornberg et al., 2014] Thornberg, R., Perhamus, L., and Charmaz, K. (2014). Grounded theory. Handbook of research methods in early childhood education: Research methodologies, 1:405– 439. [Thrush et al., 2022] Thrush, T., Jiang, R., Bartolo, M., Singh, A., Williams, A., Kiela, D., and Ross, C. (2022). Winoground: Probing vision and language models for visio-linguistic com- positionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248. [Touvron et al., 2023] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. [Tsai et al., 2020] Tsai, Y.-Y., Chen, P.-Y., and Ho, T.-Y. (2020). Transfer learning without knowing: Reprogramming black-box machine learning models with scarce data and limited resources. In International Conference on Machine Learning, pages 9614–9624. PMLR. [Tsimpoukelli et al., 2021] Tsimpoukelli, M., Menick, J. L., Cabi, S., Eslami, S., Vinyals, O., and Hill, F. (2021). Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212. [Van der Maaten and Hinton, 2008] Van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-sne. Journal of machine learning research, 9(11). [Vaswani et al., 2017a] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017a). Attention is all you need. arXiv preprint arXiv:1706.03762. [Vaswani et al., 2017b] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017b). Attention is all you need. Advances in neural information processing systems, pages 5998–6008. 92 [Velickovic et al., 2017] Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y., et al. (2017). Graph attention networks. stat, 1050(20):10–48550. [Vinyals et al., 2016] Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. (2016). 
Matching networks for one shot learning. Advances in neural information processing systems, pages 3630–3638. [Vinyals et al., 2015] Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164. [Wang et al., 2021] Wang, X., Jiang, Y., Bach, N., Wang, T., Huang, Z., Huang, F., and Tu, K. Improving named entity recognition by external context retrieving and cooperative (2021). learning. arXiv preprint arXiv:2105.03654. [Wang et al., 2019] Wang, X., Yu, F., Wang, R., Darrell, T., and Gonzalez, J. E. (2019). Tafe-net: Task-aware feature embeddings for low shot learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1831–1840. [Wei et al., 2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903. [Wei et al., 2019] Wei, K., Yang, M., Wang, H., Deng, C., and Liu, X. (2019). Adversarial fine-grained composition learning for unseen attribute-object recognition. pages 3741–3749. [Wolf et al., 2020] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. (2020). Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pages 38–45. [Xian et al., 2018a] Xian, Y., Lorenz, T., Schiele, B., and Akata, Z. (2018a). Feature generating networks for zero-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5542–5551. [Xian et al., 2018b] Xian, Y., Lorenz, T., Schiele, B., and Akata, Z. (2018b). Feature generating networks for zero-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5542–5551. [Xu et al., 2024] Xu, G., Chai, J., and Kordjamshidi, P. (2024). Gipcol: Graph-injected soft In Proceedings of the IEEE/CVF Winter prompting for compositional zero-shot learning. Conference on Applications of Computer Vision, pages 5774–5783. [Xu et al., 2022] Xu, G., Kordjamshidi, P., and Chai, J. (2022). Prompting large pre-trained vision- language models for compositional concept learning. arXiv:2211.05077. [Xu et al., 2021] Xu, H., Ghosh, G., Huang, P.-Y., Okhonko, D., Aghajanyan, A., Metze, F., Zettlemoyer, L., and Feichtenhofer, C. (2021). Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084. 93 [Yang et al., 2022] Yang, Z., Gan, Z., Wang, J., Hu, X., Lu, Y., Liu, Z., and Wang, L. (2022). An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3081–3089. [Ye et al., 2023] Ye, J., Wu, Z., Feng, J., Yu, T., and Kong, L. (2023). Compositional exemplars for in-context learning. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J., editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 39818–39833. PMLR. [Yin et al., 2019] Yin, M., Tucker, G., Zhou, M., Levine, S., and Finn, C. (2019). Meta-learning without memorization. arXiv preprint arXiv:1912.03820. [Yu and Grauman, 2014] Yu, A. and Grauman, K. (2014). Fine-grained visual comparisons with local learning. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition, pages 192–199. [Yu and Grauman, 2017] Yu, A. and Grauman, K. (2017). Semantic jitter: Dense supervision for visual comparisons via synthetic images. Proceedings of the IEEE International Conference on Computer Vision, pages 5570–5579. [Yu et al., 2016] Yu, L., Poirson, P., Yang, S., Berg, A. C., and Berg, T. L. (2016). Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer. [Zellers et al., 2019] Zellers, R., Bisk, Y., Farhadi, A., and Choi, Y. (2019). From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6720–6731. [Zhang et al., 2022] Zhang, Y., Feng, S., and Tan, C. (2022). Active example selection for in- context learning. arXiv preprint arXiv:2211.04486. [Zhang et al., 2023] Zhang, Y., Huang, X., Ma, J., Li, Z., Luo, Z., Xie, Y., Qin, Y., Luo, T., Li, Y., Liu, S., et al. (2023). Recognize anything: A strong image tagging model. arXiv preprint arXiv:2306.03514. [Zhang and Sabuncu, 2018] Zhang, Z. and Sabuncu, M. (2018). Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in neural information processing systems, pages 8778–8788. [Zhou et al., 2022a] Zhou, K., Yang, J., Loy, C. C., and Liu, Z. (2022a). Conditional prompt learning for vision-language models. In CVPR. [Zhou et al., 2022b] Zhou, K., Yang, J., Loy, C. C., and Liu, Z. (2022b). Conditional prompt learning for vision-language models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). [Zhou et al., 2020] Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J., and Gao, J. (2020). Unified vision-language pre-training for image captioning and vqa. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 13041–13049. 94 [Zhu et al., 2023] Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. (2023). Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592. [Zhuge et al., 2021] Zhuge, M., Gao, D., Fan, D.-P., Jin, L., Chen, B., Zhou, H., Qiu, M., and Shao, L. (2021). Kaleido-bert: Vision-language pre-training on fashion domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12647–12657. 95