GROUNDED COMPOSITIONAL CONCEPT LEARNING
Humans learn concepts in a grounded and compositional manner; these abilities enable them to understand an endless variety of scenarios and expressions. Although deep learning models have pushed performance to new limits on many Natural Language Processing and Computer Vision tasks, we still know little about how these models process compositional structures and about their potential for human-like meaning composition. The goal of this thesis is to advance current compositional generalization research on both the evaluation and the design of learning models. In this direction, we make the following contributions.

Firstly, we introduce a transductive learning method that uses unlabeled data to learn the distribution of both seen and novel compositions. We employ a cross-attention mechanism to align and ground linguistic concepts in specific regions of the image, addressing the grounding challenge. Unlike traditional learning, we use episodic training, where each training item consists of one image together with sampled positive and negative compositional labels; the image's compositional label is selected by computing matching scores. Our empirical results show that combining episodic training and transductive learning helps compositional learning.

Secondly, we develop a new prompting technique for compositional learning that considers the interaction between element concepts. Our proposed technique, GIPCOL, constructs a textual input containing rich compositional information when prompting the foundation vision-language model. We use CLIP as the pre-trained backbone vision-language model and improve its compositional zero-shot learning ability with our novel soft-prompting approach. GIPCOL freezes the majority of CLIP's parameters and learns only CLIP's word embedding layer through a graph neural network.
By concatenating the learnable soft prompt and the updated word embeddings, GIPCOL achieves better results than other prompting-based methods.

Thirdly, since retrieval plays a critical role in human learning, we study how retrieval can help compositional learning. We propose MetaReVision, a new retrieval-enhanced meta-learning model for visually grounded compositional concept learning. Given an image with a novel compositional concept, MetaReVision first uses a retrieval module to find relevant items from the training set. It then constructs an episode in which the retrieved items form the support set and the test item forms the query set. The retrieved support set mimics the primitive-concept learning scenario, while the query set encourages compositional strategy learning through meta-learning's bi-level optimization objective. The experimental results show that this retrieval-enhanced meta-learning framework helps the vision-language model's compositional learning. Moreover, we create two new benchmarks, CompCOCO and CompFlickr, for evaluating grounded compositional concept learning.

Finally, we evaluate large generative vision-and-language models on compositional zero-shot learning within the in-context learning framework. We highlight their shortcomings and propose retriever and ranker modules that select the most informative in-context examples in their most effective order to guide the backbone generative model. Our approach is novel in the context of grounded compositional learning, and our experimental results show improved performance over basic in-context learning.
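As a rough sketch of the retrieval-and-scoring idea running through these contributions (the dissertation's actual models are not reproduced here; the function names, embedding dimensions, and cosine-similarity matching score below are illustrative assumptions): given an embedded query image, training items are scored against it, and the top-k form an episode's support set while the query item forms the query set.

```python
import numpy as np

def build_episode(query_emb, train_embs, k=3):
    """Illustrative retrieval step: score training items by cosine
    similarity to the query embedding and keep the top-k as the
    episode's support set; the test item itself is the query set."""
    q = query_emb / np.linalg.norm(query_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    scores = t @ q                      # matching score per training item
    support = np.argsort(-scores)[:k]   # indices of the retrieved items
    return {"support": support.tolist(), "query": query_emb}

# Toy example: 4 one-hot training embeddings; the query is closest to items 0 and 1.
train = np.eye(4)
query = np.array([1.0, 0.5, 0.0, 0.0])
episode = build_episode(query, train, k=2)
print(episode["support"])  # [0, 1]
```

In the thesis's meta-learning setting, such an episode would then be consumed by a bi-level optimization loop; this sketch covers only the retrieval step.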
- In Collections: Electronic Theses & Dissertations
- Copyright Status: Attribution 4.0 International
- Material Type: Theses
- Authors: Xu, Guangyue
- Thesis Advisors: Kordjamshidi, Parisa; Chai, Joyce
- Committee Members: Liu, Taosheng; Liu, Xiaoming; Chai, Joyce; Kordjamshidi, Parisa
- Date Published: 2024
- Subjects: Computer science
- Program of Study: Computer Science - Doctor of Philosophy
- Degree Level: Doctoral
- Language: English
- Pages: 102 pages
- Permalink: https://doi.org/doi:10.25335/ypa8-ph67