Improving Grounding Ability of Vision and Language Navigation Agents
Understanding and following human instructions is crucial for intelligent agents to interact with humans in real-world environments. One formal problem setting designed to advance this line of research is Vision-and-Language Navigation (VLN). VLN requires an agent to carry out a sequence of actions in a photo-realistic simulated indoor environment in response to natural language instructions. Although significant progress has been made in this direction, navigation agents still struggle to understand instructions and to ground them accurately in their visual perception. The challenges include 1) the lack of explicit learning of spatial semantics in both the text and vision modalities, 2) difficulty in handling ambiguous instructions and a lack of explainability, and 3) gaps in language understanding for navigation in realistic settings, such as continuous and 3D environments. In this thesis, we develop new techniques to address these challenges.

First, we explicitly model spatial semantics to improve the navigation agent's grounding by incorporating navigation progress, the alignments between textual landmarks and visual objects, and the corresponding spatial directions. In addition, we design specialized modules that capture distinct semantic aspects through corresponding pre-training tasks, enabling the agent to acquire the respective skills effectively.

Second, to help agents deal with ambiguous instructions, we introduce a translator that converts the original ambiguous instructions into easy-to-follow instructions built around recognizable and distinctive landmarks. The translator bridges the gap between instructions given by humans and the agent's visual perception ability. Furthermore, to improve the explainability of the agent's decisions, we equip the navigation agent with a language generator that produces explanations about navigation progress, navigation difficulties, and the visual objects observed in the selected target view. Such explanations allow the agent to describe the situation from its own perspective, enhancing its ability to interact with humans effectively.

Third, to advance navigation in more realistic settings, we contribute to language grounding in continuous and 3D environments. For navigation in continuous environments, we introduce a dual-action-perception module that integrates a low-level action decoder, jointly trained with high-level action prediction. This design enables the VLN agent to learn to ground the selected visual view to the corresponding low-level controls. Additionally, in 3D environments, we develop techniques that enhance the agent's situated spatial understanding, further improving its navigation capabilities in 3D scenarios.

We evaluate the proposed methods on commonly used navigation benchmarks and provide comprehensive quantitative results and qualitative analysis. The experimental results demonstrate that our explicit grounding modules, the proposed pre-training tasks, and the synthesized data incorporating recognizable and distinctive landmarks significantly enhance navigation performance, generalizability, and language grounding ability. Additionally, our novel architectures for continuous and 3D environments push navigation agent research toward real-world scenarios.
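As a rough illustration of the first contribution's landmark-object alignment, a grounding module might align landmark phrase embeddings with detected object features via cross-attention. The sketch below is a minimal, hypothetical rendering and not the dissertation's actual code; the module name, feature dimensions, and the particular cross-attention formulation are all assumptions.

```python
# Minimal sketch (hypothetical) of explicit landmark grounding: textual
# landmark embeddings attend over detected object features, and the
# attention weights double as an interpretable alignment signal.
import torch
import torch.nn as nn


class LandmarkObjectGrounding(nn.Module):
    """Cross-attention from landmark tokens to visual object features."""

    def __init__(self, text_dim=768, vis_dim=512, hidden=256):
        super().__init__()
        self.q_proj = nn.Linear(text_dim, hidden)   # landmark phrase queries
        self.k_proj = nn.Linear(vis_dim, hidden)    # object keys
        self.v_proj = nn.Linear(vis_dim, hidden)    # object values
        self.scale = hidden ** -0.5

    def forward(self, landmark_emb, object_feats):
        # landmark_emb: (B, L, text_dim) -- embeddings of L landmark phrases
        # object_feats: (B, O, vis_dim)  -- features of O detected objects
        q = self.q_proj(landmark_emb)
        k = self.k_proj(object_feats)
        v = self.v_proj(object_feats)
        # (B, L, O): soft alignment of each landmark to each object
        align = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        grounded = align @ v  # landmark embeddings enriched with visual evidence
        return grounded, align


# Usage: align 4 landmark phrases against 36 candidate objects.
model = LandmarkObjectGrounding()
grounded, align = model(torch.randn(2, 4, 768), torch.randn(2, 36, 512))
print(grounded.shape, align.shape)  # (2, 4, 256) and (2, 4, 36)
```

Each row of `align` shows which visual objects a landmark phrase attends to, which is one way such an alignment can also serve interpretability.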
Notably, these advancements contribute to improved interpretability of the agent's decision-making process, offering deeper insights into the rationale behind its navigational actions.
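Similarly, the dual-action-perception design for continuous environments can be pictured as a high-level view scorer trained jointly with a low-level control decoder. The following is a minimal sketch under assumed shapes and an assumed action vocabulary, not the thesis implementation.

```python
# Hypothetical sketch of joint high-level view selection and low-level
# control decoding for VLN in continuous environments.
import torch
import torch.nn as nn

LOW_LEVEL_ACTIONS = ["STOP", "FORWARD", "TURN_LEFT", "TURN_RIGHT"]  # assumed vocabulary


class DualActionHead(nn.Module):
    """High-level view scoring jointly trained with low-level control decoding."""

    def __init__(self, feat_dim=512, n_low=len(LOW_LEVEL_ACTIONS)):
        super().__init__()
        self.high_scorer = nn.Linear(feat_dim, 1)        # one score per candidate view
        self.low_decoder = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.low_head = nn.Linear(feat_dim, n_low)       # per-step control logits

    def forward(self, view_feats, steps=4):
        # view_feats: (B, V, feat_dim) -- features of V candidate views
        high_logits = self.high_scorer(view_feats).squeeze(-1)    # (B, V)
        # Soft selection keeps both heads differentiable end to end.
        attn = torch.softmax(high_logits, dim=-1).unsqueeze(-1)   # (B, V, 1)
        selected = (attn * view_feats).sum(dim=1, keepdim=True)   # (B, 1, D)
        # Unroll the control decoder for a fixed number of steps.
        dec_out, _ = self.low_decoder(selected.expand(-1, steps, -1).contiguous())
        return high_logits, self.low_head(dec_out)       # (B, V), (B, steps, n_low)


# Joint training signal: cross-entropy on the gold view plus the gold controls.
head = DualActionHead()
high_logits, low_logits = head(torch.randn(2, 8, 512))
gold_view = torch.tensor([3, 5])                # index of the correct view
gold_ctrl = torch.randint(0, 4, (2, 4))         # gold low-level action ids
loss = (nn.functional.cross_entropy(high_logits, gold_view)
        + nn.functional.cross_entropy(low_logits.reshape(-1, 4), gold_ctrl.reshape(-1)))
loss.backward()
```

The softmax-weighted view selection is one plausible way to let the high-level and low-level losses be optimized jointly, so gradients from the control decoder also shape view scoring.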
- In Collections: Electronic Theses & Dissertations
- Copyright Status: Attribution-NonCommercial 4.0 International
- Material Type: Theses
- Authors: Zhang, Yue
- Thesis Advisors: Kordjamshidi, Parisa
- Committee Members: Ross, Arun; Tan, Xiaobo; Kong, Yu; Tan, Pang-Ning
- Date Published: 2025
- Subjects: Computer science
- Program of Study: Computer Science - Doctor of Philosophy
- Degree Level: Doctoral
- Language: English
- Pages: 112
- Permalink: https://doi.org/doi:10.25335/y902-t982