IMPROVING GROUNDING ABILITY OF VISION AND LANGUAGE NAVIGATION AGENTS

By Yue Zhang

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science — Doctor of Philosophy

2025

ABSTRACT

Understanding and following human instructions is crucial for intelligent agents to interact with humans in real-world environments. One formal problem setting designed to facilitate advancing this direction of research is Vision-and-Language Navigation (VLN). VLN requires the agent to carry out a sequence of actions in a photo-realistic simulated indoor environment in response to natural language instructions. Although significant progress has been achieved in this direction, navigation agents still have challenges understanding instructions and accurately grounding them in their visual perception. To elaborate, the challenges include 1) lacking explicit learning of spatial semantics in both text and vision modalities, 2) difficulties in handling ambiguous instructions and the lack of explainability, and 3) gaps in language understanding for navigation in realistic environments, such as continuous and 3D spaces. In this thesis, we develop new techniques to address these challenges.

First, we explicitly model spatial semantics to improve the navigation agent's grounding by incorporating navigation progress, the alignments between textual landmarks and visual objects, and the corresponding spatial directions. In addition, we design specialized modules to capture distinct semantic aspects through corresponding pre-training tasks, enabling the effective acquisition of the respective skills. Second, to help agents deal with ambiguous instructions, we introduce a translator to convert the original ambiguous instructions into easy-to-follow instructions that consider recognizable and distinctive landmarks. The designed translator bridges the gap between the instruction given by humans and the agent's visual perception ability. Furthermore, to improve the explainability of the decisions made by the agent, we introduce a language generator for the navigation agent to equip it with the ability to generate explanations about navigation progress, navigation difficulties, and observed visual objects in the selected target view. Such explanations enable the agent to explain the situation from its own perspective, enhancing its ability to interact with humans effectively. Third, to advance navigation in a more realistic setting, we contribute to language grounding in continuous and 3D environments. For navigation in continuous environments, we introduce a dual-action-perception module that integrates a low-level action decoder, jointly trained with high-level action prediction. This design enables the VLN agent to learn and ground the selected visual view to the corresponding low-level controls. Additionally, in 3D environments, we develop techniques to enhance the agent's situated spatial understanding, further improving its navigation capabilities in 3D scenarios.

We evaluate our proposed methods across different commonly used navigation benchmarks and provide comprehensive quantitative results and qualitative analysis. The experimental results demonstrate that our explicit grounding modules, the proposed pre-training tasks, and the synthesized data incorporating recognizable and distinctive landmarks significantly enhance navigation performance, generalizability, and language grounding ability.
Additionally, our novel architectures designed for continuous and 3D environments push the boundaries of navigation agent research toward real-world scenarios. Notably, these advancements contribute to improved interpretability of the agent's decision-making process, offering deeper insights into the rationale behind its navigational actions.

Copyright by
YUE ZHANG
2025

Dedicated to my family, for their unwavering support, endless encouragement, and unconditional love.

ACKNOWLEDGMENTS

My journey toward earning a doctoral degree at Michigan State University has been an incredibly precious experience, filled with countless cherished memories. Throughout this period of academic pursuit, I have not only deepened my knowledge and expanded my research horizons but also gained invaluable experiences that have profoundly shaped my intellectual growth and career aspirations. I am sincerely grateful to all those who have encouraged, challenged, and inspired me along the way, making this journey both meaningful and rewarding.

First and foremost, I want to extend my gratitude to my advisor, Dr. Parisa Kordjamshidi, for her unwavering support and encouragement throughout my Ph.D. journey. Research is a path filled with challenges and setbacks, and there were moments that felt disheartening. However, Dr. Kordjamshidi's trust in me and her steadfast encouragement gave me the strength to persevere. Beyond her support, I deeply admire her exceptional work ethic and unwavering dedication to research. Her commitment to maintaining the highest standards has profoundly shaped my own approach to conducting meaningful and impactful work. Under her guidance, I have not only grown as a researcher but have also learned the importance of persistence, critical thinking, and intellectual curiosity. I am truly grateful for her mentorship and the invaluable lessons she has imparted throughout this journey.

Next, I want to thank my committee members, Arun Ross, Xiaobo Tan, Yu Kong, and Pang-Ning Tan, for their insightful feedback on my thesis. Their guidance and expertise have been invaluable in refining my research.

I am also deeply thankful for my incredible labmates and collaborators. I feel truly fortunate to be a member of the Heterogeneous Learning and Reasoning (HLR) Lab, where I have received constant support and encouragement. I especially want to sincerely thank Dr. Quan Guo for his guidance at the beginning of my Ph.D. journey; his mentorship provided me with essential knowledge and direction. Additionally, I am grateful to Dr. Chen Zheng for his valuable support, both in my research and in daily life.

This degree would not have been possible without the unwavering support of my parents, Mr. Xiuzhong Zhang and Mrs. Meixiang Zhang, my sister, Dr. Bin Zhang, my brother, Chao Zhang, and the rest of my family. My parents' steadfast encouragement, sacrifices, and unwavering belief in my abilities have been the foundation of my academic journey. I am deeply grateful for their dedication to my education, which has shaped me both intellectually and personally. Their endless love, patience, and guidance have given me the strength to overcome challenges and the courage to pursue my dreams. Without them, this achievement would not have been possible, and I am forever thankful.

Finally, I wish to express my deepest gratitude to my husband, Xiao Guo, whose presence has been a constant source of strength and inspiration in my life.
His unwavering support has anchored me through every challenge, offering not only encouragement but also the reassurance that I am never alone on this journey. I am especially grateful for his belief in me, even during times when I struggled to believe in myself. His steadfast commitment and unconditional love mean the world to me. I am truly fortunate to walk this path with him by my side.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
  1.1 Motivation
  1.2 Challenges and Contributions
  1.3 Thesis Outline
CHAPTER 2 LITERATURE REVIEW
  2.1 Embodied AI
  2.2 Vision and Language Learning
  2.3 Vision and Language Navigation
CHAPTER 3 EXPLICIT SPATIAL UNDERSTANDING AND GROUNDING
  3.1 Introduction
  3.2 EXOR: Explicit Object Relation Alignment
  3.3 LOViS: Learning Orientation and Visual Signals
  3.4 Experiments
  3.5 Conclusion
CHAPTER 4 ADDRESSING AMBIGUOUS INSTRUCTIONS AND IMPROVING EXPLAINABILITY
  4.1 Introduction
  4.2 VLN-Trans: VLN Agent with a Translator
  4.3 NavHint: VLN Agent with a Hint Generator
  4.4 Experiments
  4.5 Conclusion
CHAPTER 5 ADVANCING VLN FOR REAL-WORLD CHALLENGES: NAVIGATION IN CONTINUOUS AND 3D ENVIRONMENTS
  5.1 Introduction
  5.2 Narrowing the Gap between Vision and Action for the VLN-CE Agent
  5.3 Spartun3D: Situated Spatial Understanding in 3D World
  5.4 Experiments for VLN-CE
  5.5 Experiments for Spartun3D
  5.6 Conclusion
CHAPTER 6 CONCLUSION AND FUTURE WORK
  6.1 Summary of Contributions
  6.2 Future Directions
BIBLIOGRAPHY

LIST OF TABLES

Table 3.1 Experimental results for EXOR compared to LSTM-based VLN agents.
Table 3.2 Experimental results for LOViS compared to Transformer-based VLN agents.
Table 3.3 Experimental results comparing LOViS with the baseline models on the R4R dataset.
Table 3.4 Ablation study for EXOR.
Table 3.5 Ablation study for different pre-training tasks for LOViS.
Table 4.1 Landmark ambiguity. Columns 1 and 2 show the categories of landmark ambiguity and the corresponding descriptions. Column 3 shows the template for generating the hint for each category.
Table 4.2 Experimental results on the R2R benchmark in a single-run setting. The best results are in bold font. + means we add RXR [55] and the Marky-mT5 dataset [112] as extra data to pre-train the navigation agent. ++ means we further add the SyFiS dataset to pre-train the navigation agent. ViT means Vision Transformer representations.
Table 4.3 Experimental results on the R2R dataset. ViT: uses Vision Transformer representations. Hint.: uses our hint generator.
Table 4.4 BLEU score for the generated sub-instructions on the R2R dataset.
Table 4.5 Ablation study for training tasks for the translator.
Table 4.6 Ablation study for different parts of the hint. Sub.: sub-instruction; L-A.: Landmark Ambiguity; TD-Obj: Target Distinctive Objects; Obj: Top-3 objects.
Table 5.1 Dataset statistics of Spartun3D and human validation results.
Table 5.2 Experimental results on high-level actions evaluated on the R2R-CE validation unseen and test sets.
Table 5.3 Experimental results of low-level actions on the R2R-CE validation unseen set. * means our implementation for low-level action prediction, as most VLN-CE agents do not report their low-level performance. We train the VLN agent with a low-level action classifier for fair comparison.
Table 5.4 Ablation study on different components of our method. The baseline is WP-HAMT, and Ob-mask is the obstacle mask.
Table 5.5 Evaluation of different visual encoders.
Table 5.6 Analysis of the influence of various open-area vocabularies on the waypoint predictor.
Table 5.7 Experimental results on Spartun3D situated QA tasks. ∗ represents the model initialized with LEO instruction-tuned weights. [Keys: C: CIDEr; B-4: BLEU-4; M: METEOR; R: ROUGE; Sim: Sentence Similarity; EM: Exact Match; Bold: best results].
Table 5.8 Experimental results on the Spartun3D situated captioning task.
Table 5.9 Experimental results on SQA3D given the 3D objects from Mask3D and the ground truth.
Table 5.10 Navigation performance (Accuracy %).
Table 5.11 Spatial alignment evaluation on other benchmarks. The metric is Sentence Similarity.
LIST OF FIGURES

Figure 1.1 VLN Task Demonstration.
Figure 1.2 Challenges of Language Grounding in the VLN Task.
Figure 3.1 Spatial configuration example.
Figure 3.2 Model architecture of EXOR.
Figure 3.3 Model architecture of LOViS.
Figure 3.4 Pre-training model with specific pre-training tasks.
Figure 3.5 An example visualization of the navigation process in EXOR. The green boxes are spatial configurations; the darker green means higher weights; the yellow boxes are the selected landmarks; the orange arrows are the path.
Figure 3.6 A qualitative example to show each module of LOViS.
Figure 4.1 Instructions that make the grounding in the VLN task challenging.
Figure 4.2 VLN agent with our hint generator.
Figure 4.3 The overview of the proposed VLN-Trans. (a) Navigation agent with VLN-Trans. (b) The translator architecture. (c) Pre-training the translator. SG: Sub-instruction Generation; DSL: Distinctive Sub-instruction Learning; SS: Sub-instruction Split.
Figure 4.4 Motion indicator vocabulary.
Figure 4.5 Illustration of constructing the SyFiS dataset.
Figure 4.6 Navigation hint dataset. An example of a navigation hint with the landmark ambiguity of "Missing Landmarks". The sub-instruction is "walk into the hallway", and the landmark "hallway" in the instruction is observed in view1 rather than the target view3, which can potentially mislead the navigation agent. The target distinctive objects "wooden dining table" and "marble countertop" are then provided. "Blue walls" is non-distinctive as it appears in both view2 and view3.
Figure 4.7 Model architecture of NavHint.
Figure 4.8 Qualitative examples to show how the translator helps the navigation agent. The red boxes and green boxes show the distinctive and the non-distinctive landmarks; the green arrow and red arrow show the target and the predicted viewpoints.
Figure 4.9 Accuracy of the landmark ambiguity in generated hints.
Figure 4.10 Accuracy of the distinctive objects for each landmark ambiguity in the targeted viewpoint.
Figure 4.11 Qualitative examples for NavHint. The green and orange arrows show the ground-truth and the predicted viewpoints, respectively.
Figure 5.1 Illustration of situated scene understanding of Spartun3D-LLM compared to other 3D-based LLMs.
Figure 5.2 Main architecture for the VLN-CE agent. The waypoint predictor first provides navigable viewpoints (green circle). Then, the corresponding RGB images, depth images, and textual instructions are input to our dual-action module. The freezing sign means no parameters are updated. Please refer to Fig. 5.3 for a detailed architecture of the low-level action decoder.
Figure 5.3 Low-Level Action Decoder.
Figure 5.4 Examples of two tasks in the Spartun3D benchmark. The green bounding box and arrow in the 3D scene demonstrate the standing point and orientation.
Figure 5.5 Standing Point and Orientation Selection.
Figure 5.6 Spatial information in the Situated Scene Graph. The red dot and green arrow show the standing point and orientation, respectively. In this example, the pivot object is the "sofa", the referent object is the "TV", and the surrounding objects include the "table" and "cabinet".
Figure 5.7 Human evaluation of Spartun3D.
Figure 5.8 Spartun3D-LLM model architecture.
Figure 5.9 Examples of generated low-level actions. 0 denotes the current direction, while − means a LEFT turn. The number represents the rotation degree. The yellow bounding box indicates the target.
Figure 5.10 An example of a generated waypoint heatmap given an RGB image with and without the obstacle mask.
Figure 5.11 Scaling effects.
Figure 5.12 SQA3D labels distribution.
Figure 5.13 Qualitative examples. (a), (b), (c) situated QA examples of object attribute and relation, object affordance, and situated planning. (d) situated captioning example. (e) navigation example in a zero-shot setting.

CHAPTER 1 INTRODUCTION

1.1 Motivation

Robot agents capable of interacting with humans provide significant benefits to humans' daily lives. Understanding and following natural language instructions is critical for an intelligent agent to interact with humans and the physical world. One formal problem setting designed to make advancements in this research direction is Vision-and-Language Navigation (VLN) [4]. VLN requires an agent to carry out a sequence of actions in a photo-realistic simulated indoor environment according to the natural language instructions and the observed visual environment. To conduct this task, the agent should be equipped with three abilities: understanding the linguistic semantics of the instructions, perceiving the visual images in the environment, and connecting and reasoning over the two modalities [152, 114]. Unlike other Vision and Language (VL) tasks, such as Visual Question Answering (VQA), the VLN task is more challenging since the visual information dynamically changes during navigation. Besides, the VLN problem setting can be seen as a Partially Observable Markov Decision Process (POMDP), where the agent relies heavily on historical information to make the next action decision. To be specific, as shown in Fig. 1.1, at each navigation step, the agent observes the panoramic views of the current visual environment and selects an action based on the given instructions. The task is set in a single-turn scenario, where instructions provide initial guidance and remain unchanged during the navigation. There are two distinct VLN settings: the Discrete Environment Setting (VLN-DE) [4] and the Continuous Environment Setting (VLN-CE) [54].
The two settings employ different simulators: VLN-DE uses Matterport3D [7], while VLN-CE uses Habitat 3D [90]. Another primary difference between the two settings lies in their action space. In VLN-DE, the action space is the selection of images/views, where the agent selects candidate views from panoramic views based on the connectivity graph. In VLN-CE, the action space consists of low-level controls (LEFT/RIGHT/FORWARD/STOP), which are closer to real-world robot operations.

Figure 1.1 VLN Task Demonstration.

We use the following formulation of this problem in this thesis: given an instruction with a sequence of word tokens, denoted as 𝑋 = {𝑥1, 𝑥2, · · · , 𝑥𝐿}, where 𝐿 is the number of tokens, the agent observes a panoramic view including 36 viewpoints (12 headings and 3 elevations with a 30-degree interval) at each navigation step 𝑡. There are 𝑛 candidate viewpoints that the agent can navigate to in the panoramic images, denoted as 𝐼 = {𝐼1, 𝐼2, · · · , 𝐼𝑛}. The agent infers an action 𝑎𝑡 that transfers the agent from state 𝑠𝑡−1 to a new state 𝑠𝑡. The state consists of the navigation history and the current spatial position. The process can be formulated as follows:

𝑠𝑡, 𝑎𝑡 = NAV(𝑠𝑡−1, 𝑋, 𝐼),     (1.1)

where NAV is the navigation agent. The agent needs to execute a sequence of actions that brings it close to a goal destination. The navigation terminates when the navigation agent selects the STOP action or reaches a pre-defined maximum number of steps.
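To make the formulation in Eq. (1.1) concrete, the sketch below shows a minimal episode loop. The agent and simulator interfaces (`NavAgent`, `sim.get_candidate_views`, `sim.move_to`) and the step limit are assumed names for illustration, not the implementation used in this thesis.

```python
# Minimal sketch of the episode loop corresponding to Eq. (1.1).
# NavAgent and the simulator API are hypothetical placeholders.
from typing import List, Tuple

MAX_STEPS = 15          # pre-defined maximum number of steps
STOP = "STOP"           # special action that terminates navigation


class NavAgent:
    def reset(self) -> dict:
        """Return the initial state s_0 (empty history, start pose)."""
        return {"history": [], "pose": None}

    def step(self, state: dict, instruction_tokens: List[str],
             candidate_views) -> Tuple[dict, str]:
        """One decision step: (s_{t-1}, X, I) -> (s_t, a_t)."""
        raise NotImplementedError


def run_episode(agent: NavAgent, sim, instruction_tokens: List[str]) -> dict:
    state = agent.reset()
    for _ in range(MAX_STEPS):
        candidates = sim.get_candidate_views()     # I = {I_1, ..., I_n}
        state, action = agent.step(state, instruction_tokens, candidates)
        if action == STOP:                         # agent decides to stop
            break
        sim.move_to(action)                        # execute a_t in the environment
    return state
```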
The VLN task has attracted significant attention, leading to many methods being proposed to advance this direction of research [140]. The task is essentially formulated as a sequence-to-sequence problem to generate an action sequence. The types of techniques have evolved dramatically during the last few years. The baseline was based on Long Short-Term Memory (LSTM) [4, 106] when the task was initially proposed, while the current state-of-the-art models rely on Transformer architectures [39, 10]. At present, there is a growing trend towards exploring LLM-based navigation agents [145, 79]. In short, the VLN task serves as a valuable test bed for evaluating the effectiveness of multimodal reasoning and embodied AI.

Figure 1.2 Challenges of Language Grounding in the VLN Task.

1.2 Challenges and Contributions

Although numerous methods have been proposed, most of the work focuses mainly on modeling visual information and training strategies. However, our research primarily addresses improving the navigation agent's language understanding and grounding ability. Such ability is important to help the agent comprehend text components in the instruction and align them within the visual environment. We summarize the following three challenges for language grounding in the VLN task and introduce our corresponding solutions in Fig. 1.2, which are also detailed below.

(a) Lacking Explicit Grounding with Entangled Vision and Spatial Understanding. Most VLN agents primarily rely on attention mechanisms to implicitly learn the correlation between the vision and text modalities. However, these approaches often mix different semantic elements from both modalities, leading to an entangled representation that lacks explicit alignment. As shown in Fig. 1.2 (a), it remains challenging to discern whether a navigation failure comes from the agent incorrectly identifying the progress of the navigation, locating the wrong landmark, or heading in the wrong direction. Distinguishing these different aspects of semantics is important to help the agent better understand instructions and further improve the interpretability of the agent's actions.

Contributions. 1) We propose a method to explicitly align the spatial semantics in the linguistic instructions and the visual environment. We first split a long and complex instruction into spatial configurations, defined as the smallest spatial units containing various spatial components. We then select the landmarks the agent should focus on based on navigation progress and align them with the objects in the visual environment. We also model the spatial relations between landmarks and the agent's position. Our experimental results demonstrate that our explicit modeling helps with both navigation performance and the interpretability of the navigation agent. 2) Unlike previous VLN agents that intertwine the learning of orientation and visual signals, we propose a method that employs specialized modules to learn these signals independently. To achieve this, we introduce specific pre-training tasks to distill more explicit spatial and visual knowledge, which is then effectively utilized within the corresponding modules of the navigation agent. Our modular design interacts with specialized pre-training tasks, enhancing the agent's ability to adapt to downstream navigation. Our results surpass the SOTA models on several navigation benchmarks.

(b) Ambiguous Instructions and Lack of Explainability. Although explicit spatial semantics modeling can help the agent understand instructions, two types of instructions make the grounding very challenging. The first is when the instruction contains landmarks that may not be easily recognizable. For example, in Fig. 1.2 (b), identifying the "kitchen" and the "living room" in the target view may be less straightforward than the "sofa" and the "dining table". Another challenge arises when instructions contain landmarks that could potentially apply to multiple targets, such as "door" or "wall", which are often observed in every scene. These instructions cause the explicit and fine-grained grounding to be less effective for the VLN task.

Contributions. 1) The first main idea of our work is to introduce a translator module in the VLN agent, which takes the given instruction and visual environment as inputs and then converts them to easy-to-follow sub-instruction representations focusing on two aspects:
a) Recognizable landmarks, chosen based on the navigation agent's visual perception ability. b) Distinctive landmarks, chosen to help the navigation agent distinguish the targeted viewpoint from the candidate viewpoints. The translator enhances the connection between the given instruction and the agent's observation of the visual environment. We also construct a high-quality synthetic sub-instruction dataset and design specialized tasks for pre-training the translator and the navigation agent. We evaluate our method on several navigation benchmarks to show its effectiveness. 2) The translator's design relies on implicit learning, making it challenging to explicitly interpret the ambiguities encountered by the agent. To address this, we build a language model into our VLN agent and jointly tune it with the VLN task. We aim to generate natural language explanations that describe the agent's difficulties in understanding instructions within the visual environment. Additionally, the explainer provides justifications for the agent's actions based on the navigation process's rationale. To train the explainer, we construct a synthetic dataset that aligns landmarks mentioned in the instructions with visible and distinctive objects in the visual environment. Our empirical results indicate the effectiveness of our approach and achieve SOTA results when evaluated on VLN benchmarks.

(c) Language Understanding for Navigation in Continuous and 3D Environments. The ultimate goal of VLN is to develop a navigation agent that can be deployed on real-world robots. While most task settings and simulated environments are discrete, real-world environments are continuous and require agents to develop a 3D understanding for planning low-level actions (as illustrated in Fig. 1.2 (c)). Recent VLN-CE (Continuous Environment) research has made significant advancements in building more realistic simulated experimental settings by incorporating continuous environments. However, existing VLN-CE agents exhibit weak grounding abilities, often overlooking the relationship between instructions and corresponding low-level actions within the visual environment. Meanwhile, large language models (LLMs) have shown promise in reasoning over 3D information, making them a compelling choice for enhancing embodied navigation. Despite this, most current 3D-based LLMs lack situated understanding, an essential capability for effective navigation in real-world settings.

Contributions. 1) To strengthen the language grounding ability of the VLN-CE agent, we introduce a dual-action-perception module in which the agent selects high-level viewpoints while generating low-level action sequences simultaneously. The high-level actions serve as guidance, facilitating the agent's understanding of the relationships between low-level actions and the navigable areas indicated by high-level actions. Our design enhances the VLN-CE agent's spatial grounding ability to connect actions with visual perception and language understanding. 2) To enhance situated spatial understanding in 3D-based LLMs, we introduce a scalable LLM-generated dataset that incorporates diverse situated spatial information, conditioned on the agent's standpoint and orientation. By fine-tuning existing 3D-based LLMs with our dataset, we significantly improve their ability to comprehend spatial relationships in a 3D world. Moreover, the enhanced spatial understanding shows strong generalization to navigation tasks without training on navigation datasets.
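As a rough illustration of the dual-action idea in contribution (c.1), the sketch below pairs a high-level viewpoint scorer with a low-level action decoder over shared cross-modal features. The module names, feature dimensions, and the four-way low-level action space are assumptions for illustration only, not the exact architecture proposed in Chapter 5.

```python
import torch
import torch.nn as nn


class DualActionHead(nn.Module):
    """Hedged sketch: score candidate viewpoints (high-level action) and decode
    a short low-level control sequence from the same cross-modal features."""

    def __init__(self, hidden_dim=768, num_low_level_actions=4, max_low_steps=8):
        super().__init__()
        self.high_level_scorer = nn.Linear(hidden_dim, 1)   # one score per candidate view
        self.low_level_decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.low_level_head = nn.Linear(hidden_dim, num_low_level_actions)
        self.max_low_steps = max_low_steps

    def forward(self, candidate_feats):                     # (batch, n_candidates, hidden)
        # High-level action: a distribution over candidate viewpoints.
        high_logits = self.high_level_scorer(candidate_feats).squeeze(-1)
        # Low-level actions: decode a control sequence conditioned on the
        # (soft) selected view representation, so both heads stay coupled.
        attn = torch.softmax(high_logits, dim=-1).unsqueeze(-1)
        selected = (attn * candidate_feats).sum(dim=1, keepdim=True)
        dec_in = selected.repeat(1, self.max_low_steps, 1)
        dec_out, _ = self.low_level_decoder(dec_in)
        low_logits = self.low_level_head(dec_out)           # (batch, max_low_steps, 4)
        return high_logits, low_logits
```

Jointly supervising both heads is what lets the low-level decoder learn how its controls relate to the navigable areas indicated by the high-level prediction.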
1.3 Thesis Outline

We organize the thesis based on the challenges and contributions mentioned above. Chapter 2 provides a comprehensive literature review covering embodied AI, vision and language learning, and the VLN task.

Chapter 3 introduces two methods for modeling explicit spatial semantics for the VLN agent. First, we present an explicit grounding approach, the Explicit Object Relation Alignment Agent (EXOR), which explicitly models spatial information in both the instruction and the visual environment. Next, we propose a neural navigation agent, Learning Orientation and Visual Signals (LOViS), which learns spatial orientation and visual perception with disentangled modules. Additionally, we design novel and specialized pre-training tasks to enhance the learning of these modules.

Chapter 4 introduces two methods to address ambiguous instructions. First, we design a translator module for the VLN agent, named VLN-Trans, to translate the original instruction into easy-to-follow sub-instruction representations focusing on the recognizable and distinctive landmarks, based on the agent's visual abilities and the observed visual environment. Second, we introduce a hint generator, named NavHint, which provides detailed natural language descriptions to assist the navigation agent. NavHint explains the agent's reasoning process by generating the rationale behind its actions during navigation.

Chapter 5 presents our advancements in developing navigation agents aimed at real-world applications. We introduce a VLN-CE agent equipped with a dual-action mechanism and a 3D-based LLM designed for situated spatial understanding. For the VLN-CE agent, we propose a low-level action decoder jointly trained with high-level action prediction, allowing the agent to learn and ground the selected visual view into low-level controls. For 3D-based LLMs, we introduce Spartun3D (Situated Spatial Understanding of the 3D World), a scalable dataset designed to enhance situated spatial reasoning in the 3D world, thereby advancing navigation in 3D environments.

Chapter 6 provides a comprehensive summary of our research, highlighting key contributions and insights. Additionally, we explore potential future directions, with an emphasis on leveraging large generative models and compositional learning to enhance performance in embodied tasks.

CHAPTER 2 LITERATURE REVIEW

This chapter provides a comprehensive literature review of the background of this research, including Embodied AI, Vision and Language Learning, and Vision and Language Navigation datasets, evaluation metrics, backbone architectures, and key approaches.

2.1 Embodied AI

Embodied Artificial Intelligence (Embodied AI) enables AI agents to learn through interactions with their environment from an egocentric perception similar to humans [24]. Different from traditional AI algorithms that learn from datasets of images, videos, or text collected primarily from the Internet, Embodied AI aims to acquire knowledge dynamically through interaction with the environment. This line of research has led to significant progress in embodied AI simulators that help replicate the physical world. These simulators work as virtual testbeds to train and test embodied AI agents before deploying them into the real world. Popular embodied AI simulators include MP3D [7], Habitat3D [97, 104], AI2-THOR [50], VirtualHOME [84], and iGibson [122], among others.
Embodied AI simulators have also boosted a series of embodied AI research tasks, including visual exploration, visual navigation, and embodied QA [24]. The tasks increase in complexity as they advance from exploration to QA. In visual exploration, an agent gathers information about a 3D environment through movement and visual perception, which can be done before or concurrently with navigation tasks. The agent is free to explore the environment within a limited number of steps before the start of navigation [3], or it builds the map as it navigates in an unseen environment [27, 75, 76]. In visual navigation, an agent navigates in a 3D environment following a natural language instruction [3]. The challenging aspect of visual navigation is that it requires agents to make action predictions based on historical visual information and actions. Embodied QA is currently considered the most complicated task in embodied AI research since it requires the agent to possess a wide range of capabilities such as visual recognition, language understanding, question answering, commonsense reasoning, task planning, and goal-driven navigation. A common framework for Embodied QA can be divided into a navigation task and a QA task, where the navigation explores the environment and the QA module is executed based on the previous paths when the agent decides to stop.

2.2 Vision and Language Learning

In the past few years, Computer Vision (CV) and Natural Language Processing (NLP) have played important roles in deep learning research [126, 31, 101, 33, 119]. In addition to the significant progress in single-modality pre-trained models, there has been an upsurge of research focused on pre-training large-scale models on both vision and language modalities, called Vision-Language Pre-Training Models (VLMs). Such VLMs are supposed to learn universal cross-modal representations, which are beneficial for achieving strong performance in downstream VL tasks [133, 127, 32, 22, 140, 82]. Generally speaking, given image-text pairs, VLMs employ a text encoder and an image encoder to extract image and text features and then learn the vision-language correlation with pre-training or downstream training objectives. There are two types of mainstream vision and language pre-training architectures. (1) Single-stream models [65, 103, 68] fuse the language and vision representations directly with a joint cross-modal encoder. Specifically, VisualBERT [65] utilizes segment embeddings to indicate input elements from different sources. OSCAR [68] includes object tags detected from the image rather than focusing only on image-text pairs. However, the single-stream architecture may neglect intra-modality interaction since it directly applies self-attention between different modalities. (2) Double-stream models apply intra-modality processing to the two modalities separately along with a shared cross-modal encoder. They assume that intra-modal interaction and cross-modal interaction are better separated to learn the corresponding representations. For example, ViLBERT [72] utilizes two Transformers to model intra-modality interaction after the cross-modal module. LXMERT [105] uses a self-attention sub-layer after cross-attention to learn internal connections for each modality. ALBEF [64] employs two Transformers before cross-attention to decouple the learning of the modalities before their interaction. VLM pre-training is usually guided by certain vision-language objectives that enable learning image-text relations from large-scale vision and language datasets [89, 123, 125]. For example, CLIP [89] utilizes an image-text contrastive objective to learn representations that pull paired images and texts close and push mismatched pairs far apart in the embedding space. This is helpful for learning representations that perform zero-shot predictions.
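The symmetric image-text contrastive objective used by CLIP-style models can be sketched as follows. This is a generic InfoNCE-style formulation for illustration, not code from any of the cited works.

```python
import torch
import torch.nn.functional as F


def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Pull matched image-text pairs together and push mismatched pairs apart.
    image_emb, text_emb: (batch, dim) embeddings of paired images and texts."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)        # text -> matching image
    return (loss_i2t + loss_t2i) / 2
```

The temperature controls how sharply mismatched pairs are penalized; in-batch negatives are what make the objective scale to large web-crawled datasets.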
For example, CLIP [89] utilizes an image-text contrastive objective to learn the representation that pull 8 the paired images and texts close and pushing others faraway in the embedding space. This is helpful to learn representations that perform zero-shot predictions. Following CLIP, the research mainly focuses on transfer learning to adapt the pre-training VLMs towards various downstream tasks [148, 147, 26, 131], such as the methods of prompt tuning [148], visual adaption [147], etc. There is another research working on knowledge distillation to obtain knowledge from VLMs to downstream tasks for better performance in object detection, semantic segmentation, etc [19, 29, 23]. 2.3 Vision and Language Navigation VLN [3] is a task where agents navigate within the environment by following natural language instructions. The main challenging part of this task lies in the agent’s requirement to make action decisions based on visual perception, language understanding, and history memories. Unlike VL tasks like Visual Question Answering (VQA) and image captioning, which typically take a single input question and static images as input, the images in the VLN task dynamically change [130]. The VLN agent needs to ground language to new visual information while considering historical information. The VLN task is crucial for the Intelligent Agent as its broad applications such as autonomous driving, virtual assistants, and augmented reality. In the following sub-sections, we introduce VLN from the aspects of its simulator and dataset, evaluation metrics and solutions. 2.3.1 Simulator and Dataset Our primary focus is the indoor navigation task, leveraging the Matterport3D [7] and Habitat [97] simulators. Matterport3D [7] contains 10800 panoramic views of 90 scenes, including houses, apartments, hotels, and offices. It supports the agent collecting surrounding visual information, including the simulated RGB and depth images, as well as semantic segmentation. Habitat [97] provides a diverse collection of highly detailed 3D environments, including geometry, texture, and lighting. It offers the visual details for the navigation agent with RGB images, depth maps, and semantic segmentation. With the introduction of the simulators, many navigation datasets were introduced [4, 55, 48, 108]. We mainly work on three VLN datasets: Room-to-Room (R2R) [4], R4R [48] and R2R-CE [54]. R2R is built upon the Matterport3D dataset. The instructions in the R2R are fine-grained 9 navigation commands, such as "Walking pass between the living room and kitchen, and stop beside the sofa." This dataset has 7198 paths and 21567 instructions with an average length of 29 words. The whole dataset is partitioned into training, seen validation, unseen validation, and unseen test set. The seen set shares the same visual environments as the training set, while unseen sets contain different environments. There are 4675 trajectories and 340 trajectories in 61 visual scenes for the train set and validation seen set, respectively. Each path is paired with more than 3 instruction. For the validation unseen set, there are 783 trajectories in 11 scenes. R4R extends the R2R dataset with longer instructions and trajectories by concatenating two adjacent tail-to-head trajectories in R2R. Different from R2R, the trajectories in R4R are less biased as they are not necessarily the shortest path from the start viewpoint to the destination. 
It also contains three sets: train (61 scenes, 233,613 instructions), validation seen (61 scenes, 1,035 instructions), and validation unseen (11 scenes, 45,162 instructions).

R2R-CE uses the Habitat simulator [104] to render environment observations based on the MP3D dataset [7]. The dataset statistics are the same as those of R2R.

2.3.2 Evaluation Metrics

Three main metrics are used to evaluate navigation wayfinding performance [4]: (1) Navigation Error (NE): the mean of the shortest-path distance between the agent's final position and the goal destination. (2) Success Rate (SR): the percentage of predicted final positions within 3 meters of the goal destination. (3) Success Rate weighted by Path Length (SPL): normalizes the success rate by trajectory length. Another three metrics are used to measure the fidelity between the predicted and the ground-truth trajectory: (4) Coverage weighted by Length Score (CLS) [48]: measures the fidelity of the predicted trajectory to the ground-truth trajectory. (5) Normalized Dynamic Time Warping (nDTW) [47]: penalizes deviations from the ground-truth trajectories. (6) Normalized Dynamic Time Warping weighted by Success Rate (sDTW) [47]: penalizes deviations from the ground-truth trajectories and also considers the success rate.

2.3.3 Main Techniques

The VLN baseline model was first proposed by [4] with the R2R dataset, which extends instruction following to photo-realistic simulated environments. Subsequent studies have emerged with an emphasis on enhancing navigation performance through multi-modal learning [36, 60, 136, 1, 134], map representation learning [41, 9, 2], graph-based exploration [150, 111, 12], data augmentation [118, 61, 57, 56, 58, 117], large language modeling [146, 8, 109, 87], and auxiliary reasoning tasks or pre-training proxy tasks that guide the navigation agent to learn textual and visual representations [151, 10, 34, 88, 137]. In the following sections, we begin by presenting two basic architectures (LSTM-based and Transformer-based) of the VLN agent. Then, we introduce works that focus on enhancing the language grounding and generation ability of the VLN agent.

LSTM-based VLN Agent. The earlier models mostly depend on an LSTM-based sequence-to-sequence architecture for encoding the text and visual information, establishing the connections with an attention mechanism, and decoding the actions [4, 73, 25]. The encoder is a bidirectional LSTM-RNN with an embedding layer to obtain the language representation, denoted as [s_1, s_2, · · · , s_l] = BiLSTM(F(⟨x_1, x_2, · · · , x_l⟩)), where F represents the embedding function. The decoder is also an attentive LSTM-RNN. At each decoding step t of navigation, the agent first attends to the panoramic image representation f^p with the previous hidden context feature h̃_{t−1}. The visual representation of the i-th panoramic image is denoted as f^p_i = [ResNet(I^p_i); d_i], which is the concatenation of the ResNet visual feature ResNet(I^p_i) and the corresponding 128-dimensional direction encoding d_i. The direction encoding d_i for the i-th panoramic image is the replication of [cos θ_i, sin θ_i, cos φ_i, sin φ_i] by 32 times, where θ_i and φ_i are the heading and elevation angles of the i-th panoramic image.
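The construction of this panoramic feature, a ResNet feature concatenated with the 128-dimensional direction encoding just described, can be sketched as follows; the ResNet feature itself is assumed to be precomputed.

```python
import math
import torch


def direction_encoding(heading: float, elevation: float, repeat: int = 32) -> torch.Tensor:
    """128-d direction encoding: [cos(theta), sin(theta), cos(phi), sin(phi)] tiled 32 times."""
    base = torch.tensor([math.cos(heading), math.sin(heading),
                         math.cos(elevation), math.sin(elevation)])
    return base.repeat(repeat)                       # shape: (128,)


def panoramic_feature(resnet_feat: torch.Tensor, heading: float, elevation: float) -> torch.Tensor:
    """f^p_i = [ResNet(I^p_i); d_i]: concatenate the visual feature and the direction encoding."""
    return torch.cat([resnet_feat, direction_encoding(heading, elevation)], dim=-1)
```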
The attentive panoramic visual feature f̃^p_t is computed by f̃^p_t = SoftAttn(Q = h̃_{t−1}, K = f^p_t, V = f^p_t), and it is then used as input to the LSTM of the decoder to represent the agent's current state:

h_t = LSTM([a_{t−1}; f̃^p_t], h̃_{t−1}),     (2.1)

where a_{t−1} is the selected action direction of the previous navigation step, and h̃_{t−1} is the hidden context after considering the grounded objects.

Transformer-based VLN Agent. Compared with conventional methods, Transformer-based models show great improvements in VL tasks [105, 13, 72, 68]. VLN needs to learn the correspondence between language and dynamic visual observations by interacting with the environment. In the past few years, the VLN task has been formulated as a dynamic grounding problem between texts and images. PRESS [59] first fine-tunes a pre-trained language model, BERT, to obtain the text representation. PREVALENT [34] trains a VL Transformer with a large number of image-text-action triplets to learn cross-modal representations for the navigation task. RecBERT [40] designs a state unit to store history information and trains the Transformer recurrently for direct navigation. HAMT [10] proposes to explicitly encode all past observations and actions as history. Also, they improve the performance by changing the fixed vision features to Vision Transformer (ViT) features [20].

VLN⟳BERT is the most popular backbone of Transformer-based navigation agents. It is a cross-modal Transformer-based navigation agent with a specially designed recurrent state unit. At each navigation step, the agent takes three inputs: a text representation, a vision representation, and a state representation. The text representation 𝑋 for instruction 𝑊 is denoted as 𝑋 = [𝑥1, 𝑥2, · · · , 𝑥𝐿]. The vision representation 𝑉 for candidate viewpoints 𝐼 is denoted as 𝑉 = [𝑣1, 𝑣2, · · · , 𝑣𝑛]. The recurrent state representation 𝑆𝑡 stores the history information of previous steps and is updated based on 𝑋 and 𝑉𝑡 at the current step. The state representation 𝑆𝑡, along with 𝑋 and 𝑉𝑡, is passed to cross-modal Transformer layers and self-attention layers to learn the cross-modal representations and select an action, as follows:

X̂, Ŝ_t, V̂_t = Cross_Attn(𝑋, [𝑆𝑡; 𝑉𝑡]),     (2.2)
𝑆𝑡+1, 𝑎𝑡 = Self_Attn(Ŝ_t, V̂_t),     (2.3)

where X̂, Ŝ_t, and V̂_t represent the text, recurrent state, and visual representations after the cross-modal Transformer layers, respectively. The action is selected based on the self-attention scores between Ŝ_t and V̂_t. 𝑆𝑡+1 is the updated state representation, and 𝑎𝑡 contains the probabilities of the actions.
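A minimal sketch of this recurrent state update and action scoring is given below. Standard PyTorch attention layers stand in for the cross-modal and self-attention Transformer layers of Eqs. (2.2)-(2.3); this is an illustration, not the released VLN⟳BERT implementation, and the text-stream update in Eq. (2.2) is omitted for brevity.

```python
import torch
import torch.nn as nn


class RecurrentVLNStep(nn.Module):
    """Sketch of one navigation step in a recurrent VLN-BERT-style agent:
    cross-modal attention between [state; vision] and text, then self-attention
    over [state; vision] to update the state and score candidate views."""

    def __init__(self, hidden_dim=768, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, text, state, vision):
        # text: (B, L, H), state: (B, 1, H), vision: (B, n, H)
        sv = torch.cat([state, vision], dim=1)
        # Eq. (2.2): exchange information with the instruction tokens.
        sv_hat, _ = self.cross_attn(query=sv, key=text, value=text)
        # Eq. (2.3): self-attention over [state; vision]; the state token's
        # (already normalized) attention over candidate views gives the action scores.
        out, attn_weights = self.self_attn(query=sv_hat, key=sv_hat, value=sv_hat)
        new_state = out[:, :1, :]                  # updated S_{t+1}
        action_scores = attn_weights[:, 0, 1:]     # state's attention over the n views
        return new_state, action_scores
```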
Explicit Grounding in the VLN Agent. Mainstream works use Transformer-based models to implicitly capture cross-modality information and demonstrate outstanding navigation performance [40, 34, 30, 10]. There are also works showing that modeling the semantic structure explicitly enhances textual-visual matching [36, 37, 86, 134, 59]. RelGraph [36] builds an implicit language-visual entity relation graph to learn the connection between the text and vision modalities. SpC-NAV [134] first splits the long instructions into spatial configurations [17, 134, 51] and then explicitly aligns the landmarks and spatial relations in the spatial configurations to the corresponding information in the visual modality. OAAM [86] attempts to decompose the instruction into action and object phrases and relate them to the visual environment to make the final decisions. NvEM [1] extends OAAM to divide the object module into subject and reference modules and fuses information from the neighboring views.

Language-Capable VLN Agent. Equipping the navigation agent with the ability to generate textual instructions is one of the primary methods to augment data and improve the agent's generalization ability and explainability. Early works employ the Speaker-Follower framework [25] to produce synthetic VLN instructions. In this framework, a speaker is trained offline using annotated R2R instructions and generates new instructions based on sequences of panoramas along a trajectory. These generated instructions are subsequently employed as augmented data to train the follower. [106] then improves such a Speaker-Follower agent by adding noise into the environment so that the speaker can generate more diverse instructions to further improve the generalizability of the agent. [71] propose to generate cross-connected house scenes as augmented data by mixing up environments to construct difficult paths for the follower and to generate the paired instructions as augmentation data. Different from the above-introduced Speaker-Follower methods that integrate the speaker and follower pipelines, there are methods that optimize the two components simultaneously. For example, [21] focus on improving the speaker model to generate higher-quality instructions by directly obtaining feedback from the follower, so that the generated instructions are more suitable for the follower. LANA [113] is a language-capable navigation agent that not only executes human-written instructions but also provides route descriptions to humans at the same time. [142] propose an instruction-trajectory compatibility model that operates without reference instructions to improve instruction evaluation.

CHAPTER 3 EXPLICIT SPATIAL UNDERSTANDING AND GROUNDING

3.1 Introduction

Many VLN agents have been developed to establish the connection between text and vision modalities using an attention mechanism to relate the tokens from a given instruction to the images in a panoramic photo [4, 25, 73, 124]. While these models enhance navigation performance, there is no clear evidence that the agent can effectively align components of the visual environment with the natural language instructions [37]. Surprisingly, prior research [45] has shown that successful navigation is still possible even in the absence of visual information, suggesting that these models may not rely on multimodal grounding to make action decisions. This observation highlights the critical need for a more comprehensive investigation into the grounding capabilities of VLN agents, particularly in their ability to establish explicit correspondences between semantic elements in navigation instructions and their visual counterparts in the environment.

Two fundamental abilities are crucial for a navigation agent: spatial reasoning and visual perception. For example, spatial reasoning enables the agent to interpret directional instructions such as "90-degree left-turn" or "on your right". Visual perception allows the agent to recognize and identify landmarks mentioned in the instructions, such as "walk to the sofa" or "pass the table". To effectively integrate these capabilities, the agent must align motion-related and landmark-related tokens with their corresponding visual representations. This requires understanding spatial relationships to head in the correct direction and associating objects in the environment with their textual references.
In this chapter, we introduce two neural navigation agents designed to enhance a VLN agent's grounding ability from these two perspectives.

The first navigation agent, named the Explicit Object Relation Alignment Agent (EXOR) [136], is developed to explicitly align the spatial semantics between linguistic instructions and the visual environment. Specifically, we first split the long instruction into spatial configurations [17, 134], and then we select the important landmarks based on such configurations. After that, in the visual environment, we retrieve the most relevant objects according to their similarity with the selected landmarks in the instructions. Moreover, we obtain a textual spatial relation encoding to model the spatial relations between the agent and the landmarks in the textual instructions, and use a visual spatial relation encoding to represent the relation between the agent and the image in the visual environment. We then establish a mapping between the two encodings to achieve a better alignment. Finally, we use the representations of the aligned objects and spatial relations to enrich the vision representations.

The second navigation agent, called Learning Orientation and Visual Signals (LOViS) [137], has different modules to select actions based on the orientation and vision perspectives separately. Moreover, we design specific pre-training tasks to distill spatial and visual knowledge independently, which is better utilized in the corresponding modules of our navigation agent. This is different from the majority of methods that employ pre-training tasks without considering the needs of the target downstream tasks. Our modular design interacts with modular pre-training, guiding the agents to generate specialized representations that can be better adapted to the downstream tasks.

We evaluate our methods on the R2R benchmark and conduct comprehensive ablation studies to further validate the effectiveness of our proposed grounding components. Additionally, we provide qualitative examples to illustrate how our agents leverage different semantics to make action decisions. Our contributions are summarized as follows:

1. We focus on different semantic aspects of instructions, particularly visual perception and spatial reasoning. We explicitly model these two aspects from both the instructions and the visual environment, and we align them to enhance the agent's navigation performance and the interpretability of its actions.

2. We design two separate modules to capture the orientation and visual information signals for the VLN agent. This enables the agent to select an action more effectively by leveraging both information sources. We design new pre-training tasks to emphasize (a) learning spatial reasoning and grounding the orientation information in the environment; and (b) learning visual perception and grounding landmark mentions in the environment. These pre-training representations are utilized in the corresponding modules of the navigation model.

3. Our experimental results demonstrate that our explicit modeling enhances the VLN agent's grounding ability and improves overall navigation performance.

Figure 3.1 Spatial configuration example: (a) Spatial Configuration Scheme; (b) Spatial Configuration Annotation.

3.2 EXOR: Explicit Object Relation Alignment

Fig. 3.2 shows the EXOR model architecture. The model has four sub-modules: (1) spatial configuration, (2) top-k landmark selection, (3) landmark-object alignment, and (4) landmark-object spatial relation alignment.
Figure 3.2 Model architecture of EXOR.

The text highlighted in green and yellow in (1) shows motion indicators and landmarks, respectively. The red arrow in (4) is the initial agent heading (i.e., orientation).

3.2.1 Spatial Configuration

A spatial configuration is the smallest linguistic unit that describes the location/trans-location of an object with respect to a reference or a path that can be perceived in the environment. It contains fine-grained spatial roles, such as motion indicator, landmark, spatial indicator, and trajector. Essentially, each spatial configuration forms a sub-instruction in our setting. Fig. 3.1 shows an example of splitting an instruction into its corresponding spatial configurations and the extracted spatial roles. The instruction "Move to the table with chair, and stop." can be split into two spatial configurations: "move to the table with chair" and "stop". In configuration 1, "move" is the motion indicator, "to" is a spatial indicator, and "table" is the landmark. "table with a chair" is a nested spatial configuration of configuration 1. The role of "table" is trajector, "with" is a spatial indicator, and "chair" is a landmark. In configuration 2, "stop" is a motion indicator.

Previous research argues that representing the semantic structure of the language could improve the reasoning capabilities of deep learning models [17, 143]. There are relevant works modeling the meaning of spatial semantics in probabilistic models [49, 107] and neural models [93, 28]. However, its impact on deep learning models for navigation remains an open research problem.

To obtain the configurations in a navigation instruction, we first split the instructions into sentences. Then we design a parser with rules applied on top of an off-the-shelf dependency parser (spaCy, https://spacy.io/) to extract all the verb phrases and noun phrases in each sentence. In general, each configuration contains at most one motion indicator. Since we aim to process instructions and look for motions, we split the sentences with the extracted verb phrases as motion indicators to obtain spatial configurations. We do not separate the nested configurations with no motion indicator and keep them attached to the dynamic configurations (i.e., the ones with a motion indicator). As shown in Figure 3.1, "table with chair" is the nested spatial configuration of "move to the table with chair". Here, we only consider the prepositions that are attached to verbs, and we merge spatial indicators with motion indicators, such as "move to", and use them together as the motion indicator. After that, we insert a pseudo delimiter token after each configuration and identify their contained noun phrases as landmarks. A rough sketch of this parsing step is shown below.
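The following sketch illustrates this rule-based splitting with spaCy. The heuristic here (split at verbs and keep noun chunks as landmarks) is a simplification of the parser described above and is only meant to convey the idea; it assumes the `en_core_web_sm` model is installed.

```python
# Illustrative sketch of splitting an instruction into spatial configurations
# with spaCy; a simplified stand-in for the rule-based parser described above.
import spacy

nlp = spacy.load("en_core_web_sm")


def split_into_configurations(instruction: str):
    configurations = []
    for sent in nlp(instruction).sents:
        current = []
        for token in sent:
            # Start a new configuration at each verb (treated as a motion indicator).
            if token.pos_ == "VERB" and current:
                configurations.append(current)
                current = []
            current.append(token.text)
        if current:
            configurations.append(current)
    # Landmarks are the noun chunks contained in each configuration.
    configs = []
    for words in configurations:
        span = nlp(" ".join(words))
        configs.append({"text": span.text,
                        "landmarks": [chunk.text for chunk in span.noun_chunks]})
    return configs


print(split_into_configurations("Move to the table with chair, and stop."))
# Roughly yields one configuration around "move ... table with chair"
# (landmarks: "the table", "chair") and one around "stop".
```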
Formally, each configuration contains $n$ landmarks, denoted as $L = \langle L_1, L_2, \cdots, L_n \rangle$; with $m$ spatial configurations, the total number of landmarks is $m \times n$. After sorting all landmarks based on the spatial configuration weights $\beta$, we obtain the top-$k$ selected landmark representations, $\tilde{L} = \langle \tilde{L}_1, \tilde{L}_2, \cdots, \tilde{L}_k \rangle$. We obtain the best result when $k$ is 3.
3.2.3 Landmark-Object Alignment
After selecting the top landmarks, the next step is to align them with the corresponding objects in the image. We use Faster-RCNN to detect 36 objects in each image, and the object representation of the $i$-th image is $O_i = [o_{i,1}, o_{i,2}, \cdots, o_{i,36}]$. We compute the cosine similarity scores between the $j$-th landmark among the top-$k$ landmarks and all objects in the $i$-th image, and select the object with the highest similarity score as the most relevant object to the $j$-th landmark, i.e., $\hat{O}_{i,L_j} = \max(\mathrm{cos\_sim}(\tilde{L}_j, O_i))$. The aligned objects in the $i$-th image are denoted as $\hat{O}_i = [\hat{O}_{i,L_1}, \hat{O}_{i,L_2}, \cdots, \hat{O}_{i,L_k}]$; we get $k$ aligned objects since we have top-$k$ landmarks. Finally, we concatenate the aligned object representations with the candidate image features $f^c$. The $i$-th candidate image is represented as $f^c_i = [\mathrm{ResNet}(v^c_i); d_i]$. After being aligned with the corresponding objects, its representation is updated as $\hat{f}^c_i = [f^c_i; \hat{O}_i]$.
3.2.4 Landmark-Object Spatial Relation Alignment
We model both textual spatial relations and visual spatial relations. On the text side, there are mainly three different cases of spatial relations described in navigation instructions:
• Case 1. Motion verbs, such as "turn left to the table";
• Case 2. Relative spatial relationships between the agent and landmarks, such as "table on your left";
• Case 3. Spatial relationships between landmarks, such as "vase on the table".
This work mainly investigates the spatial relations from the agent's perspective, and we only model the first two cases. We extract "landmark-relation" pairs for each landmark in the instructions based on syntactic rules. For Case 1, we pair the spatial relation with all landmarks in the configuration; for example, for "turn left to the table with the chair", the extracted pairs are {table-left} and {chair-left}. For Case 2, we pair the relation with the related landmark; for example, for "go to the sofa on the right.", the extracted pair is {sofa-right}. We encode the spatial relations of the landmarks in six bits [left, right, front, back, up, down] as the textual spatial relation encoding. A bit is set to 1 for a landmark if its paired relation includes the corresponding relation. On the image side, we encode the same six spatial relations as the visual spatial relation encoding. We obtain the spatial relations of objects in the visual environment based on the relative angle, i.e., the difference between the agent's initial direction and the navigable direction. The spatial relations are the same for all objects if they are in the same image.
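The alignment step in Section 3.2.3 can be sketched as follows. This is a minimal illustration assuming the landmark and object features are already embedded in a shared space (the actual feature extractors are a text encoder and Faster-RCNN); the tensor shapes and function name are illustrative.

```python
# A minimal sketch of the landmark-object alignment in Section 3.2.3.
import torch
import torch.nn.functional as F

def align_landmarks_to_objects(landmarks, objects, candidate_feat):
    """
    landmarks:      (k, d)         top-k landmark representations
    objects:        (n_img, 36, d) object features O_i for each candidate image
    candidate_feat: (n_img, d_c)   candidate image features f^c_i
    returns:        (n_img, d_c + k * d) enriched candidate features
    """
    # Cosine similarity between every landmark and every object in every image.
    lm = F.normalize(landmarks, dim=-1)            # (k, d)
    ob = F.normalize(objects, dim=-1)              # (n_img, 36, d)
    sim = torch.einsum("kd,imd->ikm", lm, ob)      # (n_img, k, 36)

    # For each landmark, keep the most similar object in each image (argmax over 36).
    best = sim.argmax(dim=-1)                      # (n_img, k)
    idx = best.unsqueeze(-1).expand(-1, -1, objects.size(-1))
    aligned = torch.gather(objects, 1, idx)        # (n_img, k, d)

    # Concatenate the aligned object representations with the candidate image features.
    return torch.cat([candidate_feat, aligned.flatten(1)], dim=-1)

# Illustrative usage with random features.
enriched = align_landmarks_to_objects(torch.randn(3, 512),
                                      torch.randn(8, 36, 512),
                                      torch.randn(8, 2176))
print(enriched.shape)  # torch.Size([8, 3712])
```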
Formally, for the obtained top-$k$ landmarks, we denote their spatial encodings as $R^{\hat{L}} = [R^{\hat{L}}_1, R^{\hat{L}}_2, \cdots, R^{\hat{L}}_k]$. For the top-$k$ objects aligned with those landmarks, the spatial relations in the $i$-th navigable image are represented as $R^{\hat{O}}_i = [R^{\hat{O}}_{i,1}, R^{\hat{O}}_{i,2}, \cdots, R^{\hat{O}}_{i,k}]$. We compute the inner product of the spatial encodings of the top-$k$ landmarks and the top-$k$ aligned objects to obtain the spatial similarity score between the instruction and the $i$-th image, that is, $sim^R_i = R^{\hat{L}} \cdot R^{\hat{O}}_i$. Then we concatenate each aligned object spatial encoding with the corresponding similarity score, denoted as $\hat{O}_{i,R} = [[R^{\hat{O}}_{i,1}; sim^R_{i,1}], [R^{\hat{O}}_{i,2}; sim^R_{i,2}], \cdots, [R^{\hat{O}}_{i,k}; sim^R_{i,k}]]$. Finally, we further concatenate $\hat{O}_{i,R}$ with the candidate image features $\hat{f}^c_i$ (which already include the aligned object features), and the $i$-th candidate image feature is updated as $\hat{\hat{f}}^c_i = [\hat{f}^c_i; \hat{O}_{i,R}]$. The updated image representations are then used to make action decisions for the agent.
3.2.5 Action Prediction
After modeling the alignment between landmark tokens in the instruction and visual objects, the panoramic image feature is enriched with the aligned visual objects, and the candidate image feature is enriched with both visual objects and their spatial relations. Then, based on the backbone sequence-to-sequence agent, the probability of moving to the $k$-th navigable viewpoint $p_t(a_{t,k})$ is calculated as the softmax of the alignment between the navigable viewpoint features and a context-aware hidden output $\tilde{h}_t$:
$\tilde{h}_t = \tanh(W_{\tilde{c}h}[\tilde{C}; h_t])$ (3.1)
$p_t(a_{t,k}) = \mathrm{softmax}(\hat{\hat{f}}^c_i W_{\hat{c}} \tilde{h}_t)$ (3.2)
where $W_{\tilde{c}h}$ and $W_{\hat{c}}$ are learned weights.
3.3 LOViS: Learning Orientation and Visual Signals
LOViS has three main modules: a history module, an orientation module, and a vision module, as depicted in Figure 3.3.
Figure 3.3 Model architecture of LOViS.
3.3.1 History Module
The history module receives three types of inputs: the state representation $s_t$ (see "state" in Figure 3.3), the text representation $X$, and the "vision-orientation" representations. To obtain the "vision-orientation" representation, we feed the concatenation of vision and orientation representations to a "vision-orientation encoder" (see Figure 3.3). We denote the "vision-orientation" representation as $\tilde{VO} = \{\tilde{vo}_1, \tilde{vo}_2, \cdots, \tilde{vo}_k\}$. Then we use cross-modal attention layers and self-attention layers to obtain the cross representation. In the cross-modality attention Transformer layers, one modality is used as the query and the other as the key to exchange information as follows:
$\hat{X}, \hat{s}_t, \hat{VO}_t = \mathrm{Cross\_Attn}(X, [s_t; VO_t])$, (3.3)
where $\hat{X}$, $\hat{s}_t$, and $\hat{VO}_t$ are respectively the updated text, state, and "vision-orientation" representations after the cross-modality attention layers.
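The cross-modality attention exchange in Eq. 3.3 can be sketched with standard multi-head attention, as below. This is a simplified stand-in: the actual modules use full cross-modal Transformer layers (attention plus feed-forward blocks), and the dimensions and class name here are assumptions.

```python
# A minimal sketch of the cross-modality attention exchange in Eq. 3.3.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.text_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, state, vis_orient):
        # Concatenate state and vision-orientation tokens: [s_t ; VO_t].
        sv = torch.cat([state, vis_orient], dim=1)
        # Text queries attend to the state/vision-orientation tokens ...
        text_hat, _ = self.text_to_vis(query=text, key=sv, value=sv)
        # ... and the state/vision-orientation tokens attend to the text.
        sv_hat, _ = self.vis_to_text(query=sv, key=text, value=text)
        # Split back into the updated state and vision-orientation parts.
        state_hat, vo_hat = sv_hat[:, :state.size(1)], sv_hat[:, state.size(1):]
        return text_hat, state_hat, vo_hat

# Illustrative shapes: batch of 2, 40 text tokens, 1 state token, 36 views.
layer = CrossModalAttention()
x_hat, s_hat, vo_hat = layer(torch.randn(2, 40, 768),
                             torch.randn(2, 1, 768),
                             torch.randn(2, 36, 768))
print(x_hat.shape, s_hat.shape, vo_hat.shape)
```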
Then the state and "vision-orientation" representations are fed into self-attention Transformer layers:
$s_{t+1}, p^h_t = \mathrm{Self\_Attn}([\hat{s}_t; \hat{VO}_t])$ (3.4)
where $s_{t+1}$ is the updated state after the self-attention layers, and $p^h_t$ is the self-attention score between the state representation and the "vision-orientation" representations. Note that the refinement of the state representation only happens in the history module.
3.3.2 Orientation Module
Orientation information is vital for the navigation task. For example, the instruction "turn left" can assist the agent to ignore the navigable viewpoints on the right side. In our work, we build an orientation module specifically to encourage the agent to learn the spatial information from the instructions and ground it in the visual environment. Specifically, we linearly project the orientation features $O$ via the "Orientation Encoder" (see Figure 3.3) to obtain their projected representation, denoted as $\tilde{O}$. Then we input the state representation $s_t$, the text representation $X$, and the projected orientation representation $\tilde{O}$ to the cross-modality attention Transformer layers. The orientation module learns a new state representation for orientation information, denoted as $s^o_t$ (see "State-O" in Figure 3.3). For the cross-modal attention layers, we have:
$\hat{X}^o, \hat{s}^o_t, \hat{O}_t = \mathrm{Cross\_Attn}(X, [s^o_t; \tilde{O}_t])$ (3.5)
where $\hat{X}^o$, $\hat{s}^o_t$, and $\hat{O}_t$ are the updated text, state, and orientation representations after the cross-modality attention layers in the orientation module. Then we use the state representation enriched with the orientation information to perform self-attention with the orientation representations as follows:
$p^o_t = \mathrm{Self\_Attn}([\hat{s}^o_t; \hat{O}_t])$ (3.6)
where $p^o_t$ is the attention score between the state representation and the orientation features.
3.3.3 Vision Module
Connecting the mentioned landmarks in the instruction to the scene and objects in the visual environment is also important for the navigation task. In the instruction "enter into the bedroom and move close to TV.", the mentioned landmarks, such as "bedroom" and "TV", provide apparent clues for the navigation actions. Like the orientation module, we build a vision module to ground the textual landmarks in the visual scene and objects. Specifically, we first project the vision representations $V$ (refer to the notations in Section 2.3.3) using the "Vision Encoder" (see Figure 3.3) to obtain the projected visual representation, denoted as $\tilde{V}$. Then we input the state representation $s_t$, the text representation $X$, and the projected vision representation $\tilde{V}$ to the cross-modal attention and self-attention layers as follows:
$\hat{X}^v, \hat{s}^v_t, \hat{V}_t = \mathrm{Cross\_Attn}(X, [s^v_t; \tilde{V}_t])$, (3.7)
$p^v_t = \mathrm{Self\_Attn}([\hat{s}^v_t; \hat{V}_t])$, (3.8)
where $s^v_t$ is the new state representation considering visual information (see "State-V" in Figure 3.3); $\hat{X}^v$, $\hat{s}^v_t$, and $\hat{V}_t$ are the updated text, state, and vision representations after the cross-modality attention layers in the vision module; and $p^v_t$ is the attention score between the state representation and the vision representations.
3.3.4 Action Selection
For each navigable viewpoint, we obtain the self-attention scores from 1) the orientation state representation to its orientation representation (orientation module), 2) the vision state representation to the vision representation (vision module), and 3) the state representation to the combined orientation and vision representations (history module).
We combine these scores as follows:
$p_t = \mathrm{Softmax}(W_a[p^h_t; p^o_t; p^v_t])$ (3.9)
where $W_a$ is a trainable parameter, and $p_t$ denotes the action probability that weights the different module scores.
Figure 3.4 Pre-training model with specific pre-training tasks.
3.3.5 Pre-training Tasks
We follow the model architecture of PREVALENT [34] to obtain the joint cross-modal representations trained on text-image-action triplets, as shown in Figure 3.4. However, the novelty of our pre-training is that we design new tasks, named Vision Matching (VM) and Orientation Matching (OM), to pre-train the vision module and orientation module designed in our navigation agent, as shown in Figure 3.3. Moreover, we improve the existing pre-training tasks of PREVALENT, Masked Language Modeling (MLM) and Single Step Action Prediction (SSAP), to obtain a more effective initialization of our new architecture. Here, we describe the details of all the pre-training tasks. In the following tasks, we denote each instruction-trajectory pair in the training set $D$ as $\langle w, \tau \rangle$.
Masked Language Modeling (MLM). Different from PREVALENT [34], which masks random tokens, we mask direction and landmark tokens with 8% probability and replace them with the special token [MASK]. The goal is to recover the landmark or orientation tokens $w_m$ by reasoning over the surrounding words $w_{\backslash m}$ and the orientation and visual observations at each navigation step. We denote the combination of orientation and vision features of the panorama views as $VO^p$. Landmark tokens are usually tokens related to scenes or objects in the visual environment, such as "table", "sofa", and "bedroom". We extract nouns as landmark tokens based on their POS tags. The direction tokens usually convey spatial information, such as "left", "right", and "forward". We obtain direction tokens using a direction dictionary built upon the R2R training dataset. The loss of MLM is calculated as follows:
$\mathcal{L}_{MLM} = -\mathbb{E}_{VO^p \sim P(\tau), (w,\tau) \sim D} \log P(w_m \mid w_{\backslash m}, VO^p)$, (3.10)
Single Step Action Prediction (SSAP). PREVALENT [34] selects actions by mapping the [CLS] representation to the 36 classes directly, which may cause a loose connection between the cross-modal representations of the viewpoints and the action space. To address this issue, we use the cross-attention distribution from the [CLS] representation to the images in the panoramic view to select an action. We use the cross-entropy loss to compute the loss of SSAP, as follows:
$\mathcal{L}_{SSAP} = -\mathbb{E}_{VO^p \sim P(\tau), (w,\tau) \sim D} \log P(a \mid w_{[CLS]}, VO^p)$, (3.11)
where $a$ is the ground-truth action.
Vision Matching (VM) is our novel pre-training task specific to initializing our vision module. It predicts whether the current vision information matches the instruction. In this task, to encourage the agent to focus on learning the connection between landmarks in the instruction and the scene objects in the visual environment, we only use the vision representation (i.e., excluding the heading and elevation) of the viewpoint as the input, denoted as $v^p$. We generate the negative samples by replacing the ground-truth images with images from another environment. We use the output representation of the [CLS] token as the joint representation of the textual and visual features and feed it to a fully connected layer with a sigmoid function. This layer predicts the matching score $s(w, v^p)$.
The loss of VM is computed as follows:
$\mathcal{L}_{VM} = -\mathbb{E}_{v^p \sim \tau, (w,\tau) \sim D}\,[y \log P + (1-y)\log(1-P)]$, (3.12)
where $P = s(w, v^p)$, and $y \in \{0, 1\}$ indicates whether the sampled viewpoint-instruction pair is matching.
Orientation Matching (OM) is the second novel pre-training task, designed to learn the orientation representations. We propose to predict the current orientation based on the instruction and the initial orientation. As described before, the orientation feature $O^p$ is the combination of the heading $\alpha$ and the elevation $\beta$. We use the output representation of [CLS] as the joint representation of the instruction and orientation. Then we feed it to a fully connected layer to predict the four orientation feature values. The loss of OM is computed as follows:
$\mathcal{L}_{OM} = -\mathbb{E}_{O^p \sim \tau, (w,\tau) \sim D} \log P(O' \mid w_{[CLS]}, O^p)$, (3.13)
where $O'$ is the ground-truth orientation feature. The full pre-training objective is
$\mathcal{L}_{pre\text{-}train} = \mathcal{L}_{MLM} + \mathcal{L}_{SSAP} + \mathcal{L}_{VM} + \mathcal{L}_{OM}$. (3.14)
3.4 Experiments
3.4.1 Experimental Results
Table 3.1 Experimental results for EXOR compared to LSTM-based VLN agents. Methods: 1 Speaker-Follower [25], 2 Env-Drop [106], 3 Env-Drop* [106], 4 OAAM* [86], 5 Entity-Relation [36], 6 SpC-NAV [134], 7 EXOR. Metrics: Val Seen (SR↑, SPL↑, SDTW↑), Val Unseen (SR↑, SPL↑, SDTW↑), Test Unseen (SR↑, SPL↑). Values: 0.54 0.55 0.63 0.65 0.62 0.65 0.60 0.27 0.47 0.50 0.54 0.52 0.45 0.52 - - 0.53 0.53 0.54 - 0.53 - - 0.50 0.53 0.51 0.46 0.49 - - 0.47 0.50 0.48 0.44 0.46 - 0.53 0.60 0.62 0.60 0.61 0.58 - 0.43 0.48 0.50 0.50 0.42 0.49 - - 0.37 0.39 0.46 - 0.46.
Table 3.1 shows the performance of EXOR compared with LSTM-based VLN agents on the unseen validation and test sets. Notably, EXOR achieves significantly improved navigation performance compared to the baseline SpC-NAV, which also models explicit grounding between the text and vision modalities. This demonstrates that EXOR not only enhances navigation capabilities but also strengthens grounding abilities, leading to more effective alignment between linguistic and visual inputs. EXOR is better than the baseline (Env-Drop) even with their augmented data [106] (row #3), showing our improved generalizability. Compared with OAAM (row #4), which learns object-vision matching with the augmented data, EXOR gets a better SDTW, indicating that our agent can genuinely follow the instructions to the destination. However, Entity-Relation achieves better results than our method.
Table 3.2 Experimental results for LOViS compared to Transformer-based VLN agents. Methods: 1 PRESS [67], 2 PREVALENT [34], 3 AirBERT [30], 4 RecBERT [40], 5 HAMT [10], 6 RecBERT (reproduced), 7 Our pretrain + RecBERT, 8 Our pretrain + LOViS (our model). Metrics: Val Seen, Test (Unseen), Val Unseen (each NE↓, SR↑, SPL↑). Values: 0.45 0.49 4.39 0.51 0.58 3.67 0.57 0.62 2.68 0.57 0.63 2.90 0.64 - - 0.57 0.61 2.99 0.57 0.63 2.90 0.58 0.65 2.40 0.58 0.69 0.75 0.72 0.69 0.71 0.74 0.77 0.49 0.54 0.62 0.63 - 0.61 0.63 0.63 5.28 4.71 4.01 3.93 - 4.03 3.75 3.71 5.49 5.30 4.13 4.09 - 4.35 4.20 4.07 0.55 0.65 0.70 0.68 0.65 0.66 0.69 0.72 0.45 0.53 0.56 0.57 0.58 0.56 0.58 0.59.
Table 3.3 Experimental results comparing LOViS with the baseline models on the R4R dataset. Methods: EnvDrop* [106], OAAM [86], NvEM [1], RecBERT [40], LOViS (our model). Metrics: Val Seen and Val Unseen (each NE↓, SR↑, SPL↑, CLS↑, nDTW↑, sDTW↑). Values: 0.52 0.41 0.56 0.49 5.38 0.54 0.47 4.82 0.56 0.46 4.16 0.67 0.58 - - 0.53 0.54 0.51 0.50 0.56 - - 0.48 0.56 0.58 0.27 0.32 0.35 0.38 0.43 - - 0.29 0.18 0.29 0.18 6.80 0.38 0.28 6.48 0.43 0.32 6.07 0.45 0.35 0.34 0.34 0.41 0.41 0.45 - - 0.36 0.42 0.43 0.09 0.11 0.20 0.21 0.23.
Table 3.2 shows the experimental results of LOViS compared to other Transformer-based VLN methods on the R2R benchmark. In this table, rows #1 to #5 are Transformer-based navigation agents, which have largely improved over the performance of the LSTM-based agents shown in Table 3.1. PREVALENT [34] pre-trains the cross-modal representations with text-image-action triplets and replaces the encoder of Env-Drop [106] to improve its performance. AirBERT [30] is one of the SOTA methods that trains a model on large-scale and diverse in-domain datasets. RecBERT [40], our baseline, is also a SOTA method that uses the attention distribution of the history information over the navigation candidates to determine the next action. Row #4 shows the results reported in their paper, and row #6 shows our best reproduced results, which are consistent with the results reported in [71]. Rows #7 and #8 show the performance of our LOViS model. We first show the effectiveness of our pre-training on the baseline model. Our pre-training setting improves the SR and SPL of the baseline by about 2% in the unseen validation environment. Moreover, we further improve the performance of the baseline with our designed navigation model together with the pre-training setting. The improvement is about 3% in SR and SPL in the seen environment and 2% in SR in the unseen validation and test environments. This result indicates that our pre-training tasks are more suitable for our designed navigation model. We also obtain a lower NE, showing that our agent navigates closer to the destination. For HAMT [10], we report their results with ResNet-152 as the vision encoder for a fair comparison.
Table 3.3 shows the performance of LOViS compared to various models on the R4R benchmark. As on R2R, our model performs better on all evaluation metrics. Compared to our reproduced results of RecBERT [40], we improve CLS by 4%, nDTW by 1%, and sDTW by 2% in the unseen validation environment, which indicates the better fidelity of our model.
3.4.2 Ablation Study
EXOR. Table 3.4 shows the ablation study results.
Table 3.4 Ablation study for EXOR. Methods: 1 Env-Dropout [106], 2 Lan-Obj, 3 Lan-Obj+Rel, 4 Lan-Obj+Rel_v. Metrics: Val Seen and Val Unseen (SR↑, SPL↑, SDTW↑). Values: 0.55 0.59 0.60 0.59 0.43 0.48 0.49 0.47 0.49 0.52 0.53 0.52 0.53 0.55 0.58 0.56 0.37 0.43 0.46 0.44 0.47 0.50 0.52 0.52.
Row #1 is the baseline model. Row #2 (Lan-Obj) shows that explicitly modeling important landmarks and aligned objects improves the performance compared to the baseline. Rel (row #3) is the result after modeling the spatial relation tokens describing the relative relation between the agent and the landmarks. Rel_v (row #4) is the result after modeling the spatial relations in motions. The improved SDTW shows that modeling spatial relations can help the agent follow the instructions. However, the spatial terms directly describing the landmark are more helpful than the spatial terms in motions.
Different Pre-training Tasks for LOViS. In Table 3.5, we show the influence of each pre-training task on both RecBERT and LOViS. For the RecBERT baseline model, SSAP shows about 2% improvement in both seen and unseen environments. Although the tasks of VM and OM
independently do not change the performance of MLM+SSAP, the combination of the two tasks improves the performance by about 1%. The same phenomenon happens in LOViS. SSAP improves the performance by a large margin. Although VM and OM do not show significant improvement when used separately in the unseen environment, they improve both SR and SPL in the seen environment. The combination of VM and OM improves the performance significantly, especially in the seen environment.
Table 3.5 Ablation study for different pre-training tasks for LOViS (SR↑ and SPL↑ on Val Seen and Val Unseen for the baseline model and LOViS).
Tasks | Baseline: Seen SR/SPL, Unseen SR/SPL | LOViS: Seen SR/SPL, Unseen SR/SPL
1 MLM | 0.712/0.662, 0.613/0.562 | 0.724/0.673, 0.621/0.564
2 MLM+SSAP | 0.731/0.675, 0.619/0.575 | 0.747/0.695, 0.649/0.585
3 MLM+SSAP+VM | 0.737/0.683, 0.622/0.577 | 0.755/0.711, 0.637/0.581
4 MLM+SSAP+OM | 0.730/0.672, 0.617/0.574 | 0.766/0.724, 0.629/0.579
5 MLM+SSAP+VM+OM | 0.743/0.691, 0.632/0.583 | 0.774/0.722, 0.653/0.592
3.4.3 Qualitative Examples
EXOR. Figure 3.5 illustrates an example of the navigation process, visualized using the selected landmarks based on spatial configurations. The darker green spans in the instruction highlight the spatial configurations to which the agent pays greater attention. Notably, the model's attention transitions from the beginning of the instruction to the end as navigation progresses. The yellow boxes highlight the selected landmarks, demonstrating that as the agent navigates, its focus on landmarks dynamically shifts in response to the navigation progress.
Figure 3.5 An example visualization of the navigation process in EXOR. The green boxes are spatial configurations; the darker green means higher weights; the yellow boxes are the selected landmarks; the orange arrows are the path.
LOViS. Figure 3.6 shows a qualitative example that demonstrates the behavior of each module of the LOViS navigation agent. The ground-truth viewpoint is v1. The words "down" and "left" are the orientation signals. The word "stairs" is the vision signal. The attention map shows the score of the different candidate viewpoints in each module; a darker color means a higher score. The numbers below each viewpoint show the orientation information in the format (heading, elevation); a lower value means the orientation is more towards the left and down, respectively. It is evident that the orientation module gives a higher score to the viewpoints that are to the left and whose elevation is down. The vision module gives a higher score to the viewpoints in which "stairs" can be seen. The history module also gives a relatively higher score to the viewpoints on the right side. The final decision is v1, with weights of [0.02, -0.03, -0.04] for the three modules. The example shows that our designed orientation and vision modules can attend to the viewpoints with the corresponding information.
Figure 3.6 A qualitative example to show each module of LOViS (instruction: "Continue down the stairs, and take a left.").
3.5 Conclusion
We propose two neural agents that integrate the semantic elements of motions and landmarks for navigation. For EXOR, we first identify key landmarks based on spatial configurations and then guide the agent to focus on the relevant objects in the visual environment. Additionally, we explicitly model the spatial relations between the agent and the landmarks from the agent's perspective. For LOViS, we introduce vision and orientation modules in the agent's neural architecture. These modules effectively ground landmark mentions and spatial information related to the agent's orientation, as expressed in natural language instructions, into the visual environment. To further enhance their effectiveness, we design new pre-training tasks that equip the agent with spatial reasoning and visual perception abilities before navigation. We evaluate our models on the R2R and R4R datasets, achieving SOTA results. Our findings demonstrate that modeling explicit spatial semantics not only improves navigation accuracy but also enhances the interpretability of navigation agents.
CHAPTER 4 ADDRESSING AMBIGUOUS INSTRUCTIONS AND IMPROVING EXPLAINABILITY
4.1 Introduction
Although grounding methods that connect the textual and visual modalities by aligning semantic information help improve navigation performance [36, 86, 1, 136, 137], we observe that two types of instructions make grounding in the VLN task quite challenging. First, the instruction contains landmarks that are not recognizable by the navigation agent. For example, in Figure 4.1(a), the agent can only see the "sofa", "table", and "chair" in the target viewpoint, based on the learned vision representations [35, 95, 20]. However, the instructor mentions the landmarks "living room" and "kitchen" in the instruction, based on their prior knowledge about the environment, such as relating "sofa" to "living room". Given the small size of the dataset designed for learning navigation, it is hard to expect the agent to gain the same prior knowledge as the instructor. Second, the instructions contain landmarks that can apply to multiple targets, which causes ambiguity for the navigating agent. In Figure 4.1(b), the instruction "enter the door" does not help distinguish the target viewpoint from the other candidate viewpoints, since there are multiple "doors" and "walls" in the visual environment. As a result, we hypothesize that these types of instructions cause explicit and fine-grained grounding to be less effective for the VLN task, as appears in [37, 134], which use sub-instructions, and in [36, 45, 86, 136], which use object-level representations.
Figure 4.1 Instructions that make the grounding in the VLN task challenging. (a) "Reach the entrance between the kitchen and the living room." vs. "Reach the entrance between the sofa and chairs." (b) "Enter the door and turn right passing the wall." vs. "Enter the shutters."
Figure 4.2 VLN Agent with our hint generator.
To address the aforementioned issues, in this chapter, we first introduce a translator module in the VLN agent, named VLN-Trans [138], which takes the given instruction and the visual environment as inputs and converts them into easy-to-follow sub-instructions focusing on two aspects: 1) recognizable landmarks, based on the navigation agent's visual perception ability, and 2) distinctive landmarks, which help the navigation agent distinguish the targeted viewpoint from the other candidate viewpoints.
Consequently, by focusing on these two aspects, the translator can enhance the connections between the given instructions and the agent's observed visual environment and improve the agent's navigation performance. Furthermore, we introduce a hint generator for the VLN agent (NavHint) [135], aiming to generate visual descriptions that serve as indirect supervision to help the navigation agent obtain a better understanding of the visual environment (as depicted in Fig. 4.2). When the agent navigates at each step, the hint generator concurrently produces visual descriptions that are consistent with the agent's action decision. The hints are designed based on the rationale underlying the navigation process and cover three aspects: sub-instruction, landmark ambiguity, and targeted distinctive objects. Specifically, at each navigation step, first, the hint generator encourages the agent to report its navigation progress by specifying which part of the sub-instruction it is executing based on the current visual environment. As depicted in Fig. 4.2, the sub-instruction "walk towards the wall" needs to be executed. Second, the hint generator directs the agent to take a global view of the entire environment and recognize the landmarks mentioned in the instruction from all candidate viewpoints. The agent is tasked with identifying potential challenges by assessing the visibility of the landmarks and comparing the landmarks shared among viewpoints. For instance, in the given example, the landmark "wall" is ambiguous as it appears in multiple views. Third, in scenarios where such challenges exist, the hint generator guides the agent in describing the distinctive visual objects that only appear in the targeted viewpoint, such as the "large window with wooden blinds" in view3 in Fig. 4.2. This aids the agent in looking deeply into the details of its selected viewpoint while globally comparing it to the other candidates. In summary, our contributions are as follows:
1. We propose a translator module that helps the navigation agent generate easy-to-follow sub-instructions considering recognizable and distinctive landmarks based on the agent's visual ability. We construct a high-quality synthetic sub-instruction dataset and design specific tasks for training the translator and the navigation agent.
2. We leverage a language model conditioned on the VLN models to design a hint generator that can be plugged into any VLN agent. This hint generator helps the agent develop a comprehensive understanding of the visual environment. We construct a synthetic hint dataset to provide the agent with visual descriptions at each navigation step. The dataset serves as indirect supervision for jointly training the navigation agent and the hint generator.
3. We evaluate our methods on the R2R and R4R datasets, and they achieve SOTA results on these benchmarks. We also provide a detailed analysis of the agent's grounding ability by analyzing the translator and the quality of the generated hints, thereby improving the interpretability of the agent's decisions.
4.2 VLN-Trans: VLN Agent with a Translator
Fig. 4.3(a) provides an overall picture of our proposed architecture for the navigation agent. We use VLN⟳BERT [40] (in Sec. 2.3.3) as the backbone of our navigation agent and equip it with a novel translator module that is trained to convert the full instruction representation into the most relevant sub-instruction representation based on the current visual environment. Another key point of our method is to create a synthetic sub-instruction dataset and design pre-training tasks that encourage the translator to generate effective sub-instruction representations. We describe the details of our method in the following sections.
Figure 4.3 The overview of the proposed VLN-Trans. (a) Navigation agent with VLN-Trans. (b) The translator architecture. (c) Pre-training the translator. SG: Sub-instruction Generation; DSL: Distinctive Sub-instruction Learning; SS: Sub-instruction Split.
4.2.1 Synthetic Sub-instruction Dataset (SyFiS)
This section introduces our novel approach to automatically generating a synthetic fine-grained sub-instruction dataset, SyFiS, which is used to pre-train the translator (described in Sec. 4.2.2) in a contrastive manner. To this aim, for each viewpoint, we generate one positive sub-instruction and three negative sub-instructions. The viewpoints are taken from the R2R dataset [4], and the sub-instructions are generated based on our designed template. Fig. 4.5 shows an example describing our methodology for constructing the dataset. The sub-instruction template includes two components: a motion indicator and a landmark. For example, in the sub-instruction "turn left to the kitchen", the motion indicator is "turn left", and the landmark is "kitchen". The sub-instruction template is designed based on the semantics of spatial configurations explained in [17].
Motion Indicator Selection. First, we generate the motion indicator for the synthesized sub-instructions. Following [134], we use POS-tagging information to extract the verbs from the instructions in the R2R training dataset and form our motion-indicator dictionary. We divide the motion indicators into six categories: "FORWARD", "LEFT", "RIGHT", "UP", "DOWN", and "STOP". Each category has a set of corresponding verb phrases; we refer the reader to Figure 4.4 for more details about the motion indicator dictionary. Given a viewpoint, to select a motion indicator for each sub-instruction, we calculate the differences between the elevation and heading of the current and the target viewpoints. Based on the orientation difference and a threshold, e.g., 30 degrees, we decide the motion-indicator category. Then we randomly pick a motion verb from the corresponding category to be used in both the generated positive and negative sub-instructions.
Figure 4.4 Motion indicator vocabulary. FORWARD: go forward to, go forward past, pass, walk pass, walk forward, etc. DOWN: go down, walk straight down, walk down, move forward down, etc. UP: go up, walk up, climb, leading upwards, travel up, etc. RIGHT: turn right to, make a right turn to, go right to, veer right to, etc. LEFT: walk left to, turn left to, go left to, make a left turn to, make a left to, etc. STOP: stay at, stand by, stop by, wait by, stop at, wait on, etc.
Landmark Selection. For generating the landmarks for the sub-instructions, we use the candidate viewpoints at each navigation step and select the most recognizable and distinctive landmarks that are easy for the navigation agent to follow.
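The orientation-based motion-indicator selection described above can be sketched as follows, assuming headings and elevations in degrees; the 30-degree threshold follows the example in the text, and the dictionary and function names are illustrative.

```python
# A minimal sketch of motion-indicator selection for SyFiS sub-instructions;
# MOTION_VOCAB holds only a few verbs per category (see Figure 4.4 for the
# full dictionary), and all names here are illustrative.
import random

MOTION_VOCAB = {
    "FORWARD": ["go forward to", "walk forward", "pass"],
    "LEFT": ["turn left to", "make a left turn to"],
    "RIGHT": ["turn right to", "make a right turn to"],
    "UP": ["go up", "walk up"],
    "DOWN": ["go down", "walk down"],
    "STOP": ["stop at", "stand by"],
}

def motion_category(cur_heading, cur_elev, tgt_heading, tgt_elev,
                    threshold=30.0, is_last_step=False):
    """Pick a motion-indicator category from the orientation difference."""
    if is_last_step:
        return "STOP"
    d_elev = tgt_elev - cur_elev
    # Wrap the heading difference into [-180, 180).
    d_head = (tgt_heading - cur_heading + 180.0) % 360.0 - 180.0
    if d_elev > threshold:
        return "UP"
    if d_elev < -threshold:
        return "DOWN"
    if d_head > threshold:
        return "RIGHT"
    if d_head < -threshold:
        return "LEFT"
    return "FORWARD"

def sample_motion_indicator(*args, **kwargs):
    """Randomly pick a verb phrase from the selected category."""
    return random.choice(MOTION_VOCAB[motion_category(*args, **kwargs)])

# e.g. a target roughly 60 degrees to the right at the same elevation:
print(sample_motion_indicator(0.0, 0.0, 60.0, 0.0))  # e.g. "turn right to"
```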
In our approach, the most recognizable landmarks are the objects that can be detected by CLIP. Using CLIP [89], given a viewpoint image, we predict a label token with the prompt "a photo of {label}" from an object label vocabulary. The probability that the image with representation $b$ contains a label $c$ is calculated as follows:
$p(c) = \frac{\exp(sim(b, w_c)/\tau_1)}{\sum_{i=1}^{M} \exp(sim(b, w_i)/\tau_1)}$, (4.1)
where $\tau_1$ is a temperature parameter, $sim$ is the cosine similarity between the image representation and the phrase representation $w_c$, both generated by CLIP [89], and $M$ is the vocabulary size. The top-$k$ objects that have the maximum similarity with the image are selected to form the set of recognizable landmarks for each viewpoint. We then identify the distinctive landmarks among the recognizable landmarks. The distinctive landmarks are the ones that appear in the target viewpoint and not in any other candidate viewpoint. For instance, in the example of Fig. 4.5, "hallway" is a distinctive landmark because it only appears in v1 (the target viewpoint).
Figure 4.5 Illustration of constructing the SyFiS dataset (candidate viewpoints v1 (target), v2, and v3).
Forming Sub-instructions. We use the motion verbs and landmarks to construct sub-instructions based on our template. To form contrastive learning examples, we create positive and negative sub-instructions for each viewpoint. A positive sub-instruction is a sub-instruction that includes a distinctive landmark. The negative sub-instructions include easy negatives and hard negatives. An easy negative sub-instruction contains irrelevant landmarks that appear in any candidate viewpoint except the target viewpoint; e.g., in Fig. 4.5, "bed frame" appears in v3 and is not observed in the target viewpoint. A hard negative sub-instruction includes nondistinctive landmarks that appear in both the target viewpoint and other candidate viewpoints. For example, in Fig. 4.5, "room" can be observed in all candidate viewpoints; therefore, it is difficult to distinguish the target from the other candidate viewpoints based on this landmark.
Statistics of the SyFiS dataset. We construct the SyFiS dataset using 1,076,818 trajectories, where 7,198 trajectories are from the R2R dataset and 1,069,620 trajectories are from the augmented data [34]. We then pair those trajectories with our synthetic instructions to construct the SyFiS dataset based on our pre-defined motion verb vocabulary and CLIP-generated landmarks (Sec. 4.2.1). When we pre-train the translator, we use the sub-instruction of each viewpoint in a trajectory. There are usually 5 to 7 viewpoints in a trajectory; each viewpoint comes with one positive sub-instruction and three negative sub-instructions.
4.2.2 Translator Architecture
The translator takes a set of candidate viewpoints and the corresponding sub-instruction as inputs and generates new sub-instructions. The architecture of our translator is shown in Fig. 4.3(b).
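The CLIP-based landmark scoring of Eq. 4.1 can be sketched as follows, using the Hugging Face CLIP implementation as a stand-in; the checkpoint name, label vocabulary, top-k value, and temperature are illustrative assumptions rather than the exact setup.

```python
# A minimal sketch of scoring candidate landmark labels for a viewpoint image
# with CLIP (Eq. 4.1); checkpoint, labels, k, and temperature are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def recognizable_landmarks(image: Image.Image, labels, k=3, temperature=0.01):
    """Return the top-k labels with probabilities following Eq. 4.1."""
    prompts = [f"a photo of {label}" for label in labels]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between the normalized image and text embeddings.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sim = (img @ txt.T).squeeze(0)                  # (M,)
    probs = torch.softmax(sim / temperature, dim=-1)
    top = probs.topk(k)
    return [(labels[i], p.item()) for i, p in zip(top.indices, top.values)]

# Illustrative usage: distinctive landmarks would then be the labels that are
# selected only for the target viewpoint and for no other candidate viewpoint.
# landmarks = recognizable_landmarks(Image.open("view.jpg"), ["hallway", "room", "bed frame"])
```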
This architecture is similar to the LSTM-based speaker in previous works [106, 25]. However, they generate full instructions from whole trajectories and use them as offline augmented data for training the navigation agent, while our translator adaptively generates sub-instructions during the agent's navigation process based on its observations at each step. Formally, we feed the text representations of the sub-instruction $X$ and the visual representations of the candidate viewpoints $V$ into the corresponding LSTMs to obtain deeper representations $\tilde{X}$ and $\tilde{V}$. Then, we apply soft attention between them to obtain the visually attended text representation $\tilde{X}'$, as:
$\tilde{X}' = \mathrm{SoftAttn}(\tilde{X}; \tilde{V}; \tilde{V}) = \mathrm{softmax}(\tilde{X}^T W \tilde{V})\tilde{V}$, (4.2)
where $W$ are the learned weights. Lastly, we use an MLP layer to generate the sub-instruction $X'$ from the hidden representation $\tilde{X}'$, as follows:
$X' = \mathrm{softmax}(\mathrm{MLP}(\tilde{X}'))$ (4.3)
We use the SyFiS dataset to pre-train this translator. We also design two pre-training tasks: Sub-instruction Generation and Distinctive Sub-instruction Learning.
Sub-instruction Generation (SG). We first train the translator to generate a sub-instruction, given the positive instructions paired with the viewpoints in the SyFiS dataset as the ground truth. We apply a cross-entropy loss between the generated sub-instruction $X'$ and the positive sub-instruction $X_p$. The loss function for the SG task is as follows:
$L_{SG} = -\frac{1}{L}\sum^{L} X_p \log P(X')$ (4.4)
Distinctive Sub-instruction Learning (DSL). To encourage the translator to learn sub-instruction representations that are close to the positive sub-instructions with recognizable and distinctive landmarks, and far from the negative sub-instructions with irrelevant and nondistinctive landmarks, we use a triplet loss to train the translator in a contrastive way. To this aim, we first design triplets of sub-instructions in the form of ⟨anchor, positive, negative⟩. For each viewpoint, we select one positive and three negative sub-instructions, forming three triplets per viewpoint. We obtain the anchor sub-instruction by replacing the motion indicator in the positive sub-instruction with a different motion verb from the same motion indicator category. We denote the text representation of the anchor sub-instruction as $X_a$, the positive sub-instruction as $X_p$, and the negative sub-instruction as $X_n$. Then we feed them to the translator to obtain the corresponding hidden representations $\tilde{X}'_a$, $\tilde{X}'_p$, and $\tilde{X}'_n$ using Eq. 4.2. The triplet loss function for the DSL task is computed as follows:
$L_{DSL} = \max(D(\tilde{X}'_a, \tilde{X}'_p) - D(\tilde{X}'_a, \tilde{X}'_n) + m, 0)$, (4.5)
where $m$ is a margin value to keep negative samples far apart, and $D$ is the pair-wise distance between representations. In summary, the total objective to pre-train the translator is:
$L_{pre\text{-}train} = \alpha_1 L_{SG} + \alpha_2 L_{DSL}$ (4.6)
where $\alpha_1$ and $\alpha_2$ are hyper-parameters balancing the importance of the two losses.
4.2.3 VLN Agent with Translator
We place the pre-trained translator module on top of the backbone navigation agent to perform the navigation task. Fig. 4.3(a) shows the architecture of our navigation agent. At each navigation step, the translator takes the given instruction and the current candidate viewpoints as input and generates new sub-instruction representations, which are then used as additional input to the navigation agent.
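Returning briefly to the translator pre-training, the DSL objective in Eq. 4.5 can be illustrated with PyTorch's built-in triplet margin loss, as in the minimal sketch below; the translator encoder is replaced by a dummy stand-in, and the margin value is an assumption.

```python
# A minimal sketch of the DSL triplet objective (Eq. 4.5); `translator` is a
# stand-in for the soft-attention encoder of Fig. 4.3(b).
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=0.5, p=2)  # D = pairwise L2 distance

def dsl_loss(translator, views, anchor_tokens, positive_tokens, negative_tokens):
    """Compute L_DSL for a batch of <anchor, positive, negative> triplets."""
    # Each call returns the visually attended hidden representation of Eq. 4.2.
    x_a = translator(anchor_tokens, views)
    x_p = translator(positive_tokens, views)
    x_n = translator(negative_tokens, views)
    return triplet_loss(x_a, x_p, x_n)

# Illustrative usage with a dummy translator that mean-pools token embeddings.
dummy = lambda tokens, views: tokens.mean(dim=1)
loss = dsl_loss(dummy, views=None,
                anchor_tokens=torch.randn(3, 8, 256),
                positive_tokens=torch.randn(3, 8, 256),
                negative_tokens=torch.randn(3, 8, 256))
print(loss.item())
```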
Since the given instructions describe the full trajectory, we enable the translator module to focus on the part of the instruction that is in effect at each step. To this aim, we design another MLP layer in the translator to map the hidden states to a scalar attention representation. Then we perform element-wise multiplication between the attention representation and the instruction representation to obtain the attended instruction representation. In summary, we first input the text representation of the given instruction $X$ and the visual representation of the candidate viewpoints $V$ to the translator to obtain the translated sub-instruction representation $\tilde{X}'$ using Eq. 4.2. Then we input $\tilde{X}'$ to another MLP layer to obtain the attention representation $X'_m = \mathrm{MLP}(\tilde{X}')$. We then obtain the attended sub-instruction representation as $X'' = X'_m \odot X$, where $\odot$ is element-wise multiplication. Lastly, we input the text representation $X$ along with the translated sub-instruction representation $\tilde{X}'$ and the attended instruction representation $X''$ into the navigation agent. In such a case, we update the text representation $X$ of VLN⟳BERT as $[X; \tilde{X}'; X'']$, where ; is the concatenation operation.
4.3 NavHint: VLN Agent with a Hint Generator
The hint generator is designed as a Transformer-based decoder that leverages the visual output from the navigation agent to produce corresponding hints. This hint generator can be plugged into any VLN agent as a language model conditioned on the VLN model. To train the hint generator, we propose a synthetic navigation hint dataset based on the Room-to-Room (R2R) [4] dataset. Our dataset provides hints for each step of the trajectories in the R2R dataset. Each hint description includes the sub-instruction, landmark ambiguity, and targeted distinctive objects introduced above. The dataset serves as extra supervision to train the navigation agent and the hint generator jointly. Besides, our constructed dataset can be utilized to explicitly analyze the navigation agent's grounding ability by assessing the quality of the generated hints. In the following sections, we first present our constructed navigation hint dataset and then introduce the hint generator. The navigation hint dataset is used to train the navigation agent and the hint generator jointly.
4.3.1 Navigation Hint Dataset
The purpose of constructing the navigation hint dataset is to provide supervision for the hint generator to produce detailed visual descriptions. The navigation hint dataset is automatically generated based on the instruction and trajectory pairs from the R2R dataset [4]. For every step of the trajectory, we provide hints that mainly include three key elements, as described below.
Sub-instruction is the first part of the hint, which pinpoints the relevant part of the instruction (sub-instruction) to be processed at the current step. We obtain the sub-instructions and their corresponding viewpoints from the FGR2R [37] dataset, which provides human annotations of sub-instructions and the aligned viewpoints.
Figure 4.6 Navigation hint dataset. An example of a navigation hint with the landmark ambiguity of "Missing Landmarks". The sub-instruction is "walk into the hallway", and the landmark "hallway" in the instruction is observed in view1 rather than the target view3, which can potentially mislead the navigation agent. The target distinctive objects, "wooden dining table" and "marble countertop", are then provided. "Blue walls" is non-distinctive as it appears in both view2 and view3.
After obtaining the sub-instruction at each step, we insert it into our hint template, which is "The {sub-instruction} needs to be executed.". Guiding the navigation agent to detect the related sub-instruction at each step is crucial, since it effectively assists the agent in tracking its navigation progress.
Landmark Ambiguity is the second part of the hint, which describes the commonalities across multiple views that can result in ambiguity during navigation. This part of the hint is produced by examining the landmarks mentioned in the instruction that are shared among the candidate viewpoints. To automatically generate this part of the hint for building the dataset, we first use spaCy (https://spacy.io/) to extract noun phrases from the sub-instruction and use them as landmarks. Then, we extract the visual objects in each candidate viewpoint using MiniGPT-4 [149] with two-step textual prompting. We choose visual objects generated by MiniGPT-4 instead of the Matterport3D object annotations because the Matterport3D objects are quite limited, with only 40 object categories such as "doors", "walls", and "floors". These generic objects are not sufficient for resolving landmark ambiguity. Moreover, the absence of attribute annotations in Matterport3D poses a challenge for landmark disambiguation, such as the difference between a "wooden table" and a "glass table". In contrast, MiniGPT-4 can generate such detailed attribute descriptions. Specifically, for each candidate viewpoint, we feed MiniGPT-4 the viewpoint image, asking "Describe the details of the image." and then "List the objects in the image". The generated text is in free form, and we post-process it to retrieve a list of extracted object descriptions. After obtaining the textual landmark names and visual objects, we examine the landmarks shared among the candidate viewpoints. The presence of shared landmarks can pose ambiguity for the navigation agent. We categorize the ambiguity into Target Landmarks, Multiple Landmarks, Missing Landmarks, Invisible Landmarks, and No Landmarks; their descriptions are in Table 4.1.
Table 4.1 Landmark ambiguity. The first and second columns show the categories of landmark ambiguity and the corresponding descriptions; the third column shows the template for generating the hint for each category.
Target Landmarks: landmarks only appear in the target viewpoint. Hint: "The {landmarks} are observed."
Multiple Landmarks: landmarks are visible in multiple viewpoints, including the target viewpoint. Hint: "The {landmarks} are observed in multiple viewpoints."
Missing Landmarks: landmarks are visible in other viewpoints except for the target viewpoint. Hint: "The {landmarks} are misleading."
Invisible Landmarks: landmarks are not visible in any viewpoint. Hint: "The {landmarks} are not observed."
No Landmarks: no landmarks in the sub-instruction (e.g., "make a right turn", "turn left", "go straight"). Hint: ∅.
After identifying the category of landmark ambiguity, we construct this part of the hint using the corresponding template in the third column of Table 4.1. Identifying landmark ambiguity requires the navigation agent to ground the mentioned landmark names in the instruction to the visual objects in all candidate viewpoints. Guiding the navigation agent to identify such detailed ambiguities can help enhance its understanding of the connection between the instruction and the entire visual environment.
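The landmark-ambiguity categorization of Table 4.1 can be sketched as follows. This is a simplified illustration: visibility is decided here by substring matching, whereas the dataset construction relies on MiniGPT-4 object lists, and all names are illustrative.

```python
# A minimal sketch of assigning a landmark-ambiguity category (Table 4.1)
# from the sub-instruction landmarks and per-view object descriptions.

def landmark_ambiguity(landmarks, view_objects, target_view):
    """
    landmarks:    list of landmark phrases from the sub-instruction
    view_objects: dict mapping view id -> list of object descriptions
    target_view:  id of the ground-truth next viewpoint
    returns a (category, hint) pair following the templates in Table 4.1
    """
    if not landmarks:
        return "No Landmarks", ""

    def visible_in(view):
        return [l for l in landmarks
                if any(l.lower() in obj.lower() for obj in view_objects[view])]

    in_target = visible_in(target_view)
    in_others = {v for v in view_objects if v != target_view and visible_in(v)}
    names = ", ".join(landmarks)

    if in_target and not in_others:
        return "Target Landmarks", f"The {names} are observed."
    if in_target and in_others:
        return "Multiple Landmarks", f"The {names} are observed in multiple viewpoints."
    if not in_target and in_others:
        return "Missing Landmarks", f"The {names} are misleading."
    return "Invisible Landmarks", f"The {names} are not observed."

# Illustrative usage, matching the example in Figure 4.6:
views = {"view1": ["hallway", "blue walls"],
         "view2": ["blue walls", "sofa"],
         "view3": ["wooden dining table", "marble countertop", "blue walls"]}
print(landmark_ambiguity(["hallway"], views, target_view="view3"))
# ('Missing Landmarks', 'The hallway are misleading.')
```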
Targeted Distinctive Objects is the third part of the hint, which describes the distinctive visual objects specific to the targeted view. The agent should be able to justify its decision by describing the distinctiveness of the targeted view. We follow the approach for obtaining distinctive objects in VLN-Trans [138], which compares the visual objects in the targeted viewpoint and the other candidate viewpoints. The distinctive objects are the ones that exclusively appear in the targeted viewpoint and do not appear in the other views. The hint template for targeted distinctive objects is "However, {the comma-separated list of distinctive object names} are in the targeted view.". We use at most three distinctive objects. If the case belongs to the "Target Landmarks" category, there is no need to provide extra distinctive objects, since the landmark is already exclusive to the targeted viewpoint. Describing distinctive objects is important for obtaining a global understanding of the visual environment by highlighting the differences between the targeted viewpoint and the other candidate viewpoints.
4.3.2 VLN Agent with a Hint Generator
We propose a hint generator that can be plugged into any navigation agent easily. We use VLN⟳BERT [39] as the base model to illustrate our method, but note that the hint generator is compatible with most current agents. Fig. 4.7 shows the model architecture.
Figure 4.7 Model architecture of NavHint.
Text Encoder. We use BERT [110] to obtain the initial text representation of the instruction, denoted as $X = [x_1, x_2, \cdots, x_l]$.
Vision Encoder. We follow previous works and concatenate image and relative orientation features as the vision features for each candidate viewpoint. Specifically, we extract the image features from ResNet-152 [35] pre-trained on the Places365 dataset [144]. The orientation features are derived from the relative heading, denoted as $\alpha$, and the elevation, denoted as $\beta$, and are represented as $[\sin\alpha; \cos\alpha; \sin\beta; \cos\beta]$. The vision features are then passed through an MLP (multilayer perceptron) in the Vision Encoder to obtain the vision representation for each candidate viewpoint, denoted as $[v_1, v_2, \cdots, v_n]$.
Navigation Agent. VLN⟳BERT is a cross-modal Transformer model. Besides the text and vision representations, a state representation, denoted as $S$, is introduced in the model to store history information recurrently. At the $t$-th navigation step, the text representation $X$, the visual representation $V_t$, and the state representation $S_t$ are input into cross-modal Transformer layers as follows:
$\hat{X}, \hat{S}_t, \hat{V}_t = \mathrm{Cross\_Attn}(X, [S_t; V_t])$, (4.7)
However, unlike the previous work, rather than just using one image as the prefix, we input all images of candidate viewpoints to encourage the hint generator to learn the global relations among views. Formally, we denote the hint at the 𝑖-th navigation step as 𝐶𝑖 = {𝑐𝑖 1 , 𝑐𝑖 2 , · · · , 𝑐𝑖 𝑗 }, where 𝑗 is the length of the hint. Different from LANA [113] that generates route description after navigation, our hint generator provides a more in-depth visual description at each step. Our approach requires the agent to possess a global and deep visual understanding, which can be learnt through the supervision from our navigation hint dataset explained in Section 4.3.1. We obtain the LM representation of the original instruction 𝑊 and the hint 𝐶 as 𝑋′ = {𝑥′ 1 , 𝑥′ 2 , · · · , 𝑥′ 𝑙 } and 𝑐 = {𝑐1, 𝑐2, · · · , 𝑐 𝑗 } respectively. Since the semantic structure of our auto-generated dataset can be easily captured, we use a 1.5B-parameters decoder LM (GPT-2 large) in the hint generator. Note that any larger decoder language model in the GPT series can be employed. We use the instruction text representation 𝑋′ as the instruction prefix representation. We use the weighted vision representations output from the navigation agent as the image prefix representation. The weighted vision representation is obtained using action probability and the contextual vision representations as ˆˆ𝑉𝑡 = 𝑎𝑡 ∗ ˆ𝑉𝑡. Then we simply employ an MLP to map ˆˆ𝑉𝑡 to LM token space. We denote such MLP as 𝐹. We obtain prefix embedding that is mapped from visual representation ˆ𝑉 as 43 Method Env-Drop [106] RelGraph [36] NvEM [1] PREVALENT [34] HAMT (ResNet) [10] HAMT (ViT) [10] CITL [69] ADAPT [70] LOViS [137] VLN⟳BERT [40] VLN⟳BERT+(ours) VLN⟳BERT++ (ours) VLN-Trans-R2R (ours) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 VLN-Trans-FG-R2R (ours) Val seen Test Unseen Val Unseen NE ↓ SR ↑ SPL↑ NE ↓ SR ↑ SPL↑ NE ↓ SR ↑ SPL ↑ 0.47 0.47 3.99 0.52 0.57 3.47 0.54 0.60 3.44 0.51 0.58 3.67 0.64 − − 0.60 0.66 2.51 0.59 0.63 2.65 0.57 0.66 2.70 0.58 0.65 2.40 0.57 0.63 2.90 0.57 0.65 2.72 0.58 0.67 2.51 0.59 0.67 2.40 0.60 0.69 2.45 0.62 0.67 0.69 0.69 0.69 0.76 0.75 0.74 0.77 0.72 0.75 0.77 0.78 0.77 5.23 4.75 4.37 5.30 − 3.93 3.94 4.11 4.07 4.09 4.09 4.02 3.94 3.94 0.59 0.65 0.65 0.65 0.65 0.72 0.70 0.69 0.72 0.68 0.70 0.72 0.73 0.72 0.51 0.55 0.58 0.54 − 0.65 0.64 0.63 0.63 0.63 0.63 0.63 0.65 0.66 5.22 4.73 4.27 4.71 − − 3.87 3.66 3.71 3.93 3.65 3.40 3.37 3.34 0.43 0.53 0.55 0.53 0.58 0.61 0.58 0.59 0.59 0.57 0.60 0.61 0.63 0.63 Table 4.2 Experimental results on R2R Benchmarks in a single-run setting. The best results are in bold font. + means we add RXR [55] and Marky-mT5 dataset [112] as the extra data to pre-train the navigation agent. ++ means we further add the SyFiS dataset to pre-train the navigation agent. ViT means Vision Transformer representations. follows, 𝑝1, · · · , 𝑝𝑘 = 𝐹 ( ˆˆ𝑉𝑡), (4.9) where 𝑘 is the prefix length, and 𝑝 is the image prefix representation. We concatenate the representation of image prefix 𝑝 and instruction prefix 𝑋′, and combine them with the text representation of hint 𝐶. The hint generator only decodes the hint in an auto-regressive manner at each step. During training, the parameters of both of MLP and the LM in the hint generator and the navigator are updated. The training objective is to maximize the likelihood of the next hint token. The following equation shows the loss of generating the 𝑗-th token of the hint at the 𝑖-th step. 
The following equation shows the loss of generating the $j$-th token of the hint at the $i$-th step:
$L_{hint} = -\sum_{i,j} \log p_\theta(c^i_j \mid p^i_1, \cdots, p^i_k, x'_1, \cdots, x'_l, c^i_1, \cdots, c^i_{j-1})$. (4.10)
4.4 Experiments
4.4.1 Experimental Results
Table 4.2 Experimental results on the R2R benchmark in a single-run setting. The best results are in bold font. + means we add the RXR [55] and Marky-mT5 [112] datasets as extra data to pre-train the navigation agent. ++ means we further add the SyFiS dataset to pre-train the navigation agent. ViT means Vision Transformer representations. Methods: 1 Env-Drop [106], 2 RelGraph [36], 3 NvEM [1], 4 PREVALENT [34], 5 HAMT (ResNet) [10], 6 HAMT (ViT) [10], 7 CITL [69], 8 ADAPT [70], 9 LOViS [137], 10 VLN⟳BERT [40], 11 VLN⟳BERT+ (ours), 12 VLN⟳BERT++ (ours), 13 VLN-Trans-R2R (ours), 14 VLN-Trans-FG-R2R (ours). Metrics: Val Seen, Test Unseen, Val Unseen (each NE↓, SR↑, SPL↑). Values: 0.47 0.47 3.99 0.52 0.57 3.47 0.54 0.60 3.44 0.51 0.58 3.67 0.64 − − 0.60 0.66 2.51 0.59 0.63 2.65 0.57 0.66 2.70 0.58 0.65 2.40 0.57 0.63 2.90 0.57 0.65 2.72 0.58 0.67 2.51 0.59 0.67 2.40 0.60 0.69 2.45 0.62 0.67 0.69 0.69 0.69 0.76 0.75 0.74 0.77 0.72 0.75 0.77 0.78 0.77 5.23 4.75 4.37 5.30 − 3.93 3.94 4.11 4.07 4.09 4.09 4.02 3.94 3.94 0.59 0.65 0.65 0.65 0.65 0.72 0.70 0.69 0.72 0.68 0.70 0.72 0.73 0.72 0.51 0.55 0.58 0.54 − 0.65 0.64 0.63 0.63 0.63 0.63 0.63 0.65 0.66 5.22 4.73 4.27 4.71 − − 3.87 3.66 3.71 3.93 3.65 3.40 3.37 3.34 0.43 0.53 0.55 0.53 0.58 0.61 0.58 0.59 0.59 0.57 0.60 0.61 0.63 0.63.
Table 4.2 shows the model performance of VLN-Trans on the R2R benchmark. Rows #4 to #9 are Transformer-based navigation baselines with pre-trained cross-modality representations, and such representations greatly improve over the performance of the LSTM-based VLN models (rows #1 to #3). It is impressive that our VLN-Trans model (rows #13 and #14) performs 2%-3% better than HAMT [10] on both validation seen and unseen, even though HAMT uses the more advanced ViT [20] visual representations compared with ResNet. Our performance on both SR and SPL is also 3%-4% better than the VLN agents using contrastive learning, CITL [69] (row #7) and ADAPT [70] (row #8). LOViS [137] (row #9) is another very recent SOTA method improving the pre-training representations of the navigation agent, but we significantly surpass its performance. Lastly, compared to the baseline (row #10), we first significantly improve the performance (row #11) by using extra augmented data, the Room-across-Room (RXR) dataset [55] and Marky-mT5 [112], in the pre-training of the navigation agent. The performance continues to improve when we further include the SyFiS dataset in the pre-training, as shown in row #12, proving the effectiveness of our synthetic data. Rows #13 and #14 are the experimental results after incorporating our pre-trained translator into the navigation model. First, for a fair comparison with other models, we follow the baseline [40] and train the navigation agent using the R2R [4] dataset and the augmented data from PREVALENT [34]. As shown in row #13, our translator helps the navigation agent obtain the best results in the seen environment and improves SPL by 2% on the unseen validation environment, proving that the generated sub-instruction representation enhances the model's generalizability. However, FG-R2R [37] provides human-annotated alignments between sub-instructions and viewpoints for the R2R dataset, and our SyFiS dataset also provides synthetic sub-instructions for each viewpoint.
Table 4.4 Bleu score for the generated sub-instruction on the R2R dataset. Models: EDrop + Hint. (ours), VLN⟳BERT++ + Hint. (ours). Metrics: Val Seen and Val Unseen (Bleu-1, Bleu-4). Values: 0.60 0.74 0.62 0.76 0.62 0.64 0.72 0.74.
Table 4.5 Ablation study for training tasks for the translator on the R2R dataset (SR↑ and SPL↑ on Val Seen and Val Unseen). Rows: VLN⟳BERT++ (baseline); 1 SG; 2 SG+DSL; 3 SG+DSL+SS. Values per row: 0.767 0.722 0.672 0.611 | 0.764 0.721 0.673 0.623 | 0.780 0.728 0.674 0.627 | 0.772 0.720 0.690 0.633.
Then we conduct another experiment using FG-R2R and SyFiS datasets to train the navigation agent. Simultaneously, we optimize the translator using the alignment information with our designed SG and SS losses during the navigation process. As shown in row #13, we further improve the SR and SPL on the unseen validation environment. This result indicates our designed losses can better utilize the alignment information. Table 4.3 shows the performance of NavHint on validation unseen and test of the R2R dataset. To verify the adaptability of our approach, we evaluate it using both LSTM-based and Transformer-based navigation agents. Since Transformer-based methods are pre-trained on large vision-language datasets and have a more complex model architecture, they achieve a higher performance than LSTM-based methods. For the LSTM-based model, we use EDrop [106] which uses CLIP [89] visual representations without augmented data during training. For the Transformer-based model, we use the VLN⟳BERT++ (row#11) as the baseline. 46 Hints Sub. L-A. TD-Obj. Obj. Method VLN⟳BERT++ 1 2 3 4 5 6 7 ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ SR↑ 0.665 0.671 0.673 0.677 ✔ 0.676 0.674 ✔ 0.681 0.692 Val Unseen SPL↑ nDTW↑ 0.685 0.607 0.690 0.612 0.687 0.613 0.702 0.624 0.698 0.621 0.709 0.614 0.694 0.632 0.724 0.647 Table 4.6 Ablation study for different parts of hint. Sub.:sub-instruction; L-A.:Landmark Ambiguity; TD-Obj: Target Distinctive Objects. Obj:Top-3 objects. Row#1 to row#3 in Table 4.3 show other LSTM-based methods and row#4 to row#8 are the SOTA Transformer-based methods. Row#9 shows the performance of the LSTM baseline EDrop. Row#10 shows the results after equipping the EDrop with our designed hint generator. The improved sDTW and nDTW on the validation unseen proves that the hint generator helps the navigation agent follow the instructions. Moreover, our hint generator on top of the VLN⟳BERT++ (row#12) significantly improves both wayfinding metrics (SP and SPL) and fidelity metrics (sDTW and nDTW) of the baseline model, indicating that our hint generator not only assists the agent in reaching the correct destination but also encourages the agent to follow the original instructions. Improving both LSTM-based and Transformer-based navigation agents shows the generalization ability of the navigation agent with our designed hint generator. We use Bleu score [80] as an evaluation metric to assess whether the navigation agent can identify sub-instruction accurately. We conduct experiments on both LSTM-based and Transformer-based navigation agents, as shown in Table 4.4. The generated sub-instruction from the Transformer-based navigation agent can obtain a relatively high Bleu score compared to the LSTM-based agent. This result demonstrates that a more robust navigation agent achieves a stronger alignment between the instruction and visual modality for identifying the relevant part of the instruction to track the progress. 47 Figure 4.8 Qualitative examples to show how the translator helps the navigation agent. The red boxes and green boxes show the distinctive and the nondistinctive landmarks; the green arrow and red arrow show the target and the predicted viewpoints. 4.4.2 Ablation Study VLN-Trans. In Table 4.5, we show the performance after ablating different tasks in the baseline model on the R2R and R2R-Last datasets. We compared with VLN⟳BERT++, which is our improved baseline after adding extra pre-training data to the navigation agent. 
First, we pre-train our translator with the SG and DSL tasks and incorporate the translator into the navigation agent without further training. For both the R2R and R2R-Last datasets, the SG and DSL pre-training tasks incrementally improve the unseen performance (as shown in method 1 and method 2 for R2R and R2R-Last). Then we evaluate the effectiveness of the SS task when we use it to train the translator together with the navigation agent. For the R2R dataset, the model obtains the best result on the unseen environment after using the SS task. However, the SS task causes a performance drop on the R2R-Last dataset. This is because each R2R-Last example contains only the last sub-instruction, so there are no other sub-instructions our model can identify and learn from. NavHint. Table 4.6 reports the ablation analysis. From row#1 to row#3, we individually include the sub-instruction, landmark ambiguity, and targeted distinctive objects in the hint. All navigation performance metrics improve gradually compared to the baseline. In another experiment (row#4), we describe the visual environment by identifying only the top-3 recognized objects (using MiniGPT-4) in the targeted viewpoint without differentiating them from objects in other viewpoints. The navigation results still improve, indicating that visual descriptions of the objects benefit the overall navigation performance. Row#5 shows that combining the sub-instruction and landmark ambiguity further improves the baseline, particularly on the nDTW metric. In row#6, when we combine the sub-instruction, landmark ambiguity, and top-3 objects, we observe improvement in the goal-related metrics (SR and SPL), but the model’s ability to faithfully follow the instruction is somewhat compromised (lower nDTW). The best result is obtained when we replace the top-3 objects with distinctive ones (row#7), indicating our designed hint’s effectiveness in describing the targeted view from a global perspective. 4.4.3 Translator Analysis Our translator can relate the landmarks mentioned in the instructions to the visible and distinctive landmarks in the visual environment. In Fig. 4.8 (a), “tables” and “chairs” are not visible in the three candidate viewpoints (v1-v3). However, our navigation agent can correctly recognize the target viewpoint using the implicit instruction representations generated by the translator. We assume the most recognizable and distinctive landmark, the “patio” in viewpoint v3, has a higher chance of being connected to a “table” and a “chair” based on our pre-training, compared to the landmarks in the other viewpoints. In Fig. 4.8 (b), both candidate viewpoints v2 and v3 contain a kitchen (green bounding boxes); hence it is hard to distinguish the target between them. However, for the translator, the most distinctive landmark in v3 is the “cupboard”, which is more likely to be related to the “kitchen”. Fig. 4.8 (c) shows a failure case, in which the most distinctive landmark in candidate viewpoint v1 is the “oven”. The translator is more likely to relate “oven” to “kitchen” than “countertop”, and the agent selects the wrong viewpoint. In fact, we
observe that the R2R validation unseen dataset has around 300 instructions containing “kitchen”. For the viewpoints paired with such instructions, our SyFiS dataset generates 23 and 5 sub-instructions containing “oven” and “countertop”, respectively, indicating that the trained translator is more likely to relate “oven” to “kitchen”. Figure 4.9 Accuracy of the landmark ambiguity in generated hints: (a) EDrop + Hint.; (b) VLN⟳BERT++ + Hint.; (c) Correct Sub. 4.4.4 Generated Hints Analysis Landmark Ambiguity Analysis. We assess the accuracy of the four categories of landmark ambiguity in the generated hints. Specifically, we extract the landmark-ambiguity part of the generated hint and check its accuracy against the visual environment. In Fig. 4.9, TOTAL on the y-axis shows the total number of navigation steps that include each ambiguity category, shown on the x-axis, and TRUE (green) indicates the percentage of navigation steps in which the corresponding ambiguity truly exists. We evaluate both LSTM-based and Transformer-based agents, and the results show that Transformer-based agents achieve higher accuracy on landmark ambiguity. We conclude that accurate landmark ambiguity detection is positively correlated with better navigation performance. In Fig. 4.9 (c), we evaluate the generated hints for the examples in which the sub-instruction is generated correctly, as indicated by a Bleu-4 score of 1.0. In those examples, the accuracy of identifying each category of landmark ambiguity is also higher. This result shows that accurately locating the sub-instruction positively impacts landmark ambiguity detection. Targeted Distinctive Objects Analysis. We report the accuracy of identifying the targeted distinctive objects in the generated hints when landmark ambiguity exists, as shown in Fig. 4.10. The generated hints are from the VLN⟳BERT++ model with our designed hint generator. Figure 4.10 Accuracy of the distinctive objects for each landmark ambiguity in the targeted viewpoint: (a) Exact Matching; (b) Object Matching. We provide two types of comparisons, exact phrase matching and object token matching, while the agent performs both wrong and right actions. Exact matching evaluates the detection of the distinctive object tokens and the attribute descriptions in the whole referring phrase. Object matching only evaluates the detection of the distinctive object tokens. The results show that the accuracy of generating distinctive objects is generally higher when the action is correct than when it is wrong. Also, the agent tends to generate distinctive objects that align with its targeted viewpoint, as indicated by an accuracy exceeding 90%, even when the action is incorrect. The lower accuracy of exact matching also aligns with the fact that generating the whole referring expression, including the correct attributes, is more challenging. Generated Hints. Fig. 4.11 demonstrates a few examples of the generated descriptions. The first two examples show successful cases where the agent makes a correct decision. The first example shows the agent can accurately identify the sub-instruction and notice the ambiguous landmark “kitchen”. Then, it correctly pinpoints the distinctive object “stove”, which only appears in the target viewpoint.
In fact, our targeted distinctive object design can help connect specific objects (e.g., stove, refrigerator, counter table) to more general scene objects (e.g., kitchen). Also, the second example shows the agent accurately points out the “table” in the instruction that appears in multiple viewpoints and refers to the “sideboard” in the target viewpoint. The third example shows a failure case in which the agent makes a wrong decision. The sub-instruction is correctly identified, but the agent should turn around towards the counter table and proceed to the sofa rather than walk to the sofa directly. This further indicates that our descriptor pushes the model to focus on landmarks directly and ignore the directions and motions in the instruction. Despite this, our model can generate a description consistent with its selection. Figure 4.11 Qualitative examples for NavHint. The green and orange arrows show the ground truth and the predicted viewpoints, respectively. 4.5 Conclusion In the VLN task, instructions often include landmarks that are not recognizable to the agent or are not distinctive enough to specify the target based on the agent’s visual perception. Our novel idea to solve these issues is to include a translator module in the navigation agent that converts the given instruction representations into effective sub-instruction representations at each navigation step. To train the translator, we construct a synthetic dataset and design pre-training tasks to encourage the translator to generate sub-instructions with the most recognizable and distinctive landmarks. Our method achieves SOTA results on the navigation datasets. We also provide a comprehensive analysis to show the effectiveness of our translator. Furthermore, we enhance the navigation agent with a hint generator that provides explicit explanations for its actions, grounded in its visual perception. During navigation, the agent generates natural language descriptions of its visual environment at each step, including comparing various views and explaining ambiguities in recognizing the target destination. Empirical results show that visual description generation improves both navigation performance and the interpretability of the actions taken by the navigation agent. CHAPTER 5 ADVANCING VLN FOR REAL-WORLD CHALLENGES: NAVIGATION IN CONTINUOUS AND 3D ENVIRONMENTS 5.1 Introduction The ultimate goal of a VLN agent is to be deployed on a real robot for practical, everyday use in realistic environments. However, most existing experimental setups focus on navigation within discrete, graph-based environments, which are often far from real-world scenarios. In practical applications, a VLN agent must operate in a continuous, unstructured environment, where it relies on low-level control commands to navigate instead of predefined waypoints. Moreover, real-world navigation requires a robust understanding of 3D spatial relationships. Unlike simulated environments that often rely on 2D image-based navigation, a real-world VLN agent must perceive and reason about depth, obstacles, and spatial relationships to make action decisions. Bridging the gap between simulation and real-world challenges is crucial for advancing VLN research toward practical deployment.
In this chapter, we address these challenges from two key perspectives: transitioning from discrete to continuous navigation and enhancing spatial reasoning in 3D environments. For VLN-CE (continuous environments), recent research has made significant progress in improving navigation within continuous environments. Current VLN-CE navigation agents are typically equipped with a waypoint predictor, which is primarily trained to focus on high-level viewpoint selection while relying on an offline controller for low-level action execution within the environment. However, this approach overlooks low-level actions as part of the training signal. Consequently, the navigation agent misses the spatial information embedded in low-level actions, which affects the grounding across the modalities of textual instructions, visual images, and physical spatial motions. Furthermore, existing waypoint predictors mainly use raw RGB and depth images, overlooking a thorough exploration of object semantic attributes, which are important for assessing the feasibility of physical actions, such as recognizing that walls are impassable. To address the above-mentioned issues in VLN-CE agents with a waypoint predictor, we first introduce a dual-action module in which the agent selects high-level viewpoints while generating low-level action sequences simultaneously [139]. The high-level actions serve as guidance, facilitating the agent’s understanding of the relationships between low-level actions and the navigable areas indicated by high-level actions. This enhances the agent’s spatial grounding ability to connect action with visual perception and language understanding. Second, to address the issue of the waypoint predictor neglecting object semantic information, we incorporate visual representations with rich object semantics and explicit obstacle masking based on prior knowledge of object passability. Figure 5.1 Illustration of situated scene understanding of Spartun3D-LLM compared to other 3D-based LLMs. For spatial understanding in 3D environments, existing studies mainly focus on integrating various 3D scene representations into LLMs, enabling the models to perform 3D grounding and spatial reasoning through natural language. For example, 3D-LLM [42] utilizes multi-view images to represent 3D scenes, pioneering a new direction in this field, while LEO [46] further pushes the boundary by directly injecting 3D point clouds into LLMs, aiming to develop a generalist embodied agent capable of 3D grounding, embodied reasoning, and action planning. Despite the promising progress, current 3D-based LLMs still fall short in situated understanding, a fundamental capability for completing embodied tasks, such as Embodied Question Answering [18], Vision and Language Navigation [5], robotic manipulation [99], and many others. Situated understanding refers to the ability to interpret and reason about a 3D scene from a dynamic egocentric perspective, where the agent must continuously adjust its understanding based on its changing position and the evolving environment around it.
This capability is crucial because an agent’s reasoning and response to the same question can vary depending on its current situation [83]. For example, as shown in Fig. 5.1, given the same question “What should you do to wash hands?”, the agent might need to answer “use the sink on the right” or “use the sink on the left”, depending on the agent’s current perspective and location relative to the “sink”. To address the aforementioned issues, we propose two key innovations. First, we introduce a scalable, LLM-generated dataset named Spartun3D [141], consisting of approximately 133k examples. Different from the datasets used by previous 3D-based LLMs [153, 46, 42], Spartun3D incorporates various situated spatial information conditioned on the agent’s standing point and orientation within the environment, and consists of two situated tasks: situated captioning and situated QA. Situated captioning is our newly proposed task that requires generating descriptions of the surrounding objects and their spatial directions based on the agent’s situation. Situated QA is designed with different types of questions targeting various levels of spatial reasoning ability for embodied agents. Furthermore, based on Spartun3D, we propose a new 3D-based LLM, Spartun3D-LLM, which is built on the most recent state-of-the-art 3D-based LLM, LEO [46], but integrated with a novel situated spatial alignment module that explicitly aligns 3D visual objects, their attributes, and their spatial relationships to surrounding objects with the corresponding textual descriptions, with the goal of better bridging the gap between the 3D and text spaces. In summary, our contributions are summarized as follows. 1. We introduce a dual-action module for VLN-CE agents to ground high-level visual perception into low-level spatial actions. This design empowers the agent with the flexibility to select high-level viewpoints and generate low-level action sequences. We enhance the waypoint predictor with visual representations containing rich object semantics and explicit prior knowledge about objects’ passability attributes. We adapt our method to several VLN-CE agents. The experimental results show the effectiveness of our approach for the waypoint predictor, as well as for high-level and low-level navigation performance. 2. We propose a method to address the limitation of situated understanding in 3D-based LLMs from two perspectives. We construct an LLM-generated dataset based on our designed situated scene graph. Then, we propose an explicit situated spatial alignment on the 3D-LLM to encourage the model to learn the alignment between 3D objects and their textual representations directly. We provide comprehensive experiments to show that our proposed Spartun3D improves situated understanding on SQA3D and navigation. We also provide analysis to show that our proposed explicit alignment module helps spatial understanding. Figure 5.2 Main Architecture for the VLN-CE Agent. The waypoint predictor first provides navigable viewpoints (green circle). Then, the corresponding RGB images, depth images, and textual instructions are input to our dual-action module. The freeze symbol indicates that no parameters are updated. Please refer to Fig. 5.3 for a detailed architecture of the low-level action decoder. 5.2 Narrowing the Gap between Vision and Action for the VLN-CE Agent We enhance the existing VLN-CE agent by introducing an obstacle-aware waypoint predictor and a dual-action module for navigation.
The waypoint predictor enables the agent to generate waypoints in open areas rather than on obstacles, while the dual-action module allows for flexible execution of both high-level and low-level navigation actions simultaneously. In the following sections, we first introduce the Transformer-based VLN-CE agent, which serves as our baseline. We then detail our improvements, beginning with the obstacle-aware waypoint predictor, followed by the dual-action module. Fig. 5.2 shows the main architecture. 5.2.1 VLN-CE Agent The backbone comprises two primary components: the waypoint predictor and the navigator. The waypoint predictor is trained offline to generate navigable viewpoints, which are provided to the navigator for view selection. Text Encoder. We use BERT [110] to obtain the initial text representation of the instruction w, denoted as X = [x_1, x_2, ..., x_l]. Vision Encoder. In the baseline, different RGB vision encoders are used for the waypoint predictor and the navigator. The waypoint predictor utilizes ResNet-152 [35] as its vision encoder, pre-trained on the ImageNet dataset [96]. Meanwhile, the navigator uses InternVideo [115] as its vision encoder, pre-trained on large video-text datasets. Formally, the obtained visual representations of the 12 RGB images are denoted as v^rgb = {v^rgb_1, v^rgb_2, ..., v^rgb_12}. The depth images are fed into a DD-PPO ResNet-50 [120], which is trained for point-goal navigation, and the resulting representations are denoted as v^d = {v^d_1, v^d_2, ..., v^d_12}. Waypoint Predictor. We follow the waypoint predictor designed in [38], which is a multi-layer Transformer with a non-linear classifier. All 12 RGB image representations v^rgb and depth image representations v^d are concatenated and then input into the waypoint predictor to predict a heatmap of 120 angles by 12 distances. Each angle is 3 degrees, and the distances range from 0.25 to 3.00 meters with an interval of 0.25 meters. The heatmap is represented as a Gaussian distribution with a variance of 1.75m and 15° to expand the prediction range. The waypoint predictor is pre-trained based on the navigable connectivity graph from MP3D [7]. During inference, non-maximum suppression (NMS) is used to sample K neighboring waypoints, which are utilized as the candidate views for the following navigator. Navigator. Our method is designed to be model-agnostic, allowing it to be applied to any VLN-CE navigation agent based on a waypoint predictor. In this paper, we utilize HAMT [10] as the navigator backbone. HAMT is a multimodal Transformer-based navigator that can memorize history information. Specifically, HAMT uses a sequence of panorama images as the navigation history during the navigation trajectory; then it applies Transformers to encode the observations on the trajectory to memorize temporal information. Formally, at each navigation step t, the waypoint predictor generates K candidate views, and their observation representations are denoted as O_t. Also, the history representation is denoted as H_t. HAMT concatenates history and observation as the vision modality and uses a cross-modal transformer to learn the connection between the text representation X and the visual representation [H_t; O_t]. The actions are predicted by selecting the highest similarity score between the observation encoding O_t and the token containing instruction-trajectory information.
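As a rough, self-contained sketch (not the implementation of [38]) of the waypoint-prediction interface described above: the 12 RGB and 12 depth view representations are fused by a small Transformer, decoded into a 120-angle by 12-distance heatmap, and K candidate waypoints are sampled with a simple angular non-maximum suppression. The module sizes, the scoring, and the suppression window are illustrative assumptions.

```python
import torch
import torch.nn as nn


class WaypointPredictorSketch(nn.Module):
    def __init__(self, rgb_dim=512, depth_dim=128, hidden_dim=256, n_layers=2):
        super().__init__()
        self.rgb_proj = nn.Linear(rgb_dim, hidden_dim)
        self.depth_proj = nn.Linear(depth_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Non-linear classifier: each of the 12 views covers 120 / 12 = 10 angle
        # bins, so every view token predicts a 10 x 12 slice of the heatmap.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 10 * 12)
        )

    def forward(self, v_rgb, v_depth):
        # v_rgb: (B, 12, rgb_dim); v_depth: (B, 12, depth_dim)
        x = self.encoder(self.rgb_proj(v_rgb) + self.depth_proj(v_depth))
        heatmap = self.classifier(x).view(x.size(0), 12, 10, 12)
        return heatmap.reshape(x.size(0), 120, 12)  # (B, 120 angle bins, 12 distance bins)


def sample_waypoints(heatmap, k=5):
    """Greedily pick the strongest (angle, distance) cell k times, suppressing a
    +/- 15 degree (5 bin) angular neighbourhood after each pick."""
    scores = heatmap.clone()
    waypoints = []
    for _ in range(k):
        flat_idx = scores.view(scores.size(0), -1).argmax(dim=-1)  # (B,)
        angle_bin, dist_bin = flat_idx // 12, flat_idx % 12
        waypoints.append(torch.stack(
            [angle_bin.float() * 3.0, (dist_bin.float() + 1.0) * 0.25], dim=-1))
        for b in range(scores.size(0)):
            a = int(angle_bin[b])
            scores[b, max(0, a - 5):min(120, a + 6), :] = float("-inf")
    return torch.stack(waypoints, dim=1)  # (B, k, 2): heading in degrees, distance in metres


if __name__ == "__main__":
    model = WaypointPredictorSketch()
    heatmap = model(torch.rand(2, 12, 512), torch.rand(2, 12, 128))
    print(sample_waypoints(heatmap).shape)  # torch.Size([2, 5, 2])
```

In the actual system, the sampled (angle, distance) pairs play the role of the candidate views handed to the navigator.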
5.2.2 Obstacle-Aware Waypoint Predictor Object semantics in the visual environment are expected to play an important role in predicting navigable viewpoints, especially for distinguishing open and obstacle areas. Objects have attributes that determine whether they should be labeled as passable or impassable. For example, the agent is not supposed to traverse beneath a “table” or on a “bed”. However, current methods mainly leverage visual information from RGB and depth images and neglect further exploration of object semantics and their attributes related to passability. To overcome this limitation, we first enhance the current waypoint predictor with vision representations from VLPMs [89, 91, 115], which contain much more comprehensive object semantics than the ResNet in the baseline’s waypoint predictor. We employ vision representations from different VLPMs to assess their influence on the waypoint predictor’s performance; see Table 5.5 for a detailed analysis. Second, we introduce an obstacle mask mechanism based on semantic segmentation of the visual environment and our prior knowledge about impassable objects. We utilize the semantic segmentation provided by MP3D with approximately 40 object categories. To identify open areas, we define a vocabulary of open-area objects, such as “floor”, “stairs”, and “door”, and we mask semantic segments that are not within this vocabulary. Table 5.6 shows how different open-area vocabularies affect the performance of the waypoint predictor. Formally, we denote the visual representations of the 12 panoramic images from the VL pre-trained model as v^rgb_c = {v^rgb_c1, v^rgb_c2, ..., v^rgb_c12}. For each image, we obtain the corresponding obstacle mask based on the semantic segmentation. We assign a label of 1 to object areas in the open vocabulary and 0 otherwise. The resulting obstacle masks are represented as m = {m_1, m_2, ..., m_12}. Subsequently, we apply the obstacle masks to the RGB images and obtain the masked RGB representation v_cm = v_c * m, which is then concatenated with the depth visual representation v^d. This combined representation is input to the waypoint predictor to generate views at each navigation step. We train the waypoint predictor with the enhanced visual representation and obstacle-masked images. Then, we employ it in the navigator for offline usage to generate navigable views. 5.2.3 Dual-Action Prediction for the Navigator High-level Action. The high-level action selects a view based on the similarity between the observation O_t and the hidden states from the cross-modal Transformer, which is represented as follows, p^h_t = Softmax([H_t; O_t] * h^cls_t), (5.1) where h^cls_t is the token representation from the navigator at step t, and p^h_t is the probability of the high-level action. Once the most similar viewpoint is selected, the agent employs an offline controller to navigate to the corresponding position. While high-level actions effectively boost navigation performance, this training mechanism primarily focuses on view selection, neglecting the spatial information in the low-level action sequence. Additionally, it is challenging for a real-world robot to navigate to a precise angle and distance in a realistic environment. Real-world robots typically operate with very limited action sets, such as FORWARD 0.25m.
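To make the obstacle-aware masking of Sec. 5.2.2 above concrete, the sketch below builds a binary open-area mask from a per-pixel semantic segmentation and applies it to the panoramic RGB views before encoding, in the spirit of v_cm = v_c * m. This is only one plausible reading under stated assumptions: the category-id mapping, image sizes, and the choice to mask pixels rather than encoded features are illustrative, not the thesis implementation or the MP3D label set.

```python
import torch

OPEN_AREA_VOCAB = {"floor", "stairs", "door"}        # vocabulary used in our experiments
CATEGORY_NAMES = {0: "wall", 1: "floor", 2: "table", 3: "door", 4: "stairs", 5: "bed"}


def build_obstacle_mask(segmentation: torch.Tensor) -> torch.Tensor:
    """segmentation: (H, W) integer category ids -> (H, W) float mask in {0, 1}."""
    open_ids = [i for i, name in CATEGORY_NAMES.items() if name in OPEN_AREA_VOCAB]
    mask = torch.zeros_like(segmentation, dtype=torch.float32)
    for i in open_ids:
        mask[segmentation == i] = 1.0
    return mask


def mask_panorama(rgb_views: torch.Tensor, seg_views: torch.Tensor) -> torch.Tensor:
    """rgb_views: (12, 3, H, W); seg_views: (12, H, W). Returns masked RGB views."""
    masks = torch.stack([build_obstacle_mask(s) for s in seg_views])   # (12, H, W)
    return rgb_views * masks.unsqueeze(1)                              # broadcast over channels


if __name__ == "__main__":
    rgb = torch.rand(12, 3, 224, 224)
    seg = torch.randint(0, 6, (12, 224, 224))
    print(mask_panorama(rgb, seg).shape)  # torch.Size([12, 3, 224, 224])
```

The masked views would then be encoded and concatenated with the depth representations before being passed to the waypoint predictor.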
Low-level Action Challenges. The previous approach uses a non-linear classifier to predict an action class at each navigation step as follows, p^l_t = Softmax(h^cls_t W_c), (5.2) where W_c projects the token representation to the four low-level actions, and p^l_t is the probability over low-level actions. While low-level actions are closer to real-world robotic behavior, directly modeling the agent to generate such actions results in a costly training process. This is because the episodes for low-level actions are around 10 times longer than those for high-level actions (around 56 low-level action steps, compared to 4-6 steps for high-level episodes). Additionally, the performance drops substantially when training the agent with a low-level action classifier [38, 41]. Figure 5.3 Low-Level Action Decoder. Dual-Actions. We enable the VLN-CE agent to navigate simultaneously using high-level and low-level actions. Built upon the existing VLN-CE agent that predicts high-level actions, we introduce a decoder to simultaneously generate the corresponding low-level action sequence. Specifically, instead of generating one low-level action using an action classifier at each navigation step, our agent is trained to generate a low-level action sequence. We formulate low-level action prediction as a textual sequence generation task. As shown in Fig. 5.3, we introduce a Transformer-based text decoder to generate the low-level action sequences with the prompt “low actions:”. At each navigation step, the agent selects a high-level view and, at the same time, is trained to generate the corresponding sequence of low-level action tokens auto-regressively. We input the hidden-state representation from the navigator and the textual prompt representation to the text decoder. The training objective is to maximize the likelihood of the next low-level action token. The following equation shows the loss of generating the j-th token in the action sequence, L_low = - Σ_{j=1}^{m} log p_θ(a^l_j | a^l_1, ..., a^l_{j-1}), (5.3) where m is the length of the action sequence. We jointly train the low-level action decoder and the high-level selection. The labels for the low-level action sequence are obtained from the heading and distance differences between the initial view and the view selected by the high-level action. When the agent selects a high-level view at each navigation step, the corresponding low-level sequence label is created. Specifically, we first calculate the degree difference in headings and divide it by 15 to determine the number of rotation steps the agent takes. The direction (LEFT or RIGHT) is determined based on the smaller rotation. A similar process is applied to distance: we calculate the distance between the start point and the selected view and divide it by 0.25. We ignore the remainder if the heading or distance is not perfectly divisible. For instance, as shown in Fig. 5.3, the distance between the start point and the target view is 0.33m, and the low-level action is just one FORWARD (0.25m).
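A minimal sketch of this label construction under the quantization just described (15-degree turns, 0.25m forward steps, remainders dropped); the function and token names are assumptions rather than the thesis code.

```python
def low_level_labels(heading_diff_deg: float, distance_m: float) -> list[str]:
    """Convert a heading difference and a distance into a low-level action token sequence."""
    # Normalise the signed heading difference to (-180, 180] so the shorter
    # rotation direction is chosen.
    d = (heading_diff_deg + 180.0) % 360.0 - 180.0
    turn = "RIGHT" if d > 0 else "LEFT"
    n_turns = int(abs(d) // 15)          # 15 degrees per rotation step
    n_forward = int(distance_m // 0.25)  # 0.25 m per forward step
    return [turn] * n_turns + ["FORWARD"] * n_forward


# Example in the spirit of Fig. 5.3: a 55-degree left turn rounds down to three
# 15-degree LEFT steps, and 0.33 m yields a single FORWARD step.
print(low_level_labels(-55.0, 0.33))   # ['LEFT', 'LEFT', 'LEFT', 'FORWARD']
```

These token sequences are the kind of supervision targets used by the text decoder’s auto-regressive loss in Eq. 5.3.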
Figure 5.4 Examples of two tasks in the Spartun3D Benchmark. The green bounding box and arrow in the 3D scene demonstrate the standing point and orientation. 5.3 Spartun3D: Situated Spatial Understanding in 3D World We first introduce a scalable, LLM-generated dataset named Spartun3D, which incorporates various situated spatial information. Furthermore, based on Spartun3D, we propose a new 3D-based LLM, Spartun3D-LLM, which is built on the most recent state-of-the-art 3D-based LLM, LEO [46], but integrated with a novel situated spatial alignment module that explicitly aligns 3D visual objects, their attributes, and their spatial relationships to surrounding objects with the corresponding textual descriptions, with the goal of better bridging the gap between the 3D and text spaces. 5.3.1 Spartun3D Dataset Construction To better equip 3D-based LLMs with the capability of understanding situated 3D scenes, we introduce Spartun3D, a diverse and scalable situated 3D dataset. To ensure the scalability of Spartun3D, we carefully design an automatic pipeline that leverages the strong capabilities of GPT-4o [78], with three key stages: (1) designing diverse situations that specify the agent’s standing point and orientation given a 3D scene as input (Sec. 5.3.1.1); (2) constructing situated scene graphs to describe the spatial relationships between the agent and objects in the environment conditioned on the agent’s situations (Sec. 5.3.1.2); and (3) prompting LLMs to generate the dataset based on the situated scene graphs (Sec. 5.3.1.3). 5.3.1.1 Situation Design The 3D scenes in Spartun3D are taken from 3RScan [121], which provides a diverse set of realistic 3D environments. Given a particular 3D scene with all the objects labeled by humans from 3RScan, our first step is to generate diverse situations for the agent. To construct a situation, we begin by identifying the standing point and orientation and then complete a situation description accordingly using the following template: “You are standing beside {pivot object name}, and there is {referent object name} on the {left/right/front/backward}.” The elements within {} specify the key components that together define the situation. Below, we define the agent’s standing point and orientation and explain how these elements are obtained to construct diverse and reliable situations. Standing Point and Orientation. We begin by determining the agent’s standing point and orientation within the 3D scene. Our approach is to place the agent beside an object, ensuring a clear reference for orientation when interacting with the environment.
Specifically, we project all objects from 3D space onto a 2D plane, using only the x and y coordinates to construct a bird’s-eye view of the scene. From this projected 2D scene, we randomly select an object from the set of segmented objects within the 3D scene. To ensure the constructed situation remains realistic, we exclude objects that are positioned too high, to avoid unnatural situations like “standing beside the lamp on the ceiling”. As a result, we limit the selection to objects whose z-coordinate is below the average height of all objects in the scene. Then, we choose a midpoint from the two sides of the selected object’s bounding box that are closest to the center of the scene, as shown in Fig. 5.5. By prioritizing the side closest to the center, we minimize the risk of placing the agent outside the scene and keep it within the scene’s boundaries. Finally, the selected midpoint is used as the agent’s standing point. In addition, we need to determine the agent’s orientation. We assume the agent’s orientation is always facing forward to the center of the selected object. This guarantees that the selected object remains within the agent’s field of view. Figure 5.5 Standing Point and Orientation Selection. Pivot and Referent Object. Once the agent’s standing point and orientation are determined, we refer to the object that the agent stands beside as the “pivot object”, and the other objects surrounding the pivot object are potential referent objects. A referent object is then randomly selected, and its relative position (left/right/front/backward), with respect to the agent’s standing point and orientation, is used to generate the description of the situation. 5.3.1.2 Situated Scene Graph Construction Building on the agent’s situation, we further construct a situated scene graph that captures the comprehensive spatial relationships between the agent and its surrounding objects. Existing 3D-based LLMs [153, 46] represent scenes in a structured manner using JSON-formatted scene graphs, including detailed scene context of object attributes and relative spatial relationships between objects. However, their spatial relations are based on a global view, such as a bird’s-eye-view perspective (as shown in Fig. 5.6). To enable situated understanding, we introduce a situated scene graph, adapted from the original global scene graph, that captures all relative spatial relationships between the agent’s standing point and the surrounding objects as follows: Rotation Angles. We calculate the rotation angles that reorient the agent from its orientation to the surrounding objects. Specifically, we first calculate the horizontal angle between the standing point and the center of the pivot object. Next, we calculate the horizontal angle between the standing point and a surrounding object. The rotation angle is determined by the difference between these two angles. We further normalize the rotation angles such that larger values correspond to a greater degree of rightward rotation. Figure 5.6 Spatial information in the Situated Scene Graph. The red dot and green arrow show the standing point and orientation, respectively. In this example, the pivot object is the “sofa”, the referent object is the “TV”, and the surrounding objects include the “table” and “cabinet”.
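To make the geometry above concrete, here is a small sketch, under stated assumptions, of the standing-point selection and rotation-angle computation: axis-aligned 2D bounding boxes and a clockwise-positive normalization so that larger angles mean a greater rightward rotation. It is an illustration, not the pipeline used to build Spartun3D; the direction buckets of Fig. 5.6 (b) would be applied on top of these angles.

```python
import math


def standing_point(bbox_min, bbox_max, scene_center):
    """bbox_min/bbox_max: (x, y) corners of the pivot object's 2D bounding box."""
    (x0, y0), (x1, y1) = bbox_min, bbox_max
    side_midpoints = [
        ((x0 + x1) / 2, y0),   # bottom side
        ((x0 + x1) / 2, y1),   # top side
        (x0, (y0 + y1) / 2),   # left side
        (x1, (y0 + y1) / 2),   # right side
    ]
    # Keep the midpoint of the side closest to the scene centre.
    return min(side_midpoints, key=lambda p: math.dist(p, scene_center))


def rotation_angle(stand, pivot_center, other_center):
    """Clockwise angle in [0, 360) that the agent, facing the pivot object's centre,
    must rotate to face `other_center`; larger values mean more rightward rotation."""
    heading_pivot = math.atan2(pivot_center[1] - stand[1], pivot_center[0] - stand[0])
    heading_other = math.atan2(other_center[1] - stand[1], other_center[0] - stand[0])
    return math.degrees(heading_pivot - heading_other) % 360.0


# Toy example: stand beside a 2m x 1m object whose top side faces the scene centre,
# then compute the rotation toward another object placed off to the side.
stand = standing_point((0.0, 0.0), (2.0, 1.0), scene_center=(1.0, 3.0))
print(stand)                                                   # (1.0, 1.0)
print(round(rotation_angle(stand, (1.0, 0.5), (3.0, 1.0)), 1)) # 270.0, i.e. a 90-degree left turn the short way
```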
Direction. We classify each object’s rotation angle relative to the agent into four directional categories according to a predefined standard: [front, right, backward, left] (see Fig. 5.6 (b)). For instance, an object is categorized as “right” if the rotation angle falls within the range of [45, 135] degrees relative to the agent’s forward-facing orientation. Distance. We compute the Euclidean distance between the agent’s standing point and the center of the bounding box of each surrounding object. Passby Objects. We assess whether the agent can move freely from its standing point to other objects. We draw a straight line from the agent’s standing point to the center of the referenced object. If this line intersects any other objects in the scene, those objects are considered “passby objects”. For example, as illustrated in Fig. 5.6 (d), the “table” is a passby object between the agent and the “kitchen cabinet”. We explicitly include the information about passby objects to help the agent build awareness of objects that might influence its path while navigating. After gathering the spatial information described above, we organize it into a JSON format (shown in Fig. 5.6 (e)), which is then used as input to prompt LLMs to generate our datasets. 5.3.1.3 LLMs Prompting We design specific instructions to prompt GPT-4o [78] for two situated tasks: Situated Captioning and Situated QA. For both tasks, we ask GPT-4o to provide responses that take the situated spatial information into account, and examples of the generated dataset are shown in Fig. 5.4. Situated Captioning is our newly introduced task, aiming to generate brief situated descriptions of the surrounding objects as the agent performs a 360° clockwise rotation starting from its standing point and orientation. The motivation for introducing this task stems from its crucial role in embodied tasks, such as navigation, where the agent must interpret and reason about its environment from 360° panoramic views to make decisions about movement and interaction [146]. Therefore, we guide GPT-4o to generate descriptions progressively, starting from lower rotation angles and moving toward higher angles in each direction. Situated QA. We design three types of questions for the Situated QA task, each targeting a different aspect of spatial reasoning for embodied agents. Unlike previous works that rely on a single generic prompt for all question types, we develop tailored prompting strategies for each question type, encouraging the LLM to generate QA pairs focusing on different levels of reasoning. Object Attribute and Relations includes questions about object attributes, such as color, shape, and size, while also incorporating situated spatial information. For instance, questions may ask to identify “the color of the table positioned to the left” or to determine “how many pictures are hanging on the wall to the right”. Object Affordance focuses on the functional utility of objects, often based on commonsense knowledge about how objects are used. Similarly, we require the situated spatial information to be part of the answer.
For example, when asked “Where can you check your appearance?”, the correct answer should be “mirror on your left”, specifying both the object name (mirror) and its spatial location relative to the agent. Situated Planning is the most challenging task, as it requires the agent to perform multi-hop situated spatial reasoning. The agent must not only recognize its surroundings but also plan and execute a series of actions across multiple steps, where each subsequent action depends on the outcome of the previous one. In our dataset, we implement 2-hop reasoning, which requires the agent to perform a sequence of two continuous actions. For example, given the example in Fig. 5.4, “make the room brighter and then sit down to relax.”, the agent needs to first turn left from its orientation to face and move toward the window, open it to brighten the room, and then, based on its new position, continue turning left toward the sofa chair and sit down. 5.3.1.4 Dataset Statistics and Quality Control In total, we collect approximately 10k situated captions and 123k QA pairs. For the object attribute and relation tasks and the affordance tasks, we sample around 10 situations per scene. For the captioning and planning tasks, we sample around 5 situations per scene due to the increasing cost of the longer token sequences required for these tasks. For each task, we split the data instances into a Training and a Test set. Table 5.1 shows the statistics of our dataset.
Tasks          # of Examples   Train/Test
Captioning     ∼10K            8,367/1,350
Attr. & Rel.   ∼62K            61,254/8,168
Affordance     ∼40K            35,070/5,017
Planning       ∼21K            19,434/2,819
Table 5.1 Dataset statistics of Spartun3D and human validation results.
We conduct a comprehensive human evaluation to manually assess the quality of Spartun3D, introducing human scores based on two key criteria: language naturalness, which evaluates whether the text reads as if it were naturally written by a human, and spatial fidelity, which ensures that the data accurately reflects the 3D scene with correct spatial relationships. Each criterion is rated on a scale from 1 to 5, and the average of these two scores is the overall human score. We randomly select 50 examples from each task and compute human scores for the situation, question, and answer, respectively. As shown in Fig. 5.7 (a), the average scores align with the complexity of each task, with relatively lower scores for the captioning and planning tasks. To assess how our generated data compares to human-annotated data, we sample 50 examples from SQA3D and mix them with our dataset. We focus on the human scores of different types of questions, as shown in Fig. 5.7 (b). Figure 5.7 Human Evaluation of Spartun3D: (a) Avg. human score for Spartun3D; (b) Comparison between SQA3D and Spartun3D; (c) Percentage of valid examples generated from two different prompts. We also evaluate how different prompting strategies influence the quality of the data. We experiment with two types of prompts for representing spatial information to prompt GPT-4o: Cord-prompt, which consists of object center coordinates, the standing point, the orientation, and instructions for calculating
distances and rotation angles, and Spa-prompt, consisting of the calculated angles and distances based on the approaches we described in Sec. 5.3.1.2. Fig. 5.7 (c) shows the percentage of examples with high human scores (≥ 4) for each prompt across tasks. The results indicate that Cord-prompt yields unsatisfactory results, revealing that LLMs lack strong 3D spatial reasoning capabilities when interpreting raw spatial coordinates. Our Spa-prompt significantly improves the quality of the generated dataset by providing qualitative spatial relations (e.g., distance, direction). 5.3.2 Model Architecture In addition to enhancing the situated understanding of 3D-based LLMs with Spartun3D, we also propose a new 3D-based LLM, named Spartun3D-LLM, which integrates a novel Situated Spatial Alignment module to strengthen the alignment between the situated 3D visual features and their corresponding textual descriptions. Spartun3D-LLM is built upon LEO [46], which represents the most recent state-of-the-art 3D-based LLM and directly takes 3D point cloud data as input, making it well-suited for spatial reasoning tasks in 3D environments. Fig. 5.8 illustrates the overall architecture of Spartun3D-LLM. Figure 5.8 Spartun3D-LLM Model Architecture. 5.3.2.1 Background Problem Formulation. We formally define the input as a triple <C, S, Q>, where C is the 3D scene context, S is the situation, and Q is a question. The situation S can be further denoted as S = <S_t, S_p, S_r>, where S_t is a textual situation description, and S_p and S_r are the standing point and orientation, respectively. Specifically, S_p is a 3D coordinate in the form <x, y, z>, and S_r is the quaternion <q_x, q_y, q_z, w>, where <q_x, q_y, q_z> is the rotation axis and w is the rotation angle. For simplicity, we define z = 0 to calculate the rotation angle on a 2D plane. The task is to generate a textual answer, denoted as A, given the scene context C, situation S, and question Q. During training, S_p and S_r are provided to the agent to rotate and translate the environment, while during testing, only questions and situations are provided. Backbone. LEO [46] takes text, 2D images (optional), and 3D point clouds as input and formulates comprehensive 3D tasks as autoregressive sequence generation. Specifically, data from different modalities are converted into a sequence of tokens as input to the LLM. The text tokens include system messages (e.g., “You are an AI visual assistant situated in a 3D scene.”), situations, and questions. These tokens are then embedded into vector representations using an embedding look-up table. For 3D point clouds, LEO first applies segmentation masks to extract the point clouds of individual objects in the 3D scenes. Then, the sampled points of each object are input into an object-centric point cloud encoder, PointNet++ [85] pre-trained on ScanNet [16], to obtain the object-level representations. Formally, we denote the representation of the input text tokens as W = [w_1, w_2, ..., w_M] ∈ R^{M×D}, where M denotes the number of input tokens and D represents the dimensionality of each token’s embedding. Additionally, the input object visual representations are expressed as O = [o_1, o_2, ..., o_K] ∈ R^{K×D}, where K is the number of objects extracted from the scene. Finally, the output answer is represented as A = [a_1, a_2, ..., a_N], where N is the number of tokens in the response. The model’s objective is to generate the answer given these combined inputs.
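As a rough illustration of this input assembly (not LEO’s released implementation), the sketch below embeds the text tokens with a lookup table, encodes each segmented object’s point cloud into a single object-level vector with a stand-in point encoder, and concatenates the two sequences into the prefix from which the answer tokens are decoded. The dimensions, vocabulary size, and the MLP "point encoder" are assumptions.

```python
import torch
import torch.nn as nn

D, VOCAB = 256, 32000


class MultimodalPrefixBuilder(nn.Module):
    def __init__(self):
        super().__init__()
        self.word_emb = nn.Embedding(VOCAB, D)        # embedding look-up table
        # Stand-in for an object-centric point-cloud encoder such as PointNet++:
        # it maps each object's (n_points, 3) cloud to a single D-dim vector.
        self.point_encoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, D))

    def forward(self, text_ids, object_clouds):
        # text_ids: (M,) token ids for system message + situation + question
        # object_clouds: (K, n_points, 3) segmented per-object point clouds
        W = self.word_emb(text_ids)                          # (M, D)
        O = self.point_encoder(object_clouds).mean(dim=1)    # (K, D), mean-pooled points
        return torch.cat([W, O], dim=0)                      # (M + K, D) prefix for the LLM


builder = MultimodalPrefixBuilder()
prefix = builder(torch.randint(0, VOCAB, (20,)), torch.rand(8, 1024, 3))
print(prefix.shape)  # torch.Size([28, 256])
```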
The loss for 69 CVL Computer Vision Lab!!!"2Situation: I standbehinda chair facing the table and don't have a monitor on mytable.TokenizerQuestion: What is the first object inmyleft?TokenizerPointNet++SpatialSelf-attentionTextEncoderT1: Standingbesidethelarge window and facing the center of the window…!#!"!!!#!!!""#"!""SituatedSpatialAlignment…………T2:StandingbesideShelf...Infront,therearepictureonthewall…T3: Standingbesidethewhitechair...Behind,thereareshelf…!#…LargeLanguageModelWindow generating the 𝑖-th token of the output answer is formulated as follows: LLM(𝜃) = ∑︁ 𝑖 𝑙𝑜𝑔 𝑝𝜃 (a𝑖 |a𝑖−1, WS, o). (5.4) LEO can integrate various LLM backbones, including OPT1.3B [132] and Vicuna-7B [15]. In our experiments, we fine-tune LEO with different LLM backbones on our proposed dataset via LoRA [44]. 5.3.2.2 Situated Spatial Alignment Module Situated tasks require robust spatial reasoning abilities to comprehend the position, orientation, and spatial relationships of objects within a 3D environment. Existing 3D-based LLMs typically process inputs by concatenating output representations from various modality encoders. While this method facilitates the integration of data across different modalities, it does not inherently ensure that the 3D visual representations encode situated spatial information or effectively align with textual descriptions, which potentially limits the model’s ability to perform tasks that require precise spatial understanding. To tackle this challenge, we introduce a novel Situated Spatial Alignment Module to improve the alignment between the object-centric 3D visual representations and their situated textual descriptions. The process begins by generating detailed situated textual descriptions for each object. Subsequently, an alignment loss is introduced, which directs the model in effectively learning the 3D visual representations based on these situated textual descriptions. Situated Textual Descriptions. For each object, we construct a comprehensive situated textual description based on a template that captures the object’s name, attributes, and spatial relations with nearby objects, as “Stand besides {object name} and facing the center of the {object name}, in front, there are {a list of nearby objects}; on the right, ...; behind ...; and on the left...”. The object’s attributes are also considered (e.g., “white chair”). We consider up to five objects per direction. If no object is present in a specific direction, the description explicitly states this, ensuring to provide complete information about the 3D environment. 3D Object-Text Alignment. Inspired by the success of 2D Visual-Language models, which effectively leverage semantically aligned text and visual features to excel in downstream tasks [89, 63, 62, 115], we aim to enhance the 3D visual representations so that they can better encode the 70 situated spatial information and effectively align with the textual descriptions. Specifically, we introduce a 3D object-text alignment loss to guide the learning process of point cloud encoders within 3D-based LLMs, leveraging the robust language representations captured by pre-trained text encoders. We experiment with various text encoders, and CLIP achieved the best performance. More formally, we obtain the text representations of situated textual description for each object from pre-trained text encoders, denoted as W = [w1, w2, ..., w𝑘 ] ∈ R𝐾×𝐷. For object visual representations O, we employ spatial self-attention layers [11] to learn spatial-aware object representations. 
Specifically, a pairwise spatial feature matrix F ∈ R𝐾×𝐾×5 is introduced to represent relative spatial relations between objects. For example, for each object pairs o𝑖 and o 𝑗 , we construct pairwise spatial feature as f𝑖 𝑗 = [𝑑𝑖 𝑗 , 𝑠𝑖𝑛(𝜃ℎ), 𝑐𝑜𝑠(𝜃ℎ), 𝑠𝑖𝑛(𝜃𝑣), 𝑐𝑜𝑠(𝜃𝑣)] ∈ R1×5, where 𝑑𝑖 𝑗 is Euclidean distance between two objects, and 𝜃ℎ and 𝜃𝑣 are horizontal and vertical angles connecting bounding box centers of o𝑖 and o 𝑗 , respectively. Then, we inject F into the self-attention of the object as, O′ = softmax( QK𝑇 √ 𝑑ℎ + MLP(F))V, where 𝑄 = 𝑊𝑄𝑂; 𝐾 = 𝑊𝐾𝑂, 𝑉 = 𝑊𝑉𝑂, (5.5) where O′ ∈ R𝐾×𝐷 denotes the spatial-aware object representation; Then we use a Mean Squared Error (i.e., MSE) as the objective function to minimize the distance between the object representation O′ and the corresponding situated textual embedding W, denoted as Lalign = MSE(o′, w𝑡). The model is trained to jointly optimize both the alignment loss and the language modeling loss (Eq. 5.4) as L = L𝐿𝑀 + Lalign. 5.4 Experiments for VLN-CE 5.4.1 Experimental Results on High-Level and Low-Level Action Performance We evaluate our method on top of three VLN-CE agents: WP-VLN-BERT [38], WP-HAMT [115], and ETPNav [2]. These VLN-CE agents are all Transformer-based models and employ a waypoint predictor to discretize the visual environment for navigation in the continuous setting. WP-VLN- BERT represents historical information using implicit state representations, whereas WP-HAMT uses explicit panoramic images in the traversed path. ETPNav is a graph-based VLN agent, which 71 Model Waypoint Models [52] CWP-CMA [38] Sim2Sim [53] 1 2 3 4 VLN-BERT+Ego2-Map [41] 5 6 7 8 9 10 WP-VLN-BERT [38] WP-VLN-BERT+Ours WP-HAMT [115] WP-HAMT+Ours ETPNav [2] ETPNav+Ours Validation Unseen Test Unseen nDTW↑ SR↑ SPL↑ SR↑ SPL ↑ 0.30 0.34 0.33 0.36 0.37 0.36 0.41 0.46 0.36 0.39 0.38 0.41 0.45 0.47 0.47 0.49 0.48 0.49 0.48 0.49 0.32 0.38 0.44 0.47 0.42 0.44 0.49 0.52 0.55 0.56 0.36 0.41 0.43 0.52 0.44 0.46 0.52 0.54 0.57 0.58 - 0.55 - 0.60 0.54 0.55 0.60 0.62 - 0.62 Table 5.2 Experimental results on high-level action evaluated on the R2R-CE validation unseen and test dataset. differs slightly from the standard navigation setting. The other two agents can only select local navigable viewpoints connected to the current viewpoint. However, graph-based agents like ETPNav can jump back to the previously explored viewpoints, often resulting in a higher success rate. We integrate our dual-action module and enhanced waypoint predictor into the three baseline backbones introduced above and evaluate navigation performance on both high-level and low-level actions as follows. High-Level Action Performance. Table 5.2 presents the navigation performance using high-level actions on the validation unseen and test unseen sets. All VLN-CE agents in Table 5.2 first employ a waypoint predictor to generate navigable viewpoints and then select a view from these viewpoints. We improve high-level action performance for all baselines by incorporating our obstacle-aware waypoint predictor and dual-action modules. Specifically, we improve WP-VLN-BERT almost 2% on all navigation metrics on both validation and test unseen sets, as shown in row#6. WP-HAMT utilizes visual representation from InternVideo [115] to strengthen the model’s performance. We mainly compare with InternVideo base weights because of the computation cost limitation. In terms of the navigation performance of this baseline, we especilly improve 3% of the success rate on the test unseen (row#8). 
In addition to enhancing the standard Transformer-based navigation agent, our method can also increase the success rate of the graph-based agent ETPNav, as shown in row#10. Low-Level Action performance. Table 5.3 shows the navigation performance with low-level 72 Methods CMA+PM+DA+Aug [54] LAW [92] WS-MGMap [9] CWP-CMA [38] 1 2 3 4 5 VLN-BERT+Ego2-Map [41] 6 7 7 9 10 WP-VLN-BERT [38] WP-VLN-BERT+Ours WP-HAMT* [115] WP-HAMT+Ours ETPNav+Ours nDTW↑ SR ↑ SPL ↑ 0.30 0.32 0.31 0.35 0.34 0.39 0.25 0.27 0.29 0.30 0.22 0.23 0.27 0.28 0.32 0.35 0.38 0.44 0.42 0.48 0.51 0.54 - 0.49 0.52 0.48 0.54 0.54 0.55 0.58 Table 5.3 Experimental results of low-level actions on the R2R-CE validation unseen set. * means our implementation for low-level action prediction, as most VLN-CE agents do not report their low-level performance. We train the VLN agent with a low-level action classifier for fair comparison. movement actions. The existing methods mainly compare the results of the validation unseen for low-level actions. The models in row#1 to row#3 are based on LSTM architecture to frame the navigation as a sequence-to-sequence task and predict low-level actions directly. The models from row#4 to row#6 are models using a waypoint predictor. They add an action classifier to the navigator and train the model to select one low-level action at each navigation step. Another noticeable difference between the models from row#1 to row#3 and others is that the agent in these methods can only observe the current view rather than the whole panoramic view at each navigation step. As shown in Table 5.3, we observe that some Transformer-based navigators with much more powerful pre-trained visual representations and complex model architecture, such as the approaches in row#5 and row#6, their low-level action navigation performance could not compete with LSTM-based models (row#1 to row#3). Specifically, for the baseline model WP-VLN-BERT, although our method can significantly improve it (row#7), it is still far behind LSTM-based models. However, we can achieve SOTA after applying our method on WP-HAMT and ETPNav, as shown in row#9 and row#10, respectively. It is worth noting that the majority of VLN-CE agents do not report their low-level action performance. To ensure a fair comparison, we follow the method of low-level action prediction in WP-VLN-BERT to add a non-linear classifier on top of WP-HAMT [115] to adapt it to low-level action prediction. In general, our method’s performance is aligned with the VLN-CE 73 Method Baseline 1 2 3 4 High Low CLIP Ob-Mask Dual-Action nDTW↑ SR↑ SPL↑ nDTW↑ SR↑ SPL↑ 0.35 0.32 0.43 0.36 ✔ 0.60 0.60 0.61 0.61 0.62 0.52 0.47 0.52 0.47 0.53 0.47 0.53 0.48 0.54 0.49 0.54 0.55 - - 0.55 - - - - 0.44 0.38 ✔ ✔ ✔ ✔ ✔ ✔ Table 5.4 Ablation study on different components of our method. The baseline is WP-HAMT, and the Ob-mask is the obstacle mask. navigator’s performance. This correlation is expected, as the low-level action sequence is trained using the hidden state representation from the corresponding baseline, and stronger representations yield better performance when training low-level actions. 5.4.2 Ablation Study In this section, we conduct an ablation analysis of the waypoint predictor and for waypoint predictor and the effectiveness of different components in our method. Waypoint Predictor Performance. The waypoint predictor is trained offline to generate navigable viewpoints. 
We enhance the baseline waypoint predictor in two aspects: stronger visual representations and explicit object masks. Table 5.5 shows our results on the R2R-CE validation unseen set. The main metrics to evaluate the waypoint predictor's performance are as follows: |Δ| measures the difference between the number of target waypoints and predicted waypoints; %Open is the ratio of predicted waypoints in open space; d_C and d_H are the Chamfer and Hausdorff distances, respectively, which measure the distance between point clouds. We experiment with visual representations from different Vision and Language Pre-trained Models (VLMs) and test their influence on the performance of the waypoint predictor. As shown in Table 5.5, the waypoint predictor achieves the best performance when utilizing CLIP vision representations. However, we cannot conclude that more powerful vision representations lead to better waypoint-predicting performance, since the representations from InternVideo seem to hurt the waypoint predictor. This result suggests that different pre-trained visual encoders possess varying capacities to influence the agent's ability to recognize open and obstacle areas. The best result is achieved when both the CLIP representation and our designed obstacle mask are applied.

Visual Encoder            |Δ|     %Open↑   d_C↓    d_H↓
1  ResNet [38]            1.40    0.80     1.07    2.00
2  InternVideo [115]      1.44    0.65     1.15    2.04
3  DenseCLIP [91]         1.41    0.81     1.05    2.01
4  CLIP [89]              1.38    0.83     1.04    2.00
5  CLIP + Obstacle Mask   1.38    0.85     1.04    1.94
Table 5.5 Evaluation of different visual encoders.

Compared to ResNet (row#1), CLIP improves open-area prediction by about 3%, demonstrating that the rich semantics in CLIP's visual representations aid the waypoint predictor in distinguishing open areas from obstacles. After applying our designed obstacle mask, the accuracy of open-area prediction gains an additional 2%, emphasizing the effectiveness of the prior knowledge in encouraging the waypoint predictor to better focus on open spaces.

Different Components. The results of the ablation study in Table 5.4 demonstrate the influence of each component of our proposed method on both high-level and low-level navigation. The components in our method include the dual-action module for the navigator and the enhanced waypoint predictor with CLIP visual representations and the obstacle mask. The analyzed navigator is WP-HAMT, which uses visual representations obtained from the InternVideo base weights. We report results on the R2R-CE validation unseen dataset. In row#1, we integrate the low-level action decoder with the baseline navigator and jointly train it with high-level actions, and the low-level action navigation performance is significantly improved (about 4% on SPL). In row#2, we train the waypoint predictor with only CLIP representations and apply it to the baseline navigator without dual-action training. Notably, the enhanced waypoint predictor already contributes to better high-level navigation performance, indicating that CLIP's rich object semantic representation boosts overall navigation. Row#3 shows the effectiveness of the obstacle mask in enhancing the SPL for high-level actions. In row#4, we train the waypoint predictor with both CLIP and the obstacle mask, and we apply this enhanced predictor to the navigator with the dual-action module. We achieve the best results in this setting. Compared to row#3, we conclude that the spatial information incorporated into the low-level actions benefits the high-level viewpoint selection.
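For reference, the point-set distances d_C and d_H reported in Table 5.5 (and in Table 5.6 below) compare the predicted and target waypoint sets. A minimal sketch of how these metrics can be computed is given below, assuming waypoints are represented as 2-D offsets around the agent; it is illustrative rather than the benchmark's evaluation script.

```python
# Illustrative computation of |Δ|, Chamfer (d_C), and Hausdorff (d_H) distances
# between a predicted and a target waypoint set; waypoints assumed to be (x, y) points.
import numpy as np

def pairwise_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # a: (N, 2), b: (M, 2) -> (N, M) Euclidean distance matrix.
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def chamfer_distance(pred: np.ndarray, target: np.ndarray) -> float:
    d = pairwise_dist(pred, target)
    # Average nearest-neighbour distance, symmetrized over both directions.
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

def hausdorff_distance(pred: np.ndarray, target: np.ndarray) -> float:
    d = pairwise_dist(pred, target)
    # Worst-case nearest-neighbour distance over both directions.
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def delta_count(pred: np.ndarray, target: np.ndarray) -> int:
    # |Δ|: difference in the number of predicted vs. target waypoints.
    return abs(len(pred) - len(target))
```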
Similarly, compared to row#1, enhanced waypoint prediction not only enhances high-level viewpoint selection but also benefits low-level action generation.

5.4.3 Qualitative Analysis

In this section, we provide a qualitative analysis from the perspectives of low-level actions and obstacle masks.

Low-Level Action Generation. In Fig. 5.9 (1), we show an example of our generated low-level action sequences that lead the agent to the destination. However, we have observed cases where the agent generates a low-level action sequence that reaches the destination but does not fully follow the instruction, as shown in Fig. 5.9 (2). This issue also occurs in other VLN-CE agents when modeling low-level action predictions. We assume the reason lies in the inherent challenges of building the VLN-CE dataset, which transfers the instructions and trajectories from VLN-DE. When the simulator executes the low-level actions, especially rotations, there is no human evaluation process to confirm whether the low-level actions align with the instruction; the actions are generated based on the minimal required rotations once the selected view is given to the simulator. Training the agent to learn low-level actions in these scenarios is more challenging than directly training with view selection.

Figure 5.9 Examples of generated low-level actions. (1) Instruction: "Turn around stairs, and walk towards the living room." Low-level actions: forward, forward, forward, forward, forward. (2) Instruction: "Turn around the table and turn left to the kitchen." Low-level actions: left, left, left, left, forward, forward, forward, forward, forward, forward, forward, forward. 0 denotes the current direction, while − means a LEFT turn. The number represents the rotation degree. The yellow bounding box indicates the target.

Open-Area Vocabulary. In Table 5.6, we provide an analysis of object semantics and their relation to the performance of the waypoint predictor. We select open-area vocabularies based on our prior knowledge. The baseline we use is the waypoint predictor trained with CLIP visual representations. Given the semantic segmentation from the simulated environment, we mask all object areas except the semantic areas in the open-area vocabulary, and then we input the masked image into the waypoint predictor.

Ob-Mask Vocab             |Δ|     %Open↑   d_C↓    d_H↓
   No Mask                1.38    0.83     1.04    2.00
1  Floor                  1.38    0.84     1.07    2.00
2  Stairs                 1.40    0.82     1.04    2.00
3  Doors                  1.40    0.80     1.04    1.94
4  Floor+Stairs+Doors     1.38    0.85     1.04    1.94
Table 5.6 Analysis of the influence of various open-area vocabularies on the waypoint predictor.

The experimental results demonstrate that different object semantics show varying influences on the waypoint predictor. For instance, %Open is low when we mask objects other than "door", indicating the presence of closed or blocked doors (row#3). We obtain the best results when the open-area vocabulary contains "floor", "stairs", and "doors" (row#4).

Qualitative Examples for the Obstacle Mask. In Fig. 5.10, we show an example that demonstrates the different waypoint heatmaps generated from an RGB image with (Fig. 5.10 (2)) and without (Fig. 5.10 (1)) the obstacle mask. The image shows the corresponding views based on the headings of the highlighted areas in the heatmap. It is evident that the waypoint predictor samples more viewpoints (5 viewpoints) from image (a) and image (b) when the obstacle mask is applied, both of which contain large open areas. In contrast, the RGB image without the obstacle mask samples relatively fewer viewpoints from images (a) and (b), but samples viewpoints from (c) and (d), although (c) is not included in the ground truth. This example illustrates that the obstacle mask aids the waypoint predictor in concentrating mainly on large open areas but falls short in narrow open areas.
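The obstacle-mask construction described above, which keeps only the regions whose semantic labels fall in the open-area vocabulary before the image is passed to the waypoint predictor, can be sketched as follows. The label names and the zero-fill value are assumptions, since the exact category names depend on the simulator's semantic annotations.

```python
# Illustrative construction of the obstacle mask from a semantic-segmentation map:
# pixels outside the open-area vocabulary are masked out of the RGB image.
import numpy as np

OPEN_AREA_VOCAB = {"floor", "stairs", "door"}  # assumed label names

def apply_obstacle_mask(rgb: np.ndarray, seg: np.ndarray, id2label: dict) -> np.ndarray:
    # rgb: (H, W, 3) image; seg: (H, W) integer semantic labels from the simulator;
    # id2label: mapping from label id to category name.
    open_ids = [i for i, name in id2label.items() if name in OPEN_AREA_VOCAB]
    keep = np.isin(seg, open_ids)          # True for open-area pixels
    masked = rgb.copy()
    masked[~keep] = 0                      # zero out obstacle regions (assumed fill value)
    return masked
```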
However, based on the final navigation results in Table 5.4, the obstacle mask ultimately contributes to navigation performance.

Figure 5.10 An example of a generated waypoint heatmap given an RGB image with and without the obstacle mask. (1) RGB image; (2) RGB image with obstacle mask.

Models LLMs Attributes and Relations Affordance Situated Planning LEO LEO+Spartun3D B-4 M C 0.00 17.4 100.3 zero-shot 7.0 OPT1.3B 121.3 Vicuna7B 125.4 10.1 10.4 23.0 LEO*+Spartun3D Vicuna7B 129.2 9.2 OPT1.3B 124.1 Vicuna7B 131.2 10.3 Spartun3D-LLM Spartun3D-LLM* Vicuna7B 135.4 10.7 24.9 R 39.1 20.1 45.3 22.1 46.7 48.1 21.0 47.3 24.3 48.8 EM 42.7 47.7 52.1 53.2 49.4 53.7 51.3 56.9 C 13.3 224.6 238.9 211.3 227.2 240.4 254.7 B-4 M 3.00 0.00 24.9 30.6 24.4 32.1 24.6 32.1 26.3 31.4 25.0 32.1 26.7 32.9 R 5.00 53.2 55.0 55.0 54.1 55.3 57.3 S 32.3 66.9 68.3 67.8 68.2 68.7 69.7 B-4 M C 7.00 0.00 0.00 32.1 44.8 229.7 35.2 46.5 242.1 36.2 47.5 247.1 33.2 45.2 232.3 36.4 244.0 47.1 36.2 252.1 47.6 R 15.3 60.9 63.1 65.1 62.1 64.0 65.4 S 59.2 83.8 84.3 85.8 85.4 86.8 88.7
Table 5.7 Experimental Results on Spartun3D Situated QA Tasks. ∗ represents the model initialized with LEO instruction-tuned weights. [Keys: C: CIDEr; B-4: BLEU-4; M: METEOR; R: ROUGE; S: Sentence Similarity; EM: Exact Match; Bold: best results.]

5.5 Experiments for Spartun3D

5.5.1 Experimental Setup

To demonstrate the effectiveness of our proposed Spartun3D-LLM, we conduct experiments on two situated understanding datasets, Spartun3D and SQA3D [74]. For SQA3D, we evaluate under two conditions: object proposals in 3D are derived either from Mask3D [98] or from ground-truth annotations. We also assess the transferability of our method on the navigation task using MP3D ObjNav [97]. Following LEO [46], we report performance using standard generation metrics, including CIDEr, METEOR, BLEU-4, and ROUGE-L, as well as sentence similarity [94] for the captioning task. For SQA3D and the situated QA questions about attributes and relations, we also report exact-match accuracy as an additional metric. We leverage LEO as the baseline. Since the training stage of LEO has covered most of the evaluation tasks, we experiment with models initialized from scratch to ensure a fair comparison in the zero-shot setting. For other settings, we report the performance of models initialized both from scratch and from the instruction-tuned LEO. To distinguish between the two, models initialized from the instruction-tuned LEO are marked with an asterisk (∗).

5.5.2 Experimental Results on Different Tasks

Spartun3D Benchmark. We evaluate the performance of both the LEO model and Spartun3D-LLM after fine-tuning them on our proposed Spartun3D dataset. The fine-tuned LEO model is referred to as LEO+Spartun3D. Table 5.7 and Table 5.8 show the experimental results on the Situated QA and Situated Captioning tasks, respectively. We experiment with two different LLM backbones: OPT1.3B and Vicuna7B. Our experiments show that Spartun3D-LLM consistently outperforms LEO+Spartun3D across all question types (by around 2%-3% on all metrics), regardless of the LLM backbone used, indicating the effectiveness of our explicit alignment module.
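As a concrete reference for the explicit alignment module credited here, the sketch below summarizes the spatial-aware object attention of Eq. 5.5 and the MSE alignment loss L_align introduced earlier in this chapter. Tensor shapes and module names are illustrative assumptions and do not reproduce the released implementation.

```python
# Illustrative sketch of Eq. 5.5: pairwise spatial features bias the object
# self-attention, and the resulting object representations are aligned to the
# situated textual embeddings with an MSE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAwareObjectAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        # Maps the 5-d pairwise spatial feature f_ij to a scalar attention bias.
        self.spatial_mlp = nn.Sequential(nn.Linear(5, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, obj: torch.Tensor, spatial: torch.Tensor) -> torch.Tensor:
        # obj: (K, D) object features; spatial: (K, K, 5) pairwise features
        # [d_ij, sin(theta_h), cos(theta_h), sin(theta_v), cos(theta_v)].
        Q, K_, V = self.q(obj), self.k(obj), self.v(obj)
        scores = Q @ K_.transpose(0, 1) / obj.size(-1) ** 0.5      # (K, K)
        scores = scores + self.spatial_mlp(spatial).squeeze(-1)    # inject spatial bias
        return torch.softmax(scores, dim=-1) @ V                   # (K, D) spatial-aware objects

def alignment_loss(obj_repr: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # L_align = MSE between spatial-aware object representations and situated text embeddings;
    # the total training objective adds this to the language modeling loss.
    return F.mse_loss(obj_repr, text_emb)
```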
We observe that initializing our model with LEO pre-trained weights improves performance. Notably, without fine-tuning, LEO performs reasonably well on attribute and relation questions in a zero-shot setting but struggles with the other situated tasks.

LEO LLMs Models zero-shot LEO+Spartun3D C 0.00 5.9 OPT1.3B Vicuna7B 6.7 LEO+Spartun3D* Vicuna7B 14.1 6.4 OPT1.3B Vicuna7B 8.5 Spartun3D-LLM* Vicuna7B 14.6 Spartun3D-LLM B-4 M 9.00 0.00 17.7 15.3 18.7 15.8 22.6 17.2 18.5 15.7 19.6 16.4 23.3 19.3 R 15.3 31.2 32.3 32.1 31.2 32.5 33.4 S 51.9 67.3 70.4 76.3 68.6 72.5 78.1
Table 5.8 Experimental Results on the Spartun3D Situated Captioning Task.

Methods (metrics C, M, R, EM, given 3D objects from Mask3D [98] and from Ground-truth)
Zero-shot:
1  LEO [46]           Mask3D: 14.2   6.4    8.2    12.4    GT: 15.3   6.7    8.6    13.9
2  LEO+Spartun3D      Mask3D: 82.3   14.2   32.8   34.7    GT: 83.1   15.2   33.7   35.9
3  Spartun3D-LLM      Mask3D: 83.5   15.7   34.7   36.2    GT: 85.6   16.6   35.8   37.1
Fine-tune:
4  3D-Vista [153]     Mask3D: -      -      -      48.5    GT: -      -      -      -
5  3D-LLM [42]        Mask3D: -      -      -      50.2    GT: -      -      -      -
6  LEO [46]           Mask3D: 132.0  33.0   49.2   52.4    GT: 132.3  34.3   51.4   52.5
7  LEO*+Spartun3D     Mask3D: 134.0  34.6   52.2   53.5    GT: 135.3  34.2   52.1   54.2
8  Spartun3D-LLM*     Mask3D: 138.2  35.3   53.4   54.8    GT: 138.3  35.4   53.7   55.0
Table 5.9 Experimental Results on SQA3D given the 3D objects from Mask3D and Ground-truth.

SQA3D Performance. We evaluate our method on the SQA3D dataset, whose scenes are derived from ScanNet [16]. These scenes differ from those in Spartun3D, which are sourced from 3RScan. We experiment with two settings: zero-shot and fine-tuning. In the zero-shot setting, we re-train LEO (row#1) only on the portion of its dataset constructed from 3RScan, excluding all data constructed from ScanNet, to ensure a fair comparison with our method. As shown in Table 5.9, LEO performs poorly on SQA3D in the zero-shot setting, suggesting its limitations in learning situated understanding from its original dataset. In contrast, LEO trained on Spartun3D (row#2) shows significant improvement, demonstrating the effectiveness of our dataset. Further comparisons of Spartun3D-LLM with LEO+Spartun3D demonstrate the better zero-shot (i.e., generalization) capability of our model. In the fine-tuning setting, Spartun3D-LLM continues to outperform LEO across all metrics.

Navigation Performance. To demonstrate the effectiveness of our approach on downstream embodied tasks, we evaluate it on the object navigation task. Specifically, we randomly select 5 scenes that contain around 1000 examples from the MP3D ObjNav dataset. In this task, we additionally input 2D ego-centric images to both LEO and Spartun3D-LLM for comparison. There are four types of navigation actions: turn left, turn right, move forward, and stop. We evaluate whether the model generates the correct action at each step. We conduct the experiment in a zero-shot setting, and Table 5.10 shows the accuracy of the models. The baseline model, LEO, struggles to generate the required action-related text to guide navigation steps without fine-tuning specifically for navigation tasks. In contrast, our model demonstrates strong transferability in generating correct actions. Fig. 5.13 (e) showcases a qualitative example, illustrating how our model effectively generates accurate navigation actions without task-specific fine-tuning.

Zero-shot accuracy (%): LEO 0; Spartun3D-LLM 20.3
Table 5.10 Navigation Performance (Accuracy %).

Methods            Scan2Cap   ScanQA
LEO+Spartun3D      54.2       46.3
Spartun3D-LLM      55.7       48.6
Table 5.11 Spatial Alignment Evaluation on Other Benchmarks. The metric is Sentence Similarity.

5.5.3 Ablation Study and Extra Analysis

Explicit Alignment Enhances General Spatial Understanding.
We evaluate the effectiveness of our proposed situated spatial alignment module on general scene understanding tasks, such as Scan2Cap [14] and ScanQA [6]. In line with our approach for situated tasks, we construct textual descriptions for each object based on its attributes and spatial relations to other objects from a top-view perspective. As shown in Table 5.11, by incorporating the explicit spatial alignment module, our model achieves better results, indicating that the proposed alignment module not only improves situated understanding but also enhances general 3D scene understanding.

Figure 5.11 Scaling Effects.
Figure 5.12 SQA3D Labels Distribution (Ground-truth, LEO, LEO+Spartun3D, Spartun3D-LLM).

Improved Situated Understanding. To further analyze the model's situated understanding ability, we visualize the distribution of model responses generated for SQA3D questions that require strong spatial understanding. Specifically, we extract questions starting with "which direction". Fig. 5.12 illustrates the distribution of generated "directions", including "left", "right", "forward", and "backward". We observe that LEO is biased towards generating "left" 97% of the time. However, the ground-truth distribution of "left" and "right" should be balanced, suggesting that LEO may have a limited understanding of situated spatial relationships. The bias is significantly mitigated when LEO is trained on our dataset (LEO+Spartun3D). While adding our alignment loss (Spartun3D-LLM) helps further, our dataset is the primary factor in addressing the bias.

Scaling Performance. We conduct scaling experiments to demonstrate how model performance improves as more Spartun3D data is added. As shown in Fig. 5.11, we evaluate performance on SQA3D and observe consistent improvement as the dataset scales, highlighting the potential for dataset expansion using our proposed method.

Qualitative Examples. In Fig. 5.13, we showcase several successful examples to demonstrate the effectiveness of Spartun3D-LLM across various situated tasks. Notably, in Fig. 5.13 (c), the model without the explicit alignment module tends to generate more general or vague spatial descriptions, such as "turn around". In contrast, with the alignment module, the model produces more specific details, including terms like "turn slightly right". To verify this, we examine 30 examples from both the situated planning and situated captioning tasks and observe this phenomenon in 17 of them. This highlights how the proposed spatial alignment module enhances the generation of fine-grained spatial information, leading to more precise and contextually accurate outputs.

Figure 5.13 Qualitative Examples. (a), (b), (c) situated QA examples of object attribute and relation, object affordance, and situated planning. (d) situated captioning example. (e) navigation example in a zero-shot setting.

5.6 Conclusion

In this chapter, we bridge the gap between simulation and real-world challenges from two key perspectives: transitioning from discrete to continuous navigation and improving spatial reasoning in 3D environments. For VLN-CE agents, we focus on narrowing the gap between visual perception and action grounding. We introduce a dual-action module that enables the current VLN-CE agent, equipped with a waypoint predictor, to train jointly for both high-level and low-level actions. This joint optimization encourages the agent to learn to ground the high-level visual perception and view selection into physical actions and spatial motions.
Second, we enhance the existing waypoint predictors by incorporating rich object semantic representations and knowledge about object properties. This helps the model consider the feasibility of actions in its navigation decisions. For spatial understanding in the 3D world, we tackle the limitations of 3D-based LLMs in situated understanding from two perspectives. First, we propose a method to construct an LLM-generated dataset based on our designed situated scene graph. Then, we propose an explicit situated spatial alignment module on top of the 3D-LLM to encourage the model to learn the alignment between 3D objects and their textual representations directly. Finally, we provide comprehensive experiments showing that our benchmark improves situated understanding on SQA3D and navigation.

CHAPTER 6
CONCLUSION AND FUTURE WORK

In this thesis, we aim to enhance the language grounding capabilities of VLN agents. These agents interpret natural language instructions and align them with their visual observations to make accurate action decisions. Effective language grounding is essential for improving both navigation performance and the interpretability of the agent's decision-making process. The key contributions of this thesis are summarized as follows.

6.1 Summary of Contributions

Enhancing VLN Agent Grounding via Explicit Modeling of Spatial Semantics. Most VLN agents overlook explicit spatial semantics modeling, relying primarily on implicit representation learning to align semantics across different modalities. While these methods significantly enhance navigation performance, they compromise the interpretability of the agent's decision-making process. To address this challenge, we primarily focus on the spatial semantics of motion-related and landmark-related information, and introduce two neural navigation agents with modular designs tailored to effectively learn these key semantics. The first method segments long instructions into spatial-semantic units, each consisting of motions and landmarks.
We then identify key landmarks based on navigation progress and align them with the most relevant objects in the environment. Additionally, we model spatial relations between the landmark and the agent across both textual and visual modalities. The second method involves designing two independent modules to separately learn orientation (motion) and vision (landmarks). To achieve this, we introduce novel pre-training tasks tailored for orientation learning and visual perception. These tasks enhance the model's ability to understand the different semantics effectively, which are subsequently leveraged within their respective modules to improve overall performance.

Aligning Instructions with Agent Perception and Explaining Decision-Making. Although our explicit grounding methods enhance the navigation agent's language understanding and align the corresponding semantics with the visual environment, we observe ambiguities in instructions, particularly when they contain landmarks that are either unrecognized or indistinctive in the agent's visual perception. Such ambiguities negatively impact the language grounding ability, leading to challenges in improving navigation performance. To address this challenge, we first introduce a translator module to convert the original ambiguous instructions into easy-to-follow sub-instruction representations. Our design encourages the agent to interpret instructions in alignment with its visual perception. However, the translator's design relies on implicit learning, making it challenging to explicitly understand the agent's difficulties in interpreting human instructions. To address this, we further design an explainer module for the VLN agent, utilizing a language model to generate explanations that describe the ambiguity in instructions and the rationale behind the agent's action decisions.

Advancing VLN for Real-World Challenges: Navigation in Continuous and 3D Environments. We advance research on enhancing the applicability of navigation agents to real-world robotic systems from two key perspectives. First, we address navigation in continuous and unstructured environments, where agents must operate using low-level control commands rather than predefined high-level actions. Despite recent progress, existing navigation agents in continuous environments often overlook the crucial role of language grounding, particularly in the execution of low-level actions. To address this gap, we introduce a dual-action-perception module that connects linguistic instructions to the agent's low-level action space. Second, we extend navigation capabilities to 3D environments, where agents must reason about spatial relationships in three-dimensional space rather than relying solely on 2D images. While LLMs have demonstrated strong reasoning capabilities in 3D spatial contexts, they lack the essential ability for situated spatial understanding, a key requirement for navigation tasks. To address this limitation, we propose a scalable, LLM-generated dataset enriched with situated spatial information and introduce a spatial alignment module to improve the correspondence between 3D visual representations and their textual descriptions. Our method significantly enhances the LLM's situated spatial understanding in the 3D world, ultimately improving navigation performance.

6.2 Future Directions

This section highlights several promising avenues for future research that extend our findings and methodologies.
Beyond the work presented in this thesis, we outline potential future directions from the following aspects.

Structured and Interpretable Planning with Foundation Models. Large generative models have demonstrated strong generalization capabilities across various domains. Integrating these models with planning strategies has significantly improved performance on various embodied tasks, such as robotic navigation [146], object manipulation [66], and interactive task execution [116]. However, while foundation models excel at generating natural language outputs, they inherently struggle to produce structured representations such as graphs, decision trees, or task flows. These structured outputs are crucial for downstream tasks, as they facilitate transparent, interpretable, and hierarchically organized reasoning. Therefore, we should focus on equipping large generative models with the ability to generate structured representations, thereby addressing key challenges in decision support for embodied navigation agents. Specifically, developing models that can decompose high-level goals into executable subtasks, represented as hierarchical workflows or decision trees, could significantly improve task planning and execution [129]. Such models can be explored and validated using various embodied benchmarks like ALFRED [100], BEHAVIOR [102], and VirtualHome [84], which offer controlled and diverse environments for evaluating embodied task performance.

Adaptive Usage of Foundation Models. Inspired by [81], which uses LLMs to generate plans by dynamically decomposing complex sub-tasks as needed, we emphasize the importance of leveraging foundation models more strategically in embodied agents. An interesting future direction would be using foundation models as auxiliary systems that provide guidance and decision-making support when necessary. Specifically, future efforts could focus on two key aspects: 1) Difficulty Analysis: developing systematic methods to analyze the challenges faced by the model in specific scenarios, such as ambiguous instructions or descriptions. 2) Query Mechanisms: designing intelligent mechanisms that enable the agent to "ask for help" from the foundation model only when necessary. This includes determining when and how to formulate queries and integrating responses seamlessly into the model's decision-making pipeline.

Compositional Learning. Another interesting direction is compositional learning, where the goal is to decompose complex tasks into smaller, reusable skills [43, 128]. Instead of learning complex embodied tasks directly, the model can be designed to acquire key concepts in both language and vision. Specifically, it converts language into structured representations, such as programming languages, and uses more formal methods to represent visual concepts like objects and their attributes. We can also develop "action concepts" to learn fundamental units of action policies. This line of research holds the potential to bridge high-level reasoning with low-level execution, enabling the model to tackle different tasks with greater adaptability and robustness.

BIBLIOGRAPHY

[1] [2] [3] [4] [5] [6] [7] [8] [9] Dong An, Yuankai Qi, Yan Huang, Qi Wu, Liang Wang, and Tieniu Tan. Neighbor-view enhanced model for vision and language navigation. In Proceedings of the 29th ACM International Conference on Multimedia, pages 5101–5109, 2021. Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang.
Etpnav: Evolving topological planning for vision-language navigation in continuous environments. arXiv preprint arXiv:2304.03047, 2023. Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018. Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018. Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129–19139, 2022. Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV), 2017. Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, and Kwan-Yee K Wong. Mapgpt: Map-guided prompting with adaptive path planning for vision-and-language navigation. arXiv preprint arXiv:2401.07314, 2024. Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas Li, Mingkui Tan, and Chuang Gan. Weakly-supervised multi-granularity map learning for vision-and-language navigation. Advances in Neural Information Processing Systems, 35:38149–38161, 2022. [10] Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, and Ivan Laptev. History aware multimodal transformer for vision-and-language navigation. Advances in Neural Information Processing Systems, 34, 2021. [11] Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Language conditioned spatial relation reasoning for 3d object grounding. Advances in neural information processing systems, 35:20522–20535, 2022. 88 [12] Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Think global, act local: Dual-scale graph transformer for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16537–16547, 2022. [13] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Learning universal image-text representations. 2019. [14] Zhenyu Chen, Ali Gholami, Matthias Nießner, and Angel X Chang. Scan2cap: Context-aware dense captioning in rgb-d scans. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3193–3203, 2021. [15] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. [16] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. 
Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. [17] Soham Dan, Parisa Kordjamshidi, Julia Bonn, Archna Bhatia, Zheng Cai, Martha Palmer, and Dan Roth. From spatial relations to spatial configurations. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5855–5864, Marseille, France, May 2020. European Language Resources Association. [18] Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–10, 2018. [19] Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. Decoupling zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11583–11592, 2022. [20] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. [21] Zi-Yi Dou and Nanyun Peng. Foam: A follower-aware speaker model for vision-and-language navigation. arXiv preprint arXiv:2206.04294, 2022. [22] Yifan Du, Zikang Liu, Junyi Li, and Wayne Xin Zhao. A survey of vision-language pre-trained models. arXiv preprint arXiv:2202.10936, 2022. [23] Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary object detection with vision-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14084–14093, 2022. 89 [24] Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan. A survey of embodied ai: From simulators to research tasks. IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022. [25] Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker- follower models for vision-and-language navigation. In Advances in Neural Information Processing Systems, pages 3314–3325, 2018. [26] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 132(2):581–595, 2024. [27] Georgios Georgakis, Yimeng Li, and Jana Kosecka. Simultaneous mapping and target driven navigation. arXiv preprint arXiv:1911.07980, 2019. [28] Mehdi Ghanimifard and Simon Dobnik. What goes into a word: generating image descriptions with top-down spatial knowledge. In Proceedings of the 12th International Conference on Natural Language Generation, pages 540–551, 2019. [29] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021. [30] Pierre-Louis Guhur, Makarand Tapaswi, Shizhe Chen, Ivan Laptev, and Cordelia Schmid. Airbert: In-domain pretraining for vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1634–1643, 2021. [31] Xiao Guo, Xiaohong Liu, Iacopo Masi, and Xiaoming Liu. Language-guided hierarchical fine-grained image forgery detection and localization. IJCV, 2024. 
[32] Xiao Guo, Xiufeng Song, Yue Zhang, Xiaohong Liu, and Xiaoming Liu. Rethinking vision- language model in face forensics: Multi-modal interpretable forged face detector. arXiv preprint arXiv:2503.20188, 2025. [33] Xiao Guo, Manh Tran, Jiaxin Cheng, and Xiaoming Liu. Dense-face: Personalized face generation model via dense annotation prediction. arXiv preprint arXiv:2412.18149, 2024. [34] Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, and Jianfeng Gao. Towards learning a generic agent for vision-and-language navigation via pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13137–13146, 2020. [35] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [36] Yicong Hong, Cristian Rodriguez, Yuankai Qi, Qi Wu, and Stephen Gould. Language and visual entity relationship graph for agent navigation. Advances in Neural Information Processing Systems, 33:7685–7696, 2020. 90 [37] Yicong Hong, Cristian Rodriguez-Opazo, Qi Wu, and Stephen Gould. Sub-instruction aware vision-and-language navigation. arXiv preprint arXiv:2004.02707, 2020. [38] Yicong Hong, Zun Wang, Qi Wu, and Stephen Gould. Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15439–15449, 2022. [39] Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, and Stephen Gould. A recurrent vision-and-language bert for navigation. arXiv preprint arXiv:2011.13922, 2020. [40] Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, and Stephen Gould. Vln bert: A recurrent vision-and-language bert for navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1643–1653, 2021. [41] Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Dernoncourt, Trung Bui, Stephen Gould, and Hao Tan. Learning navigational visual representations with semantic map supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3055–3067, 2023. [42] Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. Advances in Neural Information Processing Systems, 36:20482–20494, 2023. [43] Joy Hsu, Jiayuan Mao, Josh Tenenbaum, and Jiajun Wu. What’s left? concept grounding with logic-enhanced foundation models. Advances in Neural Information Processing Systems, 36, 2024. [44] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. [45] Ronghang Hu, Daniel Fried, Anna Rohrbach, Dan Klein, Kate Saenko, et al. Are you looking? grounding to multiple modalities in vision-and-language navigation. arXiv preprint arXiv:1906.00347, 2019. [46] Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. arXiv preprint arXiv:2311.12871, 2023. [47] Gabriel Ilharco, Vihan Jain, Alexander Ku, Eugene Ie, and Jason Baldridge. General evaluation for instruction conditioned navigation using dynamic time warping. arXiv preprint arXiv:1907.05446, 2019. 
[48] Vihan Jain, Gabriel Magalhaes, Alexander Ku, Ashish Vaswani, Eugene Ie, and Jason Baldridge. Stay on the path: Instruction fidelity in vision-and-language navigation. arXiv preprint arXiv:1905.12255, 2019. 91 [49] Thomas Kollar, Stefanie Tellex, Deb Roy, and Nicholas Roy. Toward understanding natural language directions. In 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 259–266. IEEE, 2010. [50] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474, 2017. [51] Parisa Kordjamshidi, Marie-Francine Moens, and Martijn van Otterlo. Spatial Role Labeling: In Proceedings of the Seventh conference on Task definition and annotation scheme. International Language Resources and Evaluation (LREC’10), pages 413–420. European Language Resources Association (ELRA), 2010. [52] [53] [54] Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, and Oleksandr Maksymets. Waypoint models for instruction-guided navigation in continuous environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15162–15171, 2021. Jacob Krantz and Stefan Lee. Sim-2-sim transfer for vision-and-language navigation in continuous environments. In European Conference on Computer Vision, pages 588–603. Springer, 2022. Jacob Krantz, Erik Wijmans, Arjun Majundar, Dhruv Batra, and Stefan Lee. Beyond the In European nav-graph: Vision and language navigation in continuous environments. Conference on Computer Vision (ECCV), 2020. [55] Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across- room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. arXiv preprint arXiv:2010.07954, 2020. [56] [57] [58] [59] [60] [61] Jialu Li and Mohit Bansal. Improving vision-and-language navigation by generating future- view image semantics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10803–10812, 2023. Jialu Li and Mohit Bansal. Panogen: Text-conditioned panoramic environment generation for vision-and-language navigation. Advances in Neural Information Processing Systems, 36:21878–21894, 2023. Jialu Li, Aishwarya Padmakumar, Gaurav Sukhatme, and Mohit Bansal. Vln-video: Utilizing driving videos for outdoor vision-and-language navigation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18517–18526, 2024. Jialu Li, Hao Tan, and Mohit Bansal. Improving cross-modal alignment in vision language navigation via syntactic information. arXiv preprint arXiv:2104.09580, 2021. Jialu Li, Hao Tan, and Mohit Bansal. Clear: Improving vision-language navigation with cross-lingual, environment-agnostic representations. arXiv preprint arXiv:2207.02185, 2022. Jialu Li, Hao Tan, and Mohit Bansal. Envedit: Environment editing for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15407–15417, 2022. 92 [62] [63] [64] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023. Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. 
In International conference on machine learning, pages 12888–12900. PMLR, 2022. Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705, 2021. [65] Liunian Harold Li, Mark Yatskar, D Yin, CJ Hsieh, and KW Chang. Visualbert: A simple and performant baseline for vision and language. arxiv 2019. arXiv preprint arXiv:1908.03557, 3, 1908. [66] Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, and Hao Dong. Manipllm: Embodied multimodal large language model for object-centric robotic manipulation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18061–18070, 2023. [67] Xiujun Li, Chunyuan Li, Qiaolin Xia, Yonatan Bisk, Asli Celikyilmaz, Jianfeng Gao, Noah Smith, and Yejin Choi. Robust navigation with language pretraining and stochastic sampling. arXiv preprint arXiv:1909.02244, 2019. [68] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer, 2020. [69] Xiwen Liang, Fengda Zhu, Yi Zhu, Bingqian Lin, Bing Wang, and Xiaodan Liang. Contrastive instruction-trajectory learning for vision-language navigation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 1592–1600, 2022. [70] Bingqian Lin, Yi Zhu, Zicong Chen, Xiwen Liang, Jianzhuang Liu, and Xiaodan Liang. Adapt: Vision-language navigation with modality-aligned action prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15396–15406, 2022. [71] Chong Liu, Fengda Zhu, Xiaojun Chang, Xiaodan Liang, Zongyuan Ge, and Yi-Dong Shen. Vision-language navigation with random environmental mixup. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1644–1654, 2021. [72] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32, 2019. 93 [73] Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. Self-monitoring navigation agent via auxiliary progress estimation. arXiv preprint arXiv:1901.03035, 2019. [74] Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes. arXiv preprint arXiv:2210.07474, 2022. [75] Lina Mezghani, Sainbayar Sukhbaatar, Arthur Szlam, Armand Joulin, and Piotr Bojanowski. Learning to visually navigate in photorealistic environments without any supervision. arXiv preprint arXiv:2004.04954, 2020. [76] Dmytro Mishkin, Alexey Dosovitskiy, and Vladlen Koltun. Benchmarking classic and learned navigation in complex 3d environments. arXiv preprint arXiv:1901.10915, 2019. [77] Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021. [78] OpenAI. Hello gpt-4o, 2024. [79] Bowen Pan, Rameswar Panda, SouYoung Jin, Rogerio Feris, Aude Oliva, Phillip Isola, and Yoon Kim. Langnav: Language as a perceptual representation for navigation. arXiv preprint arXiv:2310.07889, 2023. 
[80] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002. [81] Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot. Adapt: As-needed decomposition and planning with language models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 4226–4252, 2024. [82] Tanawan Premsri and Parisa Kordjamshidi. Neuro-symbolic training for reasoning over spatial language. arXiv preprint arXiv:2406.13828, 2024. [83] Tanawan Premsri and Parisa Kordjamshidi. Forest: Frame of reference evaluation in spatial reasoning tasks. arXiv preprint arXiv:2502.17775, 2025. [84] Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. Virtualhome: Simulating household activities via programs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8494–8502, 2018. [85] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017. [86] Yuankai Qi, Zizheng Pan, Shengping Zhang, Anton van den Hengel, and Qi Wu. Object-and- action aware model for visual language navigation. In European Conference on Computer Vision, pages 303–317. Springer, 2020. 94 [87] Yanyuan Qiao, Qianyi Liu, Jiajun Liu, Jing Liu, and Qi Wu. Llm as copilot for coarse-grained vision-and-language navigation. In European Conference on Computer Vision, pages 459–476. Springer, 2024. [88] Yanyuan Qiao, Yuankai Qi, Yicong Hong, Zheng Yu, Peng Wang, and Qi Wu. Hop: History-and-order aware pre-training for vision-and-language navigation. arXiv preprint arXiv:2203.11591, 2022. [89] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. [90] Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021. [91] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18082–18091, 2022. [92] Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, and Angel X Chang. Language- aligned waypoint (law) supervision for vision-and-language navigation in continuous envi- ronments. arXiv preprint arXiv:2109.15207, 2021. [93] Terry Regier. The human semantic potential: Spatial language and constrained connectionism. MIT Press, 1996. [94] N Reimers. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019. [95] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. 
IEEE transactions on pattern analysis and machine intelligence, 39(6):1137–1149, 2016. [96] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015. [97] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019. 95 [98] Jonas Schult, Francis Engelmann, Alexander Hermans, Or Litany, Siyu Tang, and Bastian Leibe. Mask3d: Mask transformer for 3d semantic instance segmentation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 8216–8223. IEEE, 2023. [99] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. In Conference on robot learning, pages 894–906. PMLR, 2022. [100] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10740–10749, 2020. [101] Xiufeng Song, Xiao Guo, Jiache Zhang, Qirui Li, Lei Bai, Xiaoming Liu, Guangtao Zhai, and Xiaohong Liu. On learning multi-modal forgery representation for diffusion generated video detection. In NeurIPS, 2024. [102] Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, Kent Elliott Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, Karen Liu, et al. Behav- ior: Benchmark for everyday household activities in virtual, interactive, and ecological environments. In Conference on robot learning, pages 477–490. PMLR, 2022. [103] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019. [104] Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, et al. Habitat 2.0: Training home assistants to rearrange their habitat. Advances in neural information processing systems, 34:251–266, 2021. [105] Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019. [106] Hao Tan, Licheng Yu, and Mohit Bansal. Learning to navigate unseen environments: Back translation with environmental dropout. arXiv preprint arXiv:1904.04195, 2019. [107] Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R Walter, Ashis Gopal Banerjee, Seth Teller, and Nicholas Roy. Understanding natural language commands for robotic navigation and mobile manipulation. In Twenty-fifth AAAI conference on artificial intelligence, 2011. [108] Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. Vision-and-dialog navigation. In Conference on Robot Learning, pages 394–406. PMLR, 2020. [109] Yao-Hung Hubert Tsai, Vansh Dhar, Jialu Li, Bowen Zhang, and Jian Zhang. Multimodal large language model for visual navigation. arXiv preprint arXiv:2310.08669, 2023. 96 [110] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 
Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017. [111] Hanqing Wang, Wenguan Wang, Wei Liang, Caiming Xiong, and Jianbing Shen. Structured scene memory for vision-language navigation. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 8455–8464, 2021. [112] Su Wang, Ceslee Montgomery, Jordi Orbay, Vighnesh Birodkar, Aleksandra Faust, Izzeddin Gur, Natasha Jaques, Austin Waters, Jason Baldridge, and Peter Anderson. Less is more: Generating grounded navigation instructions from landmarks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15428–15438, 2022. [113] Xiaohan Wang, Wenguan Wang, Jiayi Shao, and Yi Yang. Lana: A language-capable navigator for instruction following and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19048–19058, 2023. [114] Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6629–6638, 2019. [115] Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022. [116] Zihao Wang, Shaofei Cai, Anji Liu, Xiaojian Ma, and Yitao Liang. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. ArXiv, abs/2302.01560, 2023. [117] Zun Wang, Jialu Li, Yicong Hong, Songze Li, Kunchang Li, Shoubin Yu, Yi Wang, Yu Qiao, Yali Wang, Mohit Bansal, et al. Bootstrapping language-guided navigation learning with self-refining data flywheel. arXiv preprint arXiv:2412.08467, 2024. [118] Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, and Yu Qiao. Scaling data generation in vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12009–12020, 2023. [119] Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, and Mohit Bansal. Dreamrunner: Fine-grained storytelling video generation with retrieval-augmented motion adaptation. arXiv preprint arXiv:2411.16657, 2024. [120] Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, and Dhruv Batra. Dd-ppo: Learning near-perfect pointgoal navigators from 2.5 billion frames. arXiv preprint arXiv:1911.00357, 2019. [121] Shun-Cheng Wu, Johanna Wald, Keisuke Tateno, Nassir Navab, and Federico Tombari. Scene- graphfusion: Incremental 3d scene graph prediction from rgb-d sequences. In Proceedings of 97 the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7515–7525, 2021. [122] Fei Xia, William B Shen, Chengshu Li, Priya Kasimbeg, Micael Edmond Tchapmi, Alexander Toshev, Roberto Martín-Martín, and Silvio Savarese. Interactive gibson benchmark: A benchmark for interactive navigation in cluttered environments. IEEE Robotics and Automation Letters, 5(2):713–720, 2020. [123] Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021. 
[124] Haonan Yu, Xiaochen Lian, Haichao Zhang, and Wei Xu. Guided feature transformation (gft): A neural language grounding module for embodied agents. arXiv preprint arXiv:1805.08329, 2018. [125] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022. [126] Shoubin Yu, Jacob Zhiyuan Fang, Jian Zheng, Gunnar Sigurdsson, Vicente Ordonez, Robinson Piramuthu, and Mohit Bansal. Zero-shot controllable image-to-video animation via motion decomposition. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 3332–3341, 2024. [127] Shoubin Yu, Difan Liu, Ziqiao Ma, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, and Mohit Bansal. Veggie: Instructional editing and reasoning video concepts with grounded generation. arXiv preprint arXiv:2503.14350, 2025. [128] Abhay Zala, Jaemin Cho, Han Lin, Jaehong Yoon, and Mohit Bansal. Envgen: Generating and adapting environments via llms for training embodied agents. COLM, 2024. [129] Eric Zelikman, Qian Huang, Gabriel Poesia, Noah Goodman, and Nick Haber. Parsel: Algorithmic reasoning with language models by composing decompositions. Advances in Neural Information Processing Systems, 36:31466–31523, 2023. [130] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. [131] Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021. [132] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models, 2022. URL https://arxiv. org/abs/2205.01068, 3:19–0, 2023. [133] Yue Zhang, Ben Colman, Xiao Guo, Ali Shahriyari, and Gaurav Bharaj. Common sense reasoning for deepfake detection. In European Conference on Computer Vision, pages 399–415. Springer, 2024. 98 [134] Yue Zhang, Quan Guo, and Parisa Kordjamshidi. Towards navigation by reasoning over spatial configurations. In Proceedings of Second International Combined Workshop on Spatial Language Understanding and Grounded Communication for Robotics, pages 42–52, 2021. [135] Yue Zhang, Quan Guo, and Parisa Kordjamshidi. Navhint: Vision and language navigation agent with a hint generator. In Findings of the Association for Computational Linguistics: EACL 2024, pages 92–103, 2024. [136] Yue Zhang and Parisa Kordjamshidi. Explicit object relation alignment for vision and language navigation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 322–331, 2022. [137] Yue Zhang and Parisa Kordjamshidi. Lovis: Learning orientation and visual signals for vision and language navigation. In Proceedings of the 29th International Conference on Computational Linguistics, pages 5745–5754, 2022. [138] Yue Zhang and Parisa Kordjamshidi. Vln-trans: Translator for the vision and language In The 61st Annual Meeting Of The Association For Computational navigation agent. Linguistics, 2023. [139] Yue Zhang and Parisa Kordjamshidi. Narrowing the gap between vision and action in navigation. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 856–865, 2024. 
[140] Yue Zhang, Ziqiao Ma, Jialu Li, Yanyuan Qiao, Zun Wang, Joyce Chai, Qi Wu, Mohit Bansal, and Parisa Kordjamshidi. Vision-and-language navigation today and tomorrow: A survey in the era of foundation models. Transactions on Machine Learning Research. [141] Yue Zhang, Zhiyang Xu, Ying Shen, Parisa Kordjamshidi, and Lifu Huang. Spartun3d: Situated spatial understanding of 3d world in large language models. arXiv preprint arXiv:2410.03878, 2024. [142] Ming Zhao, Peter Anderson, Vihan Jain, Su Wang, Alexander Ku, Jason Baldridge, and Eugene Ie. On the evaluation of vision-and-language navigation instructions. arXiv preprint arXiv:2101.10504, 2021. [143] Chen Zheng and Parisa Kordjamshidi. Srlgrn: Semantic role labeling graph reasoning network. arXiv preprint arXiv:2010.03604, 2020. [144] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2017. [145] Gengze Zhou, Yicong Hong, and Qi Wu. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. arXiv preprint arXiv:2305.16986, 2023. [146] Gengze Zhou, Yicong Hong, and Qi Wu. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 7641–7649, 2024. 99 [147] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16816–16825, 2022. [148] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022. [149] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. [150] Fengda Zhu, Xiwen Liang, Yi Zhu, Qizhi Yu, Xiaojun Chang, and Xiaodan Liang. Soon: Scenario oriented object navigation with graph-based exploration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12689–12699, 2021. [151] Fengda Zhu, Yi Zhu, Xiaojun Chang, and Xiaodan Liang. Vision-language navigation with self-supervised auxiliary reasoning tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10012–10022, 2020. [152] Wang Zhu, Hexiang Hu, Jiacheng Chen, Zhiwei Deng, Vihan Jain, Eugene Ie, and Fei Sha. Babywalk: Going farther in vision-and-language navigation by taking baby steps. arXiv preprint arXiv:2005.04625, 2020. [153] Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, and Qing Li. 3d-vista: Pre-trained transformer for 3d vision and text alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2911–2921, 2023. 100